IRT - Item Response Theory

akdhamija · 136 slides · Sep 09, 2009



Slide Content

The Basics of IRT
by
AK Dhamija
© ([email protected])
www.geocities.com/a_k_dhamija

Coverage
•What is IRT?
•The Item Characteristic Curve
•Item Characteristic Curve Models
•Estimating Item Parameters
•The Test Characteristic Curve
•Estimating an Examinee's Ability
•The Information Function
•Test Calibration
•Characteristics of a Test
•Computer Adaptive Test

What is IRT?
•Classical theory
•Dependence of item statistics on the sample of respondents
•Dependence of respondents' scores on the choice of items
•Assumes equal errors of measurement at all levels of ability
•No modeling of data at the item level
•Items and respondents on different scales
•Difficult to compare scores across two different tests because they are not on the same scale

What is IRT?
•IRT
•Links observable respondent performance to unobservable traits
•Theory is general
•One or more abilities or traits
•Various assumptions / models
•Binary / polytomous data
•At the specific-model level, fit can be addressed (DIF/DTF; logistic / multidimensional models)

What is IRT?
•Specific IRT model assumptions
•Dominant first factor (multidimensional models exist too)
•No dependency between items
•Mathematical form of the ICC linking performance on items to the trait measured by the instrument

What is IRT?
•IRT benefits
•Item parameter estimation is independent of respondent samples
•Trait estimation is independent of the particular choice of items (an invaluable property for CAT)
•Error of measurement for each respondent
•Item-level modeling allows for "optimal assessments"
•Items and respondents calibrated on the same scale

What is IRT?
•IRT limitations
•Practitioners lack expertise
•IRT software is not straightforward
•Large samples are needed for good estimation
•Doesn't address construct definition / domain convergence

What is IRT?
•Many choices of IRT models
•1PL, 2PL, 3PL and Normal Ogive models (0-1 data)
•Partial Credit, Generalized Partial Credit, Graded Response, Nominal Response models (polytomous data)
•Multidimensional logistic models (0-1 data)
•Estimation
•Marginal MLE, Bayesian, etc.
•Software
•Bilog-MG, Parscale, Multilog, ConQuest, Winsteps

The Item Characteristic Curve
•Unobservable, or latent, trait
•Generic term 'ability' in IRT
•Interval scale of ability
•Theoretical values [-∞, +∞]
•Practical values [-3, +3]
•Ideally free-response items
•Difficult to score reliably

The Item Characteristic Curve
•Data preparation
•Raw data recoded
•Dichotomously scored items (for polytomous data: Samejima's Graded Response model)
•Investigating dimensionality
•Examine the eigenvalues following Principal Axis Factoring (PAF), looking for a dominant first factor
•If the data are dichotomous, factor analyze tetrachoric correlations
•Assume a continuum underlies item responses

The Item Characteristic Curve
•Classical test theory
•Raw score is the sum of the item scores
•IRT
•Emphasis on individual items
•Assumption: ability score (θ)
•P(θ) is the probability of giving a correct answer

The Item Characteristic Curve
•P(θ) is small for low-ability examinees and vice versa
•A smooth S-shaped curve
•The item characteristic curve (ICC)
•A basic building block of IRT
•Two technical properties
•Item difficulty
•Item discrimination

The Item Characteristic Curve
•Item difficulty
•A location index
•Three ICCs with the same discrimination but different levels of difficulty

The Item Characteristic Curve
•Item discrimination
•Slope at θ = 0
•Differentiates between abilities below the item location and those above the item location
•Steepness of the ICC in its middle section
•The steeper the curve, the better the item can discriminate

The Item Characteristic Curve
•Caution
•These two properties say nothing about the validity of an item; they simply describe the form of the ICC
•Figures only show the range -3 to +3
•All ICCs become asymptotic to a probability of zero at one tail and to 1.0 at the other tail

The Item Characteristic Curve
•Item with perfect discrimination
•The item discriminates perfectly between those above and below an ability score of 1.5
•This item is useless for discriminating at other ability levels

The Item Characteristic Curve
•Difficulty will have the following levels: very easy, easy, medium, hard, very hard
•Discrimination will have the following levels: none, low, moderate, high, perfect


The Item Characteristic Curve
Recap
1. When the item discrimination is less than moderate, the item characteristic curve is nearly linear and appears rather flat.
2. When discrimination is greater than moderate, the item characteristic curve is S-shaped and rather steep in its middle section.
3. When the item difficulty is less than medium, most of the item characteristic curve has a probability of correct response greater than .5.
4. When the item difficulty is greater than medium, most of the item characteristic curve has a probability of correct response less than .5.

The Item Characteristic Curve
Recap
5. Regardless of the level of discrimination, item difficulty locates the item along the ability scale. Therefore item difficulty and discrimination are independent of each other.
6. When an item has no discrimination, all choices of difficulty yield the same horizontal line at a value of P(θ) = .5, because the item difficulty for an item with no discrimination is undefined.

Item Characteristic Curve Models
•Enough of being intuitive; let's see the rigor needed by a theory
•Three mathematical models for the ICC
•The logistic function: two-parameter model (2PL)

P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-a(θ-b)))

where:
e is the constant 2.718
b is the difficulty parameter
a is the discrimination parameter
L = a(θ-b) is the logistic deviate (logit), and
θ is an ability level.
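As a quick illustration, here is a minimal Python sketch of the 2PL curve; the function name p_2pl is my own, not from the slides.

import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    logit = a * (theta - b)              # L = a(theta - b)
    return 1.0 / (1.0 + math.exp(-logit))

# At theta = b the logit is 0, so p_2pl(1.0, 0.5, 1.0) returns 0.5.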

Item Characteristic Curve Models
•The difficulty parameter (b) is defined as the point on the ability scale at which the probability of correct response to the item is .5
•Range of b: theoretically [-∞, +∞], practically [-3, +3]
•The discrimination parameter is proportional to the slope of the ICC at θ = b
•The actual slope at θ = b is a/4, but taking it as 'a' makes interpretation easier
•Range of a: theoretically [-∞, +∞], practically [-2.80, +2.80]
•The Normal Ogive model has a different interpretation

Item Characteristic Curve Models
•Computational example: b = 1.0, a = .5

θ    Logit  exp(-L)  1+exp(-L)  P(θ)
-3   -2     7.389    8.389      .12
-2   -1.5   4.482    5.482      .18
-1   -1     2.718    3.718      .27
 0   -.5    1.649    2.649      .38
 1    0     1        2          .5
 2    .5    .607     1.607      .62
 3    1     .368     1.368      .73
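The table can be reproduced with a short Python loop (a sketch; variable names are my own):

import math

# 2PL computational example from the slide: b = 1.0, a = 0.5
a, b = 0.5, 1.0
for theta in range(-3, 4):
    L = a * (theta - b)                 # logit
    p = 1.0 / (1.0 + math.exp(-L))      # P(theta)
    print(f"{theta:3d} {L:5.1f} {math.exp(-L):7.3f} {1 + math.exp(-L):7.3f} {p:4.2f}")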

Item Characteristic Curve Models
•The Rasch, or one-parameter, logistic model (1PL)
•a = 1.0 for all items

P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-(θ-b)))

•Birnbaum (1968): three-parameter model (3PL)
•The probability of correct response includes a small component that is due to guessing

P(θ) = c + (1-c) / (1 + e^(-a(θ-b)))
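A minimal sketch of both models in Python (function names are my own); the 1PL is the 2PL with a = 1, and the 3PL adds the guessing floor c:

import math

def p_3pl(theta, a, b, c):
    """3PL: guessing floor c plus (1 - c) times the logistic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_1pl(theta, b):
    """Rasch/1PL: the 3PL with a = 1 and c = 0."""
    return p_3pl(theta, 1.0, b, 0.0)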

Item Characteristic Curve ModelsItem Characteristic Curve Models
•c is guessing parameter: the probability of getting the item correct by
guessing alone
•value ofcdoes not vary as a function of the ability level
•Range :Theoretically[0,1]Practically [0,.35]
•Definition of the difficulty parameter is changed in 3 PL
•cdefines a floor toP(θ)
•P(θ) = (1+c)/2

Item Characteristic Curve Models
•Discrimination parameter a: the slope of the item characteristic curve at θ = b is a(1-c)/4
•It can still be taken as proportional to a

Item Characteristic Curve Models
•Negative discrimination
•P(θ) decreases as θ increases
•Can occur in two ways:
•The incorrect response to a two-choice item will have a negative discrimination if the correct response has a positive one
•Something is wrong with the item
•The two curves have the same b and |a|

Item Characteristic Curve Models
•Interpreting item parameter values
•a (divide by 1.7 to interpret in the normal ogive model):

Verbal label   Range of values
none           0
very low       .01 - .34
low            .35 - .64
moderate       .65 - 1.34
high           1.35 - 1.69
very high      > 1.70
perfect        +∞

•b: a difficult job, since easy and hard are relative terms

Item Characteristic Curve ModelsItem Characteristic Curve Models
•In classical test theory, ‘b’ was defined relative to a group of
examinees
•Same item could be easy for one group and hard for another group.
•Under IRT, ‘b’ isθwhereP(θ)is .5 for 1-PL & 2-PL models and (1 +c)/2
for a 3-PL model.
cis interpreted directly as a probability.
c= .12 means that at allθ,P(θ)by guessing alone is .12.

Item Characteristic Curve ModelsItem Characteristic Curve Models
•The verbal labels reflect only midpoint of scale
•item difficulty tells where the item functions on the ability scale
•The slope of the ICC (‘a’) is at maximum at an ability level corresponding
to the item difficulty
•The item does best in distinguishing between examinees in the
neighborhood of its ability level
•An item whose difficulty is-1 functions among the lower ability
examinees
•A value of +1 denotes an item that functions among higher ability
examinees
•So ability is a location parameter

Item Characteristic Curve Models
RECAP
1. Under the 1PL model, the slope is always the same; only the location of the item changes.
2. Under the 2PL and 3PL models, the value of a must become quite large (>1.7) before the curve is very steep.
3. Under the 1PL and 2PL with a large positive value of b, the lower tail approaches zero; under the 3PL, it approaches c.
4. c is not apparent when b < 0 and a < 1.0.
5. Under all models, curves with a negative value of a are the mirror image of curves with a positive value of a.

Item Characteristic Curve Models
RECAP
6. When b = -3.0, only the upper half of the item characteristic curve appears on the graph. When b = +3.0, only the lower half of the curve appears on the graph.
7. The slope of the item characteristic curve is steepest at the ability level corresponding to the item difficulty. Thus, the difficulty parameter b locates the point on the ability scale where the item functions best.
8. Under IRT, 'b' is the θ where P(θ) is .5 for the 1PL and 2PL models, and (1+c)/2 for the 3PL model. Only when c = 0 are these two definitions equivalent.

Estimating Item Parameters
•IRT analysis
•M examinees in J ability groups (θ_j) respond to the N items
•m_j examinees within group j, where j = 1, 2, 3, ..., J
•In the j-th group, r_j examinees answer the item correctly: p(θ_j) = r_j / m_j

Estimating Item Parameters
•Find the ICC that best fits the observed proportions of correct response
•First select a model for the curve to be fitted
•Let's fit the 2PL (a chi-square goodness-of-fit index is used to assess fit)
•The procedure is Maximum Likelihood Estimation (initial values b = 0, a = 1)

Estimating Item Parameters

χ² = Σ_j m_j [p(θ_j) - P(θ_j)]² / [P(θ_j) Q(θ_j)]   (summed over j = 1 to J)

The computed value was 28.88 and the criterion value was 45.91, so the two-parameter model with b = -.39 and a = 1.27 was a good fit to the observed proportions of correct response.
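Since the slide's group-level data are not given, here is only a generic sketch of the fit statistic (function and argument names are my own):

import math

def chi_square_fit(m, p_obs, thetas, a, b):
    """Chi-square comparing observed proportions correct p_obs (one per
    ability group of size m_j at theta_j) with a fitted 2PL curve."""
    chi2 = 0.0
    for m_j, p_j, theta in zip(m, p_obs, thetas):
        P = 1.0 / (1.0 + math.exp(-a * (theta - b)))   # model proportion
        chi2 += m_j * (p_j - P) ** 2 / (P * (1.0 - P))
    return chi2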

Estimating Item Parameters
•The group invariance of item parameters
•Two groups
•Group 1 [ability -3 to -1] // Group 2 [ability 1 to 3]

Estimating Item Parameters
•The group invariance of item parameters
•b(1) = b(2) = -.39 and a(1) = a(2) = 1.27
•Combined analysis

Estimating Item Parameters
•Group invariance is a powerful feature of IRT
•The values of the item parameters are a property of the item, not of the group that responded to the item
•Under classical test theory, just the opposite holds: the item difficulty of classical theory is the overall proportion of correct response
•An item with b = 0 may give a classical item difficulty index of .3 for a low-ability group and .8 for a high-ability group
•Clearly, the value of the classical item difficulty index is not group invariant
•Item difficulty has a consistent meaning in IRT

Estimating Item Parameters
•Caution
•The obtained numerical values will be subject to variation due to sample size, how well-structured the data are, and the goodness-of-fit of the curve to the data
•The item must be used to measure the same latent trait for both groups
•An item's parameters do not retain group invariance when taken out of context, i.e., when used to measure a different latent trait or with examinees from a population for which the test is inappropriate

Estimating Item Parameters
RECAP
1. The estimated item parameters usually gave a good overall fit to the observed proportions of correct response.
2. When two groups are employed, the same item characteristic curve will be fitted, regardless of the range of ability.
3. The number of examinees at each level does not affect the group-invariance property.
4. For an item with positive discrimination, the low-ability group involves the lower left tail of the ICC, and the high-ability group involves the upper right tail.
5. The item parameters were group invariant whether or not the ability ranges of the two groups overlapped.

Estimating Item Parameters
RECAP
6. It made no difference which group was the high-ability group; thus, group labeling is not a consideration.
7. The group-invariance principle holds for all three item characteristic curve models.
8. Item parameter estimates are subject to sampling variation.

•Examinee’s raw test score is obtained by adding up the item scores
N
TS
j=∑P
i(θ
j)
i=1
•eg for 4 items test : TS = .73 + .57 + .69 + .62 = 2.61
•Procedure is same for all three models
Test Characteristics CurveTest Characteristics Curve
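A sketch of the true-score computation in Python for a 2PL test (the item list and names below are my own, hypothetical choices):

import math

def true_score(theta, items):
    """TS(theta): sum of P_i(theta) over the test's (a, b) item pairs."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)

# e.g., a hypothetical 4-item test:
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
ts = true_score(0.0, items)   # a value between 0 and N = 4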

Test Characteristic Curve
•For the 1PL and 2PL, the left tail approaches zero and the right tail N
•TS = 0 (or, under the 3PL, the sum of the guessing parameters) => θ = -∞
•TS = N => θ = +∞

Test Characteristic Curve
•The TCC transforms ability scores to true scores
•The TCC is a monotonically increasing function (though shapes may differ: S-shaped, or increasing-plateau-increasing)
•In all cases, it will be asymptotic to a value of N in the upper tail
•The TCC depends upon a number of factors, including the number of items, the ICC model employed, and the values of the item parameters
•Caution: the TCC (like the ICC) does not depend on the distribution of scores

Test Characteristic Curve
•Interpretation: the ability level corresponding to TS = N/2 locates the test along the ability scale
•There is no explicit formula for the TCC, so there are no parameters for the curve

Test Characteristic Curve
RECAP
1. Relation of the true score and the ability level:
a. Given an ability level, the corresponding true score can be found via the test characteristic curve.
b. Given a true score, the corresponding ability level can be found via the test characteristic curve.
c. Both the true scores and ability are continuous variables.
2. Shape of the test characteristic curve:
a. When N = 1, the true score ranges from 0 to 1 and TCC = ICC.

Test Characteristic Curve
RECAP
b. The TCC may not be similar to an ICC (due to regions of varying steepness and plateaus); it reflects a mixture of item parameter values.
c. The ability level at which the mid-true score (N/2) occurs is an indicator of where the test functions on the ability scale.
d. When the values of the item difficulties have a limited range, the steepness of the TCC depends primarily upon the average value of the item discrimination parameters.
e. When the values of the item difficulties are spread widely over the ability scale, the steepness of the TCC will be reduced.

Test Characteristic Curve
RECAP
f. Under a three-parameter model, the lower limit of the true scores is the sum of the items' guessing parameters.
g. The shape of the test characteristic curve depends upon the number of items, the ICC model, and the mix of values of the item parameters.
3. It would be possible to construct a test characteristic curve that decreases as ability increases. To do so would require items with negative discrimination for the correct response. Such a test would not be considered a good test, because the higher an examinee's ability level, the lower the expected score for the examinee.

Estimating an Examinee's Ability
•To locate a person on the ability scale (θ = ?)
•Individual's ability
•Comparisons among individuals' abilities
•The list of 1's and 0's for the N items is called the examinee's item response vector
•Use this item response vector and the known item parameters to estimate the examinee's θ
•Maximum likelihood procedures are used to estimate an examinee's ability
•The process begins with some a priori value for the ability of the examinee and the known values of the item parameters. These are used to compute the probability of correct response to each item for that examinee.

Estimating an Examinee's Ability
•This procedure is based upon an approach that treats each examinee separately. The ability estimate is adjusted iteratively (a Python sketch follows the worked example below):

θ_(s+1) = θ_s + [ Σ a_i (u_i - P_i(θ_s)) ] / [ Σ a_i² P_i(θ_s) Q_i(θ_s) ]   (sums over i = 1, ..., N)

where:
θ_s is the estimated ability of the examinee within iteration s
a_i is the discrimination parameter of item i, i = 1, 2, ..., N
u_i is the response made by the examinee to item i: 1 or 0
P_i(θ_s) is the probability of correct response to item i, under the given item characteristic curve model, at ability level θ_s within iteration s
Q_i(θ_s) = 1 - P_i(θ_s)

Estimating an Examinee's Ability
•A three-item test:
item 1: b = -1, a = 1.0
item 2: b = 0, a = 1.2
item 3: b = 1, a = .8
•The examinee's item responses were:
item 1: 1
item 2: 0
item 3: 1
•The a priori estimate of the examinee's ability is set to θ_s = 1.0

Estimating an Examinee's Ability
First iteration:

item  u    P    Q    a(u-P)  a²PQ
1     1   .88  .12   .119    .105
2     0   .77  .23  -.922    .255
3     1   .50  .50   .400    .160
sum                 -.403    .520

Δθ_s = -.403/.520 = -.773, θ_(s+1) = 1.0 - .773 = .227
Second iteration:
Δθ_s = .066/.674 = .097, θ_(s+1) = .227 + .097 = .324
Third iteration:
Δθ_s = .0006/.6615 = .0009, θ_(s+1) = .324 + .0009 = .3249
At this point, the process is terminated because the value of the adjustment (.0009) is very small. Thus, the examinee's estimated ability is .33.
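The iterations above can be reproduced with a short Newton-Raphson sketch in Python (function names are my own):

import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items, u, theta=1.0, tol=0.001, max_iter=25):
    """Ability estimate from known (a, b) item parameters and 0/1
    responses u, using the adjustment formula from the slides."""
    for _ in range(max_iter):
        num = sum(a * (ui - p_2pl(theta, a, b)) for (a, b), ui in zip(items, u))
        den = sum(a * a * p_2pl(theta, a, b) * (1.0 - p_2pl(theta, a, b))
                  for a, b in items)
        delta = num / den
        theta += delta
        if abs(delta) < tol:
            break
    return theta, 1.0 / math.sqrt(den)   # estimate and its standard error

# The slide's three-item example converges to about .33 with SE about 1.23:
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
theta_hat, se = estimate_theta(items, [1, 0, 1])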

Estimating an Examinee's Ability
•Unfortunately, there is no way to know the examinee's actual ability parameter; the best one can do is estimate it.

SE(θ) = 1 / sqrt( Σ a_i² P_i(θ_s) Q_i(θ_s) )

SE(θ) = 1 / sqrt(.6615) = 1.23
•The examinee's ability was not estimated very precisely, because the standard error is very large. This is primarily because only three items were used.

Estimating an Examinee's Ability
•Two cases where the MLE procedure fails:
•When an examinee answers none of the items correctly, the corresponding ability estimate is negative infinity
•When an examinee answers all the items in the test correctly, the corresponding ability estimate is positive infinity
•In both of these cases it is impossible to obtain an ability estimate for the examinee (the computer literally cannot compute a number as big as infinity)

Estimating an Examinee's Ability
•Item invariance of an examinee's ability estimate
•Different sets of items should yield the same ability estimate, within sampling variation; in each set, a different part of the ICC is involved
•This principle rests upon two conditions:
•All the items measure the same underlying latent trait
•The values of all the item parameters are in a common metric

Estimating an Examinee's Ability
•Implication of item invariance of an examinee's ability estimate
•A test located anywhere along the ability scale can be used to estimate an examinee's ability
•An examinee could take a test that is "easy" or a test that is "hard" and obtain, on average, the same estimated ability
•This is in sharp contrast to classical test theory, where such an examinee would get a high test score on the easy test and vice versa
•Under item response theory, the examinee's ability is fixed and invariant with respect to the items used to measure it

Estimating an Examinee's Ability
•Caution: an examinee's ability is fixed relative to a given context
•However, if the examinee received remedial instruction between testings, or if there were carryover (memorization) effects, the examinee's underlying ability level would be different for each testing
•The item invariance of an examinee's ability and the group invariance of an item's parameters are two facets of what is referred to, generically, as the invariance principle of item response theory

Estimating an Examinee's Ability
RECAP
1. Distribution of estimated ability:
a. The standard error of the estimates can be quite large when the items are not located near the ability of the examinee.
b. When the values of the item discrimination indices are large, the standard error of the ability estimates is small; when the item discrimination indices are small, the standard error of the ability estimates is large.
c. The optimum set of items for estimating an examinee's ability would have all its item difficulties equal to the examinee's ability parameter and have items with large values of the item discrimination indices.

Estimating an Examinee's Ability
RECAP
2. Item invariance of the examinee's ability:
a. The different sets of items yielded values of estimated ability that were near the examinee's actual ability level.
b. The mean value of these estimates generally was a close approximation of the examinee's ability parameter.
3. Each examinee has an ability score (parameter value) that locates that person (estimated ability) on the scale.

The Information Function
•The statistical meaning of information is credited to Sir R.A. Fisher, who defined information as the precision (the reciprocal of the variance) with which a parameter can be estimated:

I = 1/σ², where σ ~ SE(θ)

The Information Function
•I(θ) is maximum at θ = -1.0 and is about 3 for the ability range -2 <= θ <= 0; within this range, ability is estimated with some precision
•I(θ) does not depend upon the distribution of examinees over the ability scale
•The ideal I(θ) would be a horizontal line at some large value: hard to achieve
•The precision with which an examinee's ability is estimated depends upon where the examinee's ability is located on the ability scale, with implications for test construction

The Information Function
•Item information function: I_i(θ), where i indexes the item
•It will be small, since a single item is involved

The Information Function
•An item measures ability with greatest precision at the ability level corresponding to the item's difficulty parameter
•The amount of item information decreases as the ability level departs from the item difficulty, and approaches zero at the extremes of the ability scale
•Test information function:

I(θ) = Σ I_i(θ), summed over i = 1 to N

The Information Function
•I(θ) is much higher than I_i(θ)
•The longer the test, the higher I(θ) and the greater the precision in measuring an examinee's ability, compared with shorter tests
•Thus, ability is estimated with some precision near the center of the ability scale
•I(θ) tells how well the test is doing in estimating ability over the whole range of ability scores
•While the ideal test information function may often be a horizontal line, it may not be the best for a specific purpose

The Information Function
•For example, if you were interested in constructing a test to award scholarships, this ideal might not be optimal. In this situation, you would like to measure ability with considerable precision at ability levels near the cutoff that separates those who will receive the scholarship from those who will not.
•The best test information function in this case would have a peak at the cutoff score
•Other specialized uses of tests could require other forms of the test information function

The Information Function
Information functions:

2PL: I_i(θ) = a_i² P_i(θ) Q_i(θ)
     P_i(θ) = 1 / (1 + exp(-a_i(θ - b_i)))
     Q_i(θ) = 1 - P_i(θ)

1PL: I_i(θ) = P_i(θ) Q_i(θ)
     Q_i(θ) = 1 - P_i(θ)

3PL: I_i(θ) = a_i² [Q_i(θ) / P_i(θ)] [(P_i(θ) - c)² / (1 - c)²]
     P_i(θ) = c + (1 - c) / (1 + exp(-a_i(θ - b_i)))
     Q_i(θ) = 1 - P_i(θ)
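A sketch of the three information functions in Python (function names are my own); note that setting c = 0 reduces the 3PL formula to the 2PL one, and a = 1, c = 0 to the 1PL one:

import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """3PL item information: a^2 * (Q/P) * ((P - c)/(1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def info_2pl(theta, a, b):
    return info_3pl(theta, a, b, 0.0)    # reduces to a^2 * P * Q

def info_1pl(theta, b):
    return info_3pl(theta, 1.0, b, 0.0)  # reduces to P * Q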

The Information Function
Information function example: 3PL (b = 1.0, a = 1.5, c = .2)

θ    L     P_i(θ)  Q_i(θ)  Q/P    (P_i(θ)-c)²  I_i(θ)
-3  -6.0   .20     .80     3.950  .000         .000
-2  -4.5   .21     .79     3.785  .000         .001
-1  -3.0   .24     .76     3.202  .001         .016
 0  -1.5   .35     .65     1.890  .021         .142
 1   0.0   .60     .40     .667   .160         .375
 2   1.5   .85     .15     .171   .428         .257
 3   3.0   .96     .04     .040   .481         .082

•The general level of the values for the amount of information is lower (because of the presence of the terms (1-c) and (P_i(θ)-c))
•The maximum occurred at an ability level slightly higher than the value of b
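The table can be checked with a few lines of Python (a sketch; variable names are my own):

import math

a, b, c = 1.5, 1.0, 0.2   # the slide's 3PL item
for theta in range(-3, 4):
    L = a * (theta - b)
    p = c + (1.0 - c) / (1.0 + math.exp(-L))
    q = 1.0 - p
    info = a * a * (q / p) * (p - c) ** 2 / (1.0 - c) ** 2
    print(f"{theta:3d} {L:5.1f} {p:4.2f} {q:4.2f} {q/p:6.3f} {(p - c)**2:5.3f} {info:5.3f}")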

The Information Function
•When two items share common values of a and b, their information functions are the same when c = 0
•When c > 0, the three-parameter model will always yield less information
•Thus, the item information function under a two-parameter model defines the upper bound for the amount of information under a three-parameter model
•This is reasonable, because getting the item correct by guessing should not enhance the precision with which an ability level is estimated

The Information Function
Computing a test information function: a five-item test under the 2PL

Item   b     a
1     -1.0   2.0
2     -0.5   1.5
3      0.0   1.5
4      0.5   1.5
5      1.0   2.0

θ     1     2     3     4     5     Test information
-3   .071  .051  .024  .012  .001    .159
-2   .420  .194  .102  .051  .010    .777
-1  1.000  .490  .336  .194  .071   2.091
 0   .420  .490  .563  .490  .420   2.383
 1   .071  .194  .336  .490 1.000   2.091
 2   .010  .051  .102  .194  .420    .777
 3   .001  .012  .024  .051  .071    .159
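A sketch reproducing the table (variable names are my own): each cell is the 2PL item information a² P Q, and the last column is their sum.

import math

items = [(2.0, -1.0), (1.5, -0.5), (1.5, 0.0), (1.5, 0.5), (2.0, 1.0)]  # (a, b)
for theta in range(-3, 4):
    infos = []
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        infos.append(a * a * p * (1.0 - p))
    cells = " ".join(f"{x:5.3f}" for x in infos)
    print(f"{theta:3d} {cells}  {sum(infos):5.3f}")  # SE at the peak: 1/sqrt(2.383) = .65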

The Information Function
•The five item discriminations had a symmetrical distribution around a value of 1.5
•The five item difficulties had a symmetrical distribution about an ability level of zero
•Because of this, the test information function is also symmetric about an ability of zero

The Information Function
σ ~ SE(θ) = 1 / sqrt(I(θ))
•A peaked I(θ) measures ability with unequal precision along the ability scale; it is best for estimating the ability of examinees whose abilities fall near the peak of the test information function
•In some tests, I(θ) is rather flat over some region of the ability scale; such a test is desirable for examinees whose ability falls in that range
•The maximum amount of test information was 2.383, at an ability level of 0
•This translates into a standard error of .65
•Roughly 68 percent of the estimates of this ability level fall between -.65 and +.65, so the ability level is estimated with a modest amount of precision

The Information Function
RECAP
1. The general level of the test information function depends upon:
a. The number of items in the test.
b. The average value of the discrimination parameters of the test items.
c. Both of the above hold for all three item characteristic curve models.
2. The shape of the test information function depends upon:
a. The distribution of the item difficulties over the ability scale.
b. The distribution and the average value of the discrimination parameters of the test items.
3. When the item difficulties are clustered closely around a given value, the test information function is peaked at that point on the ability scale. The maximum amount of information depends upon the values of the discrimination parameters.

RECAP
4. When the item difficulties are widely distributed over the ability scale, the test information function tends to be flatter than when the difficulties are tightly clustered.
5. Values of a < 1.0 result in a low general level of the amount of test information.
6. Values of a > 1.7 result in a high general level of the amount of test information.
7. Under a three-parameter model, values of the guessing parameter c greater than zero lower the amount of test information at the low-ability levels. In addition, large values of c reduce the general level of the amount of test information.
8. It is difficult to approximate a horizontal test information function. To do so, the values of b must be spread widely over the ability scale, and the values of a must be in the moderate to low range and have a U-shaped distribution.
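Points 3 and 4 are easy to see numerically. A small sketch comparing two hypothetical five-item pools (both with a = 1.0, same no-scaling-constant logistic as above); the item sets are my own illustrative choices:

```python
import math

def item_info(theta, b, a=1.0):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

clustered = (-0.2, -0.1, 0.0, 0.1, 0.2)   # difficulties tight around zero
spread = (-2.0, -1.0, 0.0, 1.0, 2.0)      # difficulties widely distributed

for theta in (-2, -1, 0, 1, 2):
    tc = sum(item_info(theta, b) for b in clustered)
    ts = sum(item_info(theta, b) for b in spread)
    print(f"theta = {theta:+d}:  clustered I = {tc:.3f},  spread I = {ts:.3f}")
# The clustered pool peaks sharply at zero; the spread pool is flatter,
# with a lower maximum.
```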
Test Calibration
Test constructors know beforehand
 what trait they want the items to measure, and
 the ability level of examinees at which the items are designed to function.
But
 it is not possible to determine the values of an item's parameters a priori, and
 when a test is administered, the latent trait levels of the examinees are not known.
As a result, a major task is
 to determine the values of the item parameters and examinee abilities in a
 metric for the underlying latent trait.
In IRT, this task is called test calibration.
The Test Calibration Process (Birnbaum, 1968)
Two stages of maximum likelihood estimation.
Stage 1: the parameters of the N items in the test are estimated.
• The estimated ability of each examinee is treated as if it were expressed in the true metric of the latent trait.
• Then the parameters of each item in the test are estimated via the maximum likelihood procedure.
Each item's parameters are estimated one item at a time (independence assumption).
The Test Calibration Process (Birnbaum, 1968)
Two stages of maximum likelihood estimation.
Stage 2: the ability parameters of the M examinees are estimated.
• Taking these item parameter estimates as known, the ability of each examinee is estimated using the maximum likelihood procedure presented earlier.
• The ability estimates are obtained one examinee at a time (independence assumption).
The two-stage process is repeated until some suitable convergence criterion is met.
The overall effect is that the parameters of the N test items and the ability levels of the M examinees have been estimated simultaneously.
• The Birnbaum paradigm does not yield a unique metric for the ability scale:
  - the midpoint and the unit of measurement of the obtained ability scale are indeterminate;
  - the metric is unique only up to a linear transformation.
• It is therefore necessary to "anchor" the metric via arbitrary rules for determining the midpoint and unit of measurement of the ability scale.
• It is not possible to obtain estimates of the examinees' abilities and of the items' parameters in the true metric of the underlying latent trait.
• The best we can do is obtain a metric that depends upon a particular combination of examinees and test items.
With three different ICC models to choose from, there are several different customised ways to implement the Birnbaum paradigm; you have to devise your own implementation.
Illustration: the BICAL program, as implemented by Benjamin Wright and his co-workers for the 1PL (Rasch) model.
A ten-item test administered to a group of 16 examinees.
ITEM RESPONSES
Raw scores (RS) of the 16 examinees on the ten-item test (dichotomously scored):

Examinee:  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
RS:         2  2  5  4  1  3  4  4  4  3  9  9  6  9  9 10
Examinee 16 answered all of the items correctly, so he was removed from the data set.
If an item is answered correctly by all of the examinees, or by none of them, its item difficulty parameter cannot be estimated.

FREQUENCY COUNTS FOR EDITED DATA
Score   Counts by item               Row total
1       1                                 1
2       1 2 1                             4
3       2 1 1 1 1                         6
4       4 1 2 2 3 1 1 2                  16
5       1 1 1 1 1                         5
6       1 1 1 1 1 1                       6
9       4 4 2 4 4 4 4 4 4 2              36
-----------------------------------------------------------
Column totals (items 1-10): 13 8 8 5 10 7 7 6 7 3;  grand total 74
• Given the two frequency vectors (row and column totals), the estimation process can be implemented.
• Under the Rasch model, the anchoring procedure takes advantage of a = 1:
  - the unit of measurement for the estimated abilities is set at 1;
  - each item's difficulty is estimated.
• To set the midpoint of the ability scale, the mean item difficulty is subtracted from each item's difficulty estimate:
  - the midpoint of the item difficulties becomes 0, and the same holds for the ability estimates.
• The ability estimate corresponding to each raw test score is obtained in the second stage, using the rescaled item difficulties (as if they were the difficulty parameters) and the vector of row marginal totals:
  - the metric of the ability estimates thereby matches that of the rescaled item difficulties.
• The output of this stage is an ability estimate for each raw test score in the data set.
At this point, the convergence of the overall iterative process is checked.
Wright summed the absolute differences between the values of the item difficulty parameter estimates on two successive iterations of the paradigm:
 if this sum was less than .01, the estimation process was terminated;
 if it was greater than .01, another iteration was performed and the two stages were done again.
Thus the cycle of stage one, anchoring the metric, stage two, and checking for convergence is repeated until the criterion is met.
When this happens, the current values of the item and ability parameter estimates are accepted, and an ability scale metric has been defined.
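The whole paradigm fits in a short script. A minimal sketch of the two-stage loop for the Rasch model; the simulated data, the damped Newton-Raphson helper, and the starting values are all illustrative choices of mine, not BICAL's actual routine:

```python
import math
import random

def p_rasch(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Toy data: 30 simulated examinees taking an 8-item test.
random.seed(1)
true_b = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]
sim_theta = [random.gauss(0, 1) for _ in range(30)]
data = [[1 if random.random() < p_rasch(t, b) else 0 for b in true_b]
        for t in sim_theta]

# Edit the data: items answered by everyone or no one, and examinees with
# zero or perfect scores, have no finite maximum likelihood estimates.
keep = [j for j in range(len(true_b))
        if 0 < sum(row[j] for row in data) < len(data)]
data = [[row[j] for j in keep] for row in data]
data = [row for row in data if 0 < sum(row) < len(keep)]
n = len(keep)

def newton(f, fp, x, steps=25):
    """A few damped Newton-Raphson steps for f(x) = 0."""
    for _ in range(steps):
        step = f(x) / fp(x)
        x -= max(-1.0, min(1.0, step))   # damping keeps the sketch stable
    return x

b = [0.0] * n
theta = [math.log(sum(r) / (n - sum(r))) for r in data]   # crude logit start

for it in range(100):
    b_old = b[:]
    # Stage 1: item difficulties, treating the current abilities as known.
    for j in range(n):
        s = sum(row[j] for row in data)
        b[j] = newton(lambda x: sum(p_rasch(t, x) for t in theta) - s,
                      lambda x: -sum(p_rasch(t, x) * (1 - p_rasch(t, x))
                                     for t in theta), b[j])
    # Anchor the metric: force the mean item difficulty to zero.
    m = sum(b) / n
    b = [x - m for x in b]
    # Stage 2: abilities, treating the item difficulties as known.
    for i, row in enumerate(data):
        r = sum(row)
        theta[i] = newton(lambda x: sum(p_rasch(x, d) for d in b) - r,
                          lambda x: sum(p_rasch(x, d) * (1 - p_rasch(x, d))
                                        for d in b), theta[i])
    # Convergence: Wright's criterion on the item difficulty estimates.
    if sum(abs(x - y) for x, y in zip(b, b_old)) < 0.01:
        break

print(f"stopped after {it + 1} iterations")
print("difficulty estimates:", [round(x, 2) for x in b])
```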
ITEM PARAMETER ESTIMATES
Item   Difficulty
1      -2.37
2      -0.27
3      -0.27
4      +0.98
5      -1.00
6      +0.11
7      +0.11
8      +0.52
9      +0.11
10     +2.06
Because of the anchoring procedure used, these values are actually relative to the average item difficulty of the test for these examinees.
You can verify that the sum of the item difficulties is zero (within rounding error).
ABILITY ESTIMATION
Examinee   Ability estimate   Raw score
1          -1.50              2
2          -1.50              2
3          +0.02              5
4          -0.42              4
5          -2.37              1
6          -0.91              3
7          -0.42              4
8          -0.42              4
9          -0.42              4
10         -0.91              3
11         +2.33              9
12         +2.33              9
13         +0.46              6
14         +2.33              9
15         +2.33              9
16         *****              10
All examinees with the same raw score obtained the same ability estimate.
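That one-estimate-per-raw-score property can be checked directly: under the Rasch model the ability MLE solves the equation sum of P_i(θ) = r, which depends only on the raw score r. A sketch using the difficulty estimates above; this is plain Newton-Raphson rather than BICAL's exact procedure, so the values need not match the table precisely:

```python
import math

# Item difficulty estimates from the table above.
b = [-2.37, -0.27, -0.27, 0.98, -1.00, 0.11, 0.11, 0.52, 0.11, 2.06]

def theta_for_raw_score(r, steps=30):
    """Solve sum of P_i(theta) = r by Newton-Raphson (Rasch model)."""
    theta = math.log(r / (len(b) - r))          # standard logit starting value
    for _ in range(steps):
        p = [1.0 / (1.0 + math.exp(-(theta - bj))) for bj in b]
        theta -= (sum(p) - r) / sum(pi * (1.0 - pi) for pi in p)
    return theta

# One ability per raw score: the item response pattern never enters.
for r in range(1, 10):
    print(f"raw score {r}: theta-hat = {theta_for_raw_score(r):+.2f}")
```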
• This unique feature is a direct consequence of fixing a = 1 for all items:
  - part of the intuitive appeal of the Rasch model.
• When 2PL or 3PL item characteristic curves are used, an examinee's ability estimate depends upon the particular pattern of item responses rather than the raw score:
  - under these models, examinees with the same item response pattern will obtain the same ability estimate;
  - examinees with the same raw score could obtain different ability estimates if they answered different items correctly.
Illustration:
Three different ten-item tests measuring the same latent trait will be used. A common group of 16 examinees will take all three tests.
The tests were created so that the average difficulty of the first test was matched to the mean ability of the common group of examinees.
The second test was created to be an easy test for this group.
The third test was created to be a hard test for this group.
Each of these test-group combinations will be subjected to the Birnbaum paradigm and calibrated separately.
Each test calibration yields a unique metric for the ability scale.
Due to the anchoring process, all three test calibrations yielded a mean item difficulty of zero.
Within each calibration, examinees with the same raw test score obtained the same estimated ability.
However, a given raw score will not yield the same estimated ability across the three calibrations.
The mean estimated ability is expressed relative to the mean item difficulty of the test.
Thus, the easy test should yield a positive mean ability, the hard test a negative mean ability, and the matched test a mean ability near zero.
Test results can nevertheless be placed on a common ability scale.
Putting the Three Tests on a Common Ability Scale (Test Equating)
The principle of the item invariance of an examinee's ability indicates that an examinee should obtain the same ability estimate regardless of the set of items used.
However, in the three test calibrations done above, this did not hold.
The problem is not in the invariance principle, but in the test calibrations.
The invariance principle assumes that the values of the item parameters of the several sets of items are all expressed in the same ability-scale metric. In the present situation, there are three different ability scales, one from each of the calibrations.
The average difficulties of these tests were intended to be different, but the anchoring process forced each test to have a mean item difficulty of zero.
The mean ability of the common group was .06 on the matched test, .44 on the easy test, and -.11 on the hard test.
This tells us that the mean ability from the matched test is about what it should be.
The mean from the easy test tells us that the average ability is above the mean item difficulty of that test.
The mean ability from the hard test is below the mean item difficulty.
We can use the mean abilities to position the tests on a common scale.
But which particular test calibration should serve as the baseline?
The matched test: this calibration yielded a mean ability of .062 and a mean item difficulty of zero.
Because the Rasch model was used, the unit of measurement for all three calibrations is unity. Therefore, bringing the easy and hard test results to the baseline metric only involves adjusting for the differences in midpoints.
Easy Test
The shift factor needed is the difference between the mean estimated ability of the common group on the easy test (.444) and on the matched test (.062), which is .382.
To convert the item difficulties of the easy test to the baseline metric, one simply subtracts .382 from each item difficulty.
Similarly, each examinee's ability can be expressed in the baseline metric by subtracting .382 from it.
Hard Test
The hard test results can be expressed in the baseline metric by using the difference in mean ability. The shift factor is -.111 - .062 = -.173.
Again, subtracting this value from each of the item difficulty estimates puts them in the baseline metric.
The ability estimates of the common group yielded by the hard test can be transformed to the baseline metric of the matched test by using the same shift factor.
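The equating itself is just a constant shift. A tiny sketch; the unshifted easy-test difficulty of -1.110 used in the example is not given on the slides and is inferred by working backward from item 1 of the equated table below:

```python
# Common-group mean abilities from the three separate calibrations.
mean_matched, mean_easy, mean_hard = 0.062, 0.444, -0.111

shift_easy = mean_easy - mean_matched     # .382
shift_hard = mean_hard - mean_matched     # -.173

# Subtracting the shift expresses an easy-test value in the baseline metric.
print(round(-1.110 - shift_easy, 3))      # -1.492, item 1 in the table below
print(round(shift_hard, 3))               # -0.173
```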
Item difficulties in the baseline metric

Item   Easy test   Matched test   Hard test
1      -1.492      -2.37          *****
2      -1.492       -.27          -.037
3      -2.122       -.27          -.497
4       -.182        .98          -.497
5       -.562      -1.00           .963
6       +.178        .11          -.497
7        .528        .11           .383
8        .582        .52          1.533
9        .880        .11           .443
10       .880       2.06          *****
Mean    -.285        .00           .224
Ability estimates of the common group in the baseline metric

Examinee   Easy test   Matched test   Hard test
1          -2.900      -1.50          *****
2           -.772      -1.50          *****
3          -1.96        +.02          *****
4           -.292       -.42          -.877
5           -.292      -2.37          -.877
6            .168       -.91          *****
7           1.968       -.42          -1.637
8            .168       -.42          -.877
9            .638       -.42          -1.637
10           .638       -.91          -.877
11           .638       2.33           .153
12          1.188       2.33           .153
13           .292        .46           .153
14          1.968       2.33          2.003
15          *****       2.33          1.213
16          *****       *****         2.003
Mean         .062        .062           .062
Std. Dev.   1.344       1.566          1.413
After transformation:
• The matched test has its mean difficulty at the midpoint of the baseline ability scale.
• The easy test has a negative mean difficulty.
• The hard test has a positive mean difficulty.
• The average difficulty of both tests is about the same distance from the middle of the scale.
In technical terms, we have "equated" the tests, i.e., put them on a common scale.
• The mean estimated ability of the common group was the same for all three tests.
• The standard deviations of the ability estimates were nearly the same for the easy and hard tests, and that for the matched test was in the ballpark.
• Although the summary statistics were quite similar for all three sets of results, the ability estimates for a given examinee varied widely.
• The invariance principle has not gone awry; what you are seeing is sampling variation.
Given the small size of the data sets, it is quite amazing that the results came out as nicely as they did.
This demonstrates rather clearly the powerful capabilities of the Rasch model and Birnbaum's MLE paradigm.
• After equating, the numerical values of the item parameters can be used to compare where different items function on the ability scale.
• The examinees' estimated abilities are also expressed in this metric and can be compared.
• It is also possible to compute the test characteristic curve and the test information function for the easy and hard tests in the baseline metric.
• Technically speaking, the tests were equated using the common-group approach with tests of different difficulty.
• The ease with which test equating can be accomplished is one of the major advantages of item response theory over classical test theory.
RECAP
1. The end product of the test calibration process is the definition of an ability scale metric.
2. Under the Rasch model, this scale has a unit of measurement of 1 and a midpoint of zero.
3. However, it is not the metric of the underlying latent trait. The obtained metric depends upon the item responses yielded by a particular combination of examinees and test items being subjected to the Birnbaum paradigm.
4. Since the true metric of the underlying latent trait cannot be determined, the metric yielded by the Birnbaum paradigm is used as if it were the true metric. The obtained item difficulty values and the examinees' abilities are interpreted in this metric. Thus, the test has been calibrated.
5. The outcome of the test calibration procedure is to locate each examinee and item along the obtained ability scale.
6. In the present example, item 5 had a difficulty of -1 and examinee 10 had an ability estimate of -.91. Therefore, the probability of examinee 10 answering item 5 correctly is approximately .5.
7. The capability to locate items and examinees along a common scale is a powerful feature of item response theory: a single consistent framework.
Specifying the characteristics of a test
During this transitional period in testing practices, many tests have been
designed and constructed using classical test theory principles but have
been analyzed via item response theory procedures.
This lack of congruence between the construction and analysis procedures
has kept the full power of item response theory from being exploited.
In order to obtain the many advantages of item response theory, tests
should be designed, constructed, analyzed, and interpreted within the
framework of the theory.
• Under item response theory, a well-defined set of procedures (item banking) is used to establish and maintain item pools.
• Item parameters are expressed in a known ability-scale metric.
• It is possible to select items from the item pool and determine the major technical characteristics of a test before it is administered.
• If the test characteristics do not meet the design goals, selected items can be replaced by other items from the item pool until the desired characteristics are obtained.
• In this way, considerable time and money that would ordinarily be devoted to piloting the test are saved.
• First, define the latent trait the items are to measure, write items to measure this trait, and pilot test the items to weed out poor ones.
• This large set of items is then administered to a large group of examinees.
• An item characteristic curve model is selected, the examinees' item response data are analyzed via the Birnbaum paradigm, and the test is calibrated.
• The ability scale resulting from this calibration is considered to be the baseline metric of the item pool.
• From a test construction point of view, we now have a set of items whose item parameter values are known; in technical terms, a "precalibrated item pool" exists.
Developing a Test From a Precalibrated Item Pool
Since the items in the precalibrated item pool measure a specific latent trait, tests constructed from it will also measure this trait.
Alternate forms are routinely needed to maintain test security, and special versions of the test can be used to award scholarships.
In such cases, items would be selected from the item pool on the basis of their content and their technical characteristics to meet the particular testing goals.
The advantage of having a precalibrated item pool is that the parameter values of the items included in the test can be used to compute the test characteristic curve and the test information function before the test is administered.
This is possible because neither of these curves depends upon the distribution of examinee ability scores over the ability scale.
Given these two curves, the test constructor has a very good idea of how the test will perform before it is given to a group of examinees.
In addition, when the test has been administered and calibrated, test equating procedures can be used to express the ability estimates of the new group of examinees in the metric of the item pool.
Screening tests
are designed to distinguish rather sharply between examinees whose abilities are just below a given ability level and those who are at or above that level:
 - used to award scholarships
 - and to assign students to specific instructional programs.
Broad-range tests
are designed to measure ability over a wide range of the underlying ability scale:
 - tests measuring reading or mathematics are typically broad-range tests.
Peaked tests
are designed to measure ability well in a range of ability that is wider than that of a screening test, but not as wide as that of a broad-range test.
Some ground rules:
a. It is assumed that items would be selected on the basis of content as well as parameter values.
b. No two items in the item pool possess exactly the same combination of item parameter values.
c. The item parameter values are subject to the following constraints:
   -3 <= b <= +3,   .50 <= a < 2.00,   0 <= c <= .35
The values of the discrimination parameter have been restricted to reflect the range of values usually seen in well-maintained item pools.
Example Case
You are to construct a ten-item screening test that will separate examinees into two groups: those who need remedial instruction and those who don't, on the ability measured by the items in the item pool. Students whose ability falls below a value of -1 will receive the instruction.
Solution:
Item 1: b = -1.8, a = 1.2      Item 2: b = -1.6, a = 1.4
Item 3: b = -1.4, a = 1.1      Item 4: b = -1.2, a = 1.3
Item 5: b = -1.0, a = 1.5      Item 6: b = -.8,  a = 1.0
Item 7: b = -.6,  a = 1.4      Item 8: b = -.4,  a = 1.2
Item 9: b = -.2,  a = 1.1      Item 10: b = 0.0, a = 1.3
The logic underlying these choices was one of centering the difficulties on the cutoff level of -1 and using moderate values of discrimination.
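Because the pool is precalibrated, this design can be evaluated before the test is administered. A sketch computing the test information function for the ten items above, under the same no-scaling-constant 2PL convention used earlier; the printed values are what that formula yields, not figures from the slides:

```python
import math

# The ten screening-test items chosen above, as (b, a) pairs under the 2PL.
items = [(-1.8, 1.2), (-1.6, 1.4), (-1.4, 1.1), (-1.2, 1.3), (-1.0, 1.5),
         (-0.8, 1.0), (-0.6, 1.4), (-0.4, 1.2), (-0.2, 1.1), (0.0, 1.3)]

def test_info(theta):
    total = 0.0
    for b, a in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total

for theta in (-3, -2, -1, 0, 1):
    print(f"theta = {theta:+d}:  I = {test_info(theta):.2f}")
# The function peaks at the cutoff (I(-1) is roughly 3.5), so the test is
# positioned correctly, but the moderate a values keep the peak modest.
```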
The mid-true score corresponded to an ability level of -1.0.
The test characteristic curve was not particularly steep at the cutoff level, indicating that the test lacked discrimination.
The peak of the information function occurred at an ability level of -1.0, but the maximum was a bit small.
The results suggest that the test was properly positioned on the ability scale, but that a better set of items could be found.
The following changes would improve the test's characteristics:
first, cluster the values of the item difficulties nearer the cutoff level;
second, use larger values of the discrimination parameters.
These two changes should steepen the test characteristic curve and increase the maximum amount of information at the ability level of -1.0.
RECAP
1. Screening tests.
   1. The desired test characteristic curve has the mid-true score at the specified cutoff ability level. The curve should be as steep as possible at that ability level.
   2. The test information function should be peaked, with its maximum at the cutoff ability level.
   3. The optimal case is where all item difficulties are at the cutoff point and the item discriminations are large.
   4. Select items that yield the maximum amount of information at the cutoff point.
2. Broad-range tests.
   1. The desired test characteristic curve has its mid-true score at an ability level corresponding to the midpoint of the range of ability of interest.
   2. Most often this is an ability level of zero.
   3. The test characteristic curve should be linear for most of its range.
   4. The desired test information function is horizontal over the widest possible range.
   5. The maximum amount of information should be as large as possible.
   6. The values of the item difficulty parameters should be spread uniformly over the ability scale, and as widely as practical.
3. Peaked tests.
   1. The desired TCC has its mid-true score in the middle of the ability range of interest. The curve should have a moderate slope at that ability level.
   2. The test information function should be rounded in appearance over the ability range of most interest.
   3. The item difficulties should be clustered around the midpoint of the ability range of interest, but not as tightly as in the case of a screening test.
   4. The values of the discrimination parameters should be large.
   5. Items whose difficulties are within the ability range of interest should have larger values of discrimination than the other items.
4. Role of item characteristic curve models.
   1. With a fixed at 1.0, the Rasch model has a limit placed upon the maximum amount of information that can be obtained.
   2. The maximum amount of item information is .25, since Pi(θ)Qi(θ) = .25 when Pi(θ) = .5. The maximum amount of information for a test under the Rasch model is therefore .25 times the number of items.
   3. Due to the presence of the guessing parameter, a 3PL model will yield a more linear test characteristic curve, and a test information function with a lower general level, than a 2PL model with the same values of a and b.
   4. The information function under a 2PL model is the upper bound for the information function under a 3PL model when the values of b and a are the same.
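Point 2 gives a quick back-of-the-envelope bound; a tiny sketch (the test lengths are arbitrary):

```python
import math

# Rasch item information is P(1 - P), which peaks at .25 when P = .5, so an
# n-item Rasch test can never yield more than .25 * n units of information.
for n in (10, 25, 50):
    max_info = 0.25 * n
    print(f"n = {n:2d}: max I = {max_info:5.2f}, smallest possible SE = "
          f"{1.0 / math.sqrt(max_info):.2f}")
```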
5. Role of the number of items.
Increasing the number of items has little impact upon the general form of the TCC if the distribution of the sets of item parameters remains the same.
Increasing the number of items in a test has a significant impact upon the general level of the test information function (TIF).
Select items having high values of a and a distribution of item difficulties consistent with the testing goals.
The pairing of item parameters is vital: e.g., a high-a item whose difficulty is not in the range of interest does little for the test information function or the slope of the TCC.
It is important to examine the ICC and the IIF to ascertain an item's contribution to the TCC and to the TIF.
Partial Credit Model
Computer Adaptive Test
• The computer can update the estimate of the examinee's ability after each item, which is then used to select the subsequent item.
• With the right item bank and high variance in examinee ability, a CAT can be much more efficient than a traditional paper-and-pencil test.
• Paper-and-pencil tests:
  • are "fixed-item" tests;
  • everyone takes every item;
  • items that are too easy or too hard act like constants added to the score;
  • such items provide little information about the examinee's ability level;
  • large numbers of items and examinees are needed to obtain a modest degree of precision.
Computer adaptive tests
• The examinee's ability level relative to a norm group can be estimated iteratively.
• Items can be selected based on the current ability estimate.
• Examinees can be given the items that maximize the information (within constraints) about their ability levels from the item responses.
• Examinees will receive few items that are very easy or very hard for them (little information is gained from such items).
• Reduced standard errors (SE(θ) = 1/sqrt(I(θ))) and greater precision are achieved with only a handful of properly selected items.
The CAT algorithm is usually an iterative process:
Step 1: All the items that have not yet been administered are evaluated to determine which will be the best one to administer next, given the currently estimated ability level.
Step 2: The "best" next item (the one providing the most information) is administered and the examinee responds.
Step 3: A new ability estimate is computed based on the responses to all of the administered items.
Steps 1 through 3 are repeated until a stopping criterion is met.
This is similar to the Newton-Raphson iterative method for solving equations.
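A minimal sketch of these three steps under the Rasch model; the item bank, the simulated examinee, the step cap, and the stopping rule are all illustrative assumptions, not a production CAT engine:

```python
import math
import random

def p(theta, b):                      # Rasch response probability
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def info(theta, b):                   # Rasch item information: P(1 - P)
    q = p(theta, b)
    return q * (1.0 - q)

# Illustrative setup: a 200-item Rasch bank and one simulated examinee.
random.seed(0)
bank = [{"b": random.uniform(-3, 3), "used": False} for _ in range(200)]
true_theta = 0.7
theta, responses = 0.0, []

for step in range(60):
    # Step 1: evaluate the unused items; pick the most informative at theta-hat.
    item = max((i for i in bank if not i["used"]),
               key=lambda i: info(theta, i["b"]))
    item["used"] = True
    # Step 2: administer it (simulated here by sampling a response).
    u = 1 if random.random() < p(true_theta, item["b"]) else 0
    responses.append((item["b"], u))
    # Step 3: re-estimate ability from all responses so far (Newton-Raphson).
    for _ in range(10):
        f = sum(u_i - p(theta, b_i) for b_i, u_i in responses)
        fp = -sum(info(theta, b_i) for b_i, u_i in responses)
        theta -= f / fp
    theta = max(-6.0, min(6.0, theta))   # guard: the MLE is infinite while all
                                         # responses are correct (or all wrong)
    se = 1.0 / math.sqrt(sum(info(theta, b_i) for b_i, u_i in responses))
    if se < 0.33:                        # stop once precision is adequate
        break

print(f"items given: {step + 1}, theta-hat: {theta:+.2f}, SE: {se:.2f}")
```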
Reliability and standard error
• In classical measurement, with a test reliability of 0.90, the standard error of measurement is about .33 of the standard deviation of the examinee test scores.
• In item response theory-based measurement, when ability scores are scaled to a mean of zero and a standard deviation of one (which is common), this level of reliability corresponds to a standard error of about .33 and test information of about 10.
• Thus it is common in practice to design CATs so that the standard errors are about .33 or smaller (or, correspondingly, so that the test information exceeds 10).
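The arithmetic connecting these three quantities, under the stated unit-standard-deviation scaling:

```python
import math

rho = 0.90                        # classical test reliability
sem = math.sqrt(1.0 - rho)        # SEM in examinee-SD units: sigma * sqrt(1 - rho)
required_info = 1.0 / sem ** 2    # since SE(theta) = 1 / sqrt(I(theta))
print(f"SEM = {sem:.2f}, required test information = {required_info:.1f}")
# -> SEM = 0.32, required test information = 10.0
```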
Potential of computer adaptive tests
• Tests are given "on demand" and scores are available immediately.
• Neither answer sheets nor trained test administrators are needed.
• Test administrator differences are eliminated as a factor in measurement error.
• Tests are individually paced.
• Test security may be increased.
• Computerized testing offers a number of options for timing and formatting.
• CATs can reduce testing time by more than 50% while maintaining the same level of reliability. Shorter testing times also reduce fatigue, a factor that can significantly affect an examinee's test results.
• CATs can provide accurate scores over a wide range of abilities, while traditional tests are usually most accurate for average examinees.
CAT and IRT are a powerful combination.
Limitations of CAT
•CATs are not applicable for all subjects and skills.
•Most CATs are based on an IRT model, yet IRT is not applicable to all
skills and item types.
•Hardware limitations
•Items involving detailed art work and graphs or extensive reading
passages, for example, may be hard to present.
•CATs require careful item calibration.
•CATs are only manageable if a facility has enough computers for a large
number of examinees and the examinees are at least partially
computer-literate. This can be a big limitation.
•The test administration procedures are different. This may cause problems for some examinees.
•With each examinee receiving a different set of questions, there can be perceived inequities.
•Examinees are not usually permitted to go back and change answers.

References
Baker, F.B. The Basics of Item Response Theory.
Birnbaum, A. "Some latent trait models and their use in inferring an examinee's ability." Part 5 in F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley, 1968.
Hambleton, R.K., and Swaminathan, H. Item Response Theory: Principles and Applications. Hingham, MA: Kluwer-Nijhoff, 1984.
Hulin, C.L., Drasgow, F., and Parsons, C.K. Item Response Theory: Application to Psychological Measurement. Homewood, IL: Dow Jones-Irwin, 1983.
Lord, F.M. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Erlbaum, 1980.
Mislevy, R.J., and Bock, R.D. PC-BILOG 3: Item Analysis and Test Scoring with Binary Logistic Models. Mooresville, IN: Scientific Software, Inc., 1986.
Wright, B.D., and Mead, R.J. BICAL: Calibrating Items with the Rasch Model. Research Memorandum No. 23. Statistical Laboratory, Department of Education, University of Chicago, 1976.
Wright, B.D., and Stone, M.A. Best Test Design. Chicago: MESA Press, 1979.
NCME Instructional Modules

Thank You

What is IRT? – DIF/DTF
•Classical test theory methods confound "bias" with true mean differences; IRT does not. In IRT terminology, item/test bias is referred to as DIF/DTF (Differential Item/Test Functioning).
•DIF refers to a difference in the probability of endorsing an item for members of a reference group (e.g., US workers) and a focal group (e.g., Chinese workers) who have the same standing on theta.
•DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.
•DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.
•If DIF is detected, IRT can control for item bias when estimating scores.

What is IRT? – DIF Examples
[Figure: two plots of the probability of a positive response against theta (-3 to +3) for a reference group and a focal group. Left panel, "Uniform DIF Against Focal Group": the reference group is favored at all levels of theta. Right panel, "Nonuniform (Crossing) DIF": the focal group is favored at low theta and the reference group at high theta.]
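To make the two patterns concrete, here is a minimal sketch contrasting them with 2PL item characteristic curves. The item parameters are hypothetical choices, not values from the figure: shifting only the difficulty (b) produces uniform DIF, while changing the discrimination (a) produces crossing DIF.

import math

def icc(theta, a, b):
    # Two-parameter logistic item characteristic curve
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters: same a, harder b for the focal group -> uniform DIF;
# same b, different a for the two groups -> nonuniform (crossing) DIF.
uniform = {"ref": (1.2, 0.0), "focal": (1.2, 0.7)}
crossing = {"ref": (1.5, 0.0), "focal": (0.6, 0.0)}

for theta in (-2.0, 0.0, 2.0):
    u_gap = icc(theta, *uniform["ref"]) - icc(theta, *uniform["focal"])
    c_gap = icc(theta, *crossing["ref"]) - icc(theta, *crossing["focal"])
    print(f"theta={theta:+.1f}  uniform gap={u_gap:+.3f}  crossing gap={c_gap:+.3f}")

# The uniform gap stays positive (reference favored everywhere), while the
# crossing gap changes sign around theta = 0, as in the figure.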

What is IRT? – DIF Detection
•DIF
  –Parametric
    •Lord's Chi-Square
    •Likelihood Ratio Test
    •Signed and Unsigned Area Methods
  –Nonparametric
    •SIBTEST
    •Mantel-Haenszel
•DTF
  –Parametric
    •Raju's DFIT Method
  –Nonparametric
    •SIBTEST

What is IRT? – Some simple models
Let \theta_j represent the latent (factor) score for individual j, and let \pi_{ij} be the probability that individual j responds correctly to item i. Then a simple item response model is

\mathrm{logit}(\pi_{ij}) = a_i(\theta_j - b_i)

This is just a 'binary response' factor analysis model.

What is IRT? Some simple models?
Lawley (1944) really started it off.
Lord (1980) promoted the term 'item response theory' as opposed to 'classical item analysis'.
Technical elaborations include:
  'parameters' for 'guessing'
  Partial credit (degrees of correctness) responses
  Multidimensional models
BUT the 'workhorse' is still the Lord model (with the factor assumed to be a random rather than fixed variable), as follows:

\Pr(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp[-a_i(\theta_j - b_i)]},
\qquad \theta_j \sim N(0, 1)

What is IRT ?What is IRT ?--Classical to IRTClassical to IRT
••Classical is really an item response modelClassical is really an item response model
(IRM)(IRM)––
A reasonable (consistent) estimate of (aA reasonable (consistent) estimate of (a
random variablerandom variable--so in red) is given by theso in red) is given by the
‘raw score’ i.e. percentage (or total) of correct‘raw score’ i.e. percentage (or total) of correct
items.items.
••A somewhat more efficient estimate is given byA somewhat more efficient estimate is given by
a weighted percentage, using the as weights.a weighted percentage, using the as weights.
••The Lord model is simply:The Lord model is simply:

What is IRT? Item Response Relationship
For a single item in a test, the item response function is a sigmoid. The sigmoid function is a type of logistic function. The general form of the logistic function is

f(t) = a\,\frac{1 + m e^{-t/\tau}}{1 + n e^{-t/\tau}}

The special case of the logistic function with a = 1, m = 0, n = 1, \tau = 1 is

P(t) = \frac{1}{1 + e^{-t}}

The logistic function is the inverse of the natural logit function and so can be used to convert the logarithm of odds into a probability; the conversion from the log-likelihood ratio of two alternatives also takes the form of a sigmoid curve.
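As a quick numerical check of that inverse relationship (a minimal sketch; the function names are chosen here for clarity):

import math

def logistic(x):
    # inverse of the logit: log-odds -> probability
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    # probability -> log-odds
    return math.log(p / (1.0 - p))

for p in (0.1, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12
print("logistic(log-odds) recovers the original probability")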

What is IRT? Rasch Model
Here the 'discrimination' is assumed to be the same for each item:

\mathrm{logit}\,\Pr(y_{ij} = 1) = \theta_j - b_i

The resulting (maximum likelihood) factor score estimates are then a 1-1 transformation of the raw scores.
The Rasch model is a special case and will often not fit the data very well.
We can add further predictors, for example social background, that may mediate the relationships.
Item response practitioners go further: if we assume that the item parameters (a_i, b_i) are the same across populations and, for example, tests, then we can form common scales for different populations and different tests.
132132
Here the ‘discrimination’ is assumed to be the same for each item
The resulting (maximum likelihood) factor score estimates are then a 1–
1 transformation of the raw scores.
Rasch Model is a special case and will often not fit the data very well.
We can add further predictors, for example social background, that may
mediate the relationships.
Item response practitioners go further:
If we assume that the item parameters ( ) are the same across
populations, and, for example, tests, then we can form common
scales for different populations and different tests.
iiba,
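The 1-1 relationship with raw scores holds because, under this model, the raw score r_j = \sum_i y_{ij} is a sufficient statistic for ability:

\Pr(\mathbf{y}_j \mid \theta_j)
= \frac{\exp\!\big(r_j \theta_j - \sum_i y_{ij} b_i\big)}
       {\prod_i \big(1 + \exp(\theta_j - b_i)\big)}

\theta_j enters the likelihood only through r_j, so examinees with the same raw score receive the same ability estimate.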

What is IRT? Multidimensional Model
We can generalise our logistic model as follows:

\mathrm{logit}\,\Pr(y_{ij} = 1) = a_{i1}\theta_{1j} + a_{i2}\theta_{2j} - b_i

Adding a further factor allows an individual to be characterised by two underlying traits.
A sensible analysis will explore the dimensionality structure of a set of item responses.
Assumptions are needed, for example that the factors are independent, or alternatively that they are correlated but each item has a non-zero coefficient (loading) on only one factor, or an intermediate assumption.
What are the consequences of a more complex structure?

What is IRT? Multidimensional Model
It allows a more faithful representation of multi-faceted achievement.
It allows the (multidimensional) structure of achievement to be compared among groups or populations, in the following ways:
  The correlations between factors can vary.
  The values of loadings can vary.
  The factor scores can be allowed to depend on further variables, such as gender, and the resulting 'regressions' may vary. For example:

\theta_j = \beta_0 + \beta_1\,(\mathrm{gender})_j + u_j

With extensions to multilevel modelling etc., this becomes a 'structural equation model'.

Partial Credit Model Variation
Object       |  Item 1         Item 2
Person A >>> | << Item 1.3
             | << Item 1.2     <<<< Item 2.2
             |
Person B >>> | <<<<<<<<<<<< Item 2.1
             | << Item 1.1
Central line = scale at interval level of measurement

Partial Credit Model Variation
Items:
  Item 1.3 refers to Item 1, Category 3; Item 1.2 refers to Item 1, Category 2; and so on.
  The difficulty associated with Category 3 of Item 1 is greater than the difficulty associated with Category 2 of Item 1, and so on (ordered categories).
  The location of Item 1.3 on the scale indicates the ability associated with a 50% probability of passing Category 3 of Item 1 (or any of the lower categories).
Persons:
  The location of Person A on the scale indicates his ability.
  The probability of Person A passing categories at a lower level of difficulty is more than 50%.
  The probability of Person A passing categories at a higher level of difficulty is less than 50%.
  The probability of Person A passing categories at a level of difficulty that is the same as his ability is exactly 50%.
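Those 50% statements can be checked numerically. Below is a minimal sketch under a simple Rasch-style reading of the map, in which the probability of passing a category threshold is logistic in (theta - difficulty); the threshold locations and Person A's ability are hypothetical values chosen to mirror the picture, not values from the slides.

import math

def p_pass(theta, delta):
    # Probability of passing a threshold located at delta,
    # logistic in the distance (theta - delta)
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Hypothetical threshold locations (in logits) read off the map
thresholds = {"Item 1.1": -1.5, "Item 2.1": -0.8,
              "Item 1.2": 0.2, "Item 2.2": 0.2, "Item 1.3": 1.4}
theta_A = 1.4  # Person A located at the level of Item 1.3

for name, delta in thresholds.items():
    print(f"P(Person A passes {name}) = {p_pass(theta_A, delta):.2f}")

# Thresholds below theta_A give probabilities above .50; the threshold
# at theta_A itself gives exactly .50, matching the statements above.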