Reliability in Psych Assessments ppt.
- includes types of reliability and their functions
- how to determine the reliability of a test
Slide Content
UNIT VI
RELIABILITY
Presented by:
Ishi Faith M. Bautista
Crystal Marian S. Ruiz
PSYCH-3B
OBJECTIVES
After discussing this unit, you will be able to:
Learn about reliability, including its history, theories, and types.
Know the conceptualization of error in reliability.
Understand the sources of error and models of reliability.
Learn about the nature of reliability, including its interpretation.
Understand the purpose and importance of reliability.
CONTENT OUTLINE
History and Theory of Reliability
The Domain Sampling Model
Item Response Theory
Models of Reliability
Other Methods of Estimating Internal Consistency
KR20 Formula
Coefficient alpha
Measures of Inter-Scorer Reliability
Using and Interpreting a Coefficient of Reliability
The Purpose of the Reliability Coefficient
The Nature of the Test
Connecting Sources of Error with Reliability Assessment Method
HISTORY AND THEORY OF RELIABILITY
WHAT IS RELIABILITY?
Reliability is a synonym for dependability or consistency.
In the language of psychometrics, reliability refers to consistency in measurement.
Whereas in everyday conversation reliability always connotes something positive, in the psychometric sense it really only refers to something that is consistent: not necessarily consistently good or bad, but simply consistent.
INTERPRETING TEST
RELIABILITY
A reliability coefficient represents the proportion of total variance
that measures true score differences among the subjects.
A reliability coefficient can range from a value of 0.0 (all the variance is measurement error) to a value of 1.00 (no measurement error). In reality, all tests have some error, so reliability is never 1.00.
A test with high reliability is desired, because lower reliability indicates that a large proportion of test variance is measurement error.
If test reliability is 0, and test scores are used to assign grades, a student's grade would be assigned purely by chance, similar to flipping a coin or rolling dice!
CONCEPTUALIZATION OF ERROR
Reliability estimation methods aim to quantify the proportion of variability in test scores that is
due to true differences in the trait being measured versus the proportion that is due to error. A
reliable test minimizes the impact of error, ensuring that test scores accurately reflect the
underlying trait or characteristic.
Two types:
Random Error (Unsystematic Error) - Random error is the variability in test scores that is
unpredictable and unrelated to the trait being measured. It can be caused by factors such as
temporary distractions, fluctuations in attention or motivation, or chance factors in test
administration.
Systematic Error (Bias) - Systematic error refers to consistent, non-random variability in test
scores that is associated with specific characteristics of the test or the testing situation. This
type of error can arise from flaws in the test design, ambiguous wording of test items, or cultural
biases embedded in the test content. Systematic error tends to produce consistent patterns of
distortion in test scores, leading to a lack of accuracy in measurement.
SPEARMAN’S EARLY STUDIES
Charles E. Spearman (born September 10, 1863, London, England; died September 17, 1945, London) was a British psychologist known for his work in the early 20th century. One of his most significant contributions was his development of factor analysis and his theory of intelligence.
SPEARMAN’S EARLY STUDIES
Factor Analysis: Spearman is credited with developing factor
analysis as a statistical technique. Factor analysis is a method
used to identify underlying factors or dimensions that explain
patterns of correlations within a set of variables. In the context
of intelligence research, Spearman used factor analysis to
examine the relationships between various mental abilities.
SPEARMAN’S EARLY STUDIES
General Intelligence (g): One of Spearman's key insights was
the proposal of general intelligence, often denoted as "g". He
argued that while individuals may excel in specific areas of
cognition, such as verbal reasoning, spatial ability, or memory,
there is also a common underlying factor that contributes to
overall intellectual performance. This general factor,
according to Spearman, influences performance across
diverse cognitive tasks.
SPEARMAN’S EARLY STUDIES
Two-Factor Theory: In addition to general intelligence,
Spearman proposed a two-factor theory of intelligence. He
suggested that each individual's cognitive abilities are
influenced by both general intelligence (g) and specific
abilities (s). General intelligence represents the common
factor underlying performance on all cognitive tasks, while
specific abilities account for individual differences in
particular areas of cognition.
CLASSICAL TEST THEORY/TEST SCORE THEORY
In classical test theory, a score on an ability test is presumed to reflect not only the test taker's true score on the ability being measured but also error. In its broadest sense, error refers to the component of the observed test score that does not concern the test taker's ability. If we use X to represent an observed score, T to represent a true score, and E to represent error, then the fact that an observed score equals the true score plus error may be expressed as follows:
X = T + E
CLASSICAL TEST THEORY/TEST SCORE THEORY
Total variance consists of: True Variance (true differences) and Error Variance (random sources).
True Variance + Error Variance = Total Variance
EXAMPLE
To illustrate, suppose the SD of a test administered to students was 2.0 (thus, total test variance = 2.0^2 = 4.0). All of the students guessed on every question, which means that getting the questions correct was due to luck or chance (σe2 = 4.0). Since guessing is completely random and has nothing to do with ability (true score), there would be no true-score variance (σt2 = 0) and the components would be:
Total Variance = True Score Variance + Error Variance: 4.0 = 0.0 + 4.0
Test reliability (Rxx) is calculated from these variances:
Rxx = (Total Variance - Error Variance) / Total Variance
For this example, test reliability would be: Rxx = (4.0 - 4.0)/4.0 = 0/4.0 = 0.0
The reliability of this test is 0, which means that all test variance was due to measurement error. The test did not measure anything.
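The arithmetic above translates directly into code. Here is a minimal sketch (not from the slides; the values mirror the example) that computes reliability as the proportion of total variance that is true-score variance:

```python
def reliability(total_variance: float, error_variance: float) -> float:
    """Reliability = true variance / total variance = (total - error) / total."""
    return (total_variance - error_variance) / total_variance

# Values from the example: SD = 2.0, so total variance = 4.0;
# every answer was a guess, so error variance = 4.0.
print(reliability(4.0, 4.0))  # 0.0 -- the test measured nothing
```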
DOMAIN SAMPLING
MODEL
Domain sampling theory, also known as domain-referenced testing, is a
framework used in psychometrics to assess individuals' proficiency or
competence in specific domains or content areas.
Domains: In domain sampling theory, a domain refers to a specific area
of knowledge, skill, or ability that is being assessed by a test. Domains
can vary widely depending on the context of the assessment, such as
academic subjects (e.g., mathematics, language arts), professional
competencies (e.g., nursing skills, management abilities), or job-related
tasks (e.g., customer service, problem-solving).
Test Blueprint: A test blueprint, also known as a test specification or content
outline, is a document that outlines the content areas or domains to be
covered by the test. It specifies the relative weight or importance of each
domain in the overall test score. Test blueprints are typically developed
based on expert judgment, curriculum standards, job analysis, or other
relevant criteria.
Sampling Process: Domain sampling theory emphasizes the importance of
sampling content from each domain to ensure adequate coverage of the
construct being measured. The sampling process involves selecting
representative items or tasks from each domain that collectively assess the
full range of knowledge, skills, or abilities within that domain. This sampling
strategy helps enhance the content validity of the test by ensuring that it
accurately reflects the content domain.
Item Development: Items or test questions are developed to measure individuals'
proficiency within each domain. These items are designed to assess various cognitive
processes, such as recall, comprehension, application, analysis, synthesis, and evaluation.
Item development may involve the use of item-writing guidelines, item-analysis techniques,
and cognitive modeling approaches to ensure the quality and fairness of the items.
Scoring: Domain sampling theory often involves scoring methods that take into account the
relative importance of different domains in the overall assessment. For example, weighted
scoring may be used to assign different weights or scores to items from each domain
based on the test blueprint. Alternatively, domain-based scoring may be employed, where
individuals receive separate scores for each domain assessed by the test.
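To make the weighted-scoring idea concrete, here is a hedged sketch; the domains, blueprint weights, and scores are hypothetical and only illustrate multiplying each domain score by its blueprint weight:

```python
# Hypothetical test blueprint: domain -> relative weight (weights sum to 1.0)
blueprint = {"algebra": 0.40, "geometry": 0.35, "statistics": 0.25}

# Hypothetical examinee results, as proportion correct within each domain
domain_scores = {"algebra": 0.80, "geometry": 0.60, "statistics": 0.90}

# Weighted total: each domain contributes according to the blueprint
weighted_total = sum(blueprint[d] * domain_scores[d] for d in blueprint)
print(round(weighted_total, 3))  # 0.755
```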
Interpretation: Test scores derived from domain sampling theory are typically interpreted
in relation to individuals' proficiency within specific content domains. Interpretation may
involve comparing individuals' scores to established performance standards, benchmark
scores, or criterion levels of proficiency within each domain. This allows for more targeted
feedback and decision-making regarding individuals' strengths and weaknesses in specific
areas.
example:
Let's say we're assessing someone's overall level of happiness. According to the
Domain Sampling Model, happiness is not just a single trait but a complex construct
made up of different domains or aspects. These domains could include family life,
social relationships, work satisfaction, and personal interests, among others.
To assess happiness using the Domain Sampling Model, we would include questions or
tasks that sample from each of these domains. For example:
Family Life: "How satisfied are you with your relationship with your family members?"
Social Relationships: "How often do you spend time with friends or loved ones?"
Work Satisfaction: "On a scale from 1 to 10, how satisfied are you with your current
job?"
Personal Interests: "How often do you engage in activities that you enjoy outside of
work or school?"
By collecting information from each of these domains, we can get a more holistic
understanding of the individual's overall happiness level.
ITEM-RESPONSE
THEORY
Item Response Theory (IRT) is a statistical framework used in
psychometrics to evaluate the relationship between an
individual's ability or trait and their performance on a test or
assessment. It's particularly useful in educational and
psychological measurement.
Item Response Theory allows us to understand how the difficulty of
test items interacts with individuals' abilities or traits to predict their
responses to those items
EXAMPLE
Let's say we have two students, Alice and Bob, who are taking a test.
Alice is very skilled, while Bob is not as skilled. Now, let's say there's a
difficult question on the test that only highly skilled individuals like Alice
can answer correctly. On the other hand, there's an easy question that
both Alice and Bob can answer correctly.
With IRT, we can model the probability of each student answering
each question correctly based on their ability level. For example, Alice
has a high probability of answering the difficult question correctly
because of her high ability level, while Bob has a lower probability of
answering it correctly. Conversely, both Alice and Bob have a high
probability of answering the easy question correctly.
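As an illustration only (not part of the slides), here is a minimal sketch of a one-parameter logistic (Rasch-type) item response function; the ability and difficulty values for "Alice," "Bob," and the two items are made up:

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct response under a 1PL (Rasch-type) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical abilities and item difficulties on the logit scale
alice, bob = 2.0, -0.5
hard_item, easy_item = 1.5, -1.5

print(round(p_correct(alice, hard_item), 2))  # ~0.62: Alice on the hard item
print(round(p_correct(bob, hard_item), 2))    # ~0.12: Bob on the hard item
print(round(p_correct(bob, easy_item), 2))    # ~0.73: even Bob handles the easy item
```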
SOURCES OF ERROR VARIANCE: TEST CONSTRUCTION
One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests. A test taker's score might be higher on one form of a test than on another because of the specific content sampled, the way the items were worded, and so on. The extent to which a test taker's score is affected by the content sampled on the test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance.
SOURCES OF ERROR VARIANCE: TEST
ADMINISTRATION
Sources of error variance that occur during
test administration may influence the test
taker’s attention or motivation. The test
taker’s reactions to those influences are the
source of one kind of error variance.
Examples of untoward influences during the
administration of a test include factors
related to the test environment: the room
temperature, the level of lighting, and the
amount of ventilation and noise, for
instance.
SOURCES OF ERROR VARIANCE: TEST
ADMINISTRATION
Other potential sources of error variance during
test administration are testtaker variables.
Pressing emotional problems, physical
discomfort, lack of sleep, and the effects of
drugs or medication can all be sources of error
variance. A testtaker may, for whatever reason,
make a mistake in entering a test response. For
example, the examinee might blacken a “b” grid
when he or she meant to blacken the “d” grid.
An examinee may simply misread a test item.
SOURCES OF ERROR VARIANCE: TEST
ADMINISTRATION
Examiner-related variables are potential sources of error variance. The
examiner’s physical appearance and demeanor—even the presence or
absence of an examiner—are some factors for consideration here.
On an oral examination, some examiners may unwittingly provide clues by
emphasizing key words as they pose questions. They might convey
information about the correctness of a response through head nodding, eye
movements, or other nonverbal gestures. Clearly, the level of professionalism
exhibited by examiners is a source of error variance.
SOURCES OF ERROR VARIANCE: TEST SCORING AND INTERPRETATION
The advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences in many tests. However, not all tests can be scored from grids blackened by No. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, and countless other tests still require hand scoring by trained personnel.
OTHER SOURCES OF ERROR
Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error.
MODELS OF
RELIABILITY
Time Sampling: The Test–Retest Method
Item Sampling: Parallel Forms and Alternate-Forms
Method
Split-Half Reliability
The Spearman-Brown formula
TIME SAMPLING: THE TEST–
RETEST METHOD
Test-retest reliability is an estimate of reliability obtained by correlating pairs
of scores from the same people on two different administrations of the same
test. The test-retest measure is appropriate when evaluating the reliability of a
test that purports to measure something that is relatively stable over time,
such as a personality trait.
example: An estimate of test-retest reliability from a personality profile
might be low if the test taker suffered some emotional trauma or received
counseling during the intervening period. A low estimate of test-retest
reliability might be found even when the interval between testings is
relatively brief.
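Computationally, a test-retest estimate is simply the correlation between the two administrations. A minimal sketch with hypothetical score pairs:

```python
import numpy as np

# Hypothetical scores for the same five people at time 1 and time 2
time1 = np.array([10, 14, 18, 22, 30])
time2 = np.array([12, 13, 19, 24, 28])

test_retest_r = np.corrcoef(time1, time2)[0, 1]  # Pearson r between administrations
print(round(test_retest_r, 3))
```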
ITEM SAMPLING: PARALLEL FORMS
AND ALTERNATE-FORMS METHOD
The degree of the relationship between
various forms of a test can be
evaluated by means of an alternate-
forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence. Although
frequently used interchangeably, there
is a difference between the terms
alternate forms and parallel forms.
ITEM SAMPLING: PARALLEL FORMS
AND ALTERNATE-FORMS METHOD
Parallel forms of a test exist when, for each form of the test, the
means and the variances of observed test scores are equal. In
theory, the means of scores obtained on parallel forms correlate
equally with the true score. More practically, scores obtained on
parallel tests correlate equally with other measures.
example: a long quiz whose scores have the same mean and variance as the midterm exam
ITEM SAMPLING: PARALLEL FORMS
AND ALTERNATE-FORMS METHOD
Alternate forms are simply different versions of a test that have
been constructed so as to be parallel. Although they do not meet
the requirements for the legitimate designation “parallel,” alternate
forms of a test are typically designed to be equivalent with respect
to variables such as content and level of difficulty.
example: a long quiz that covers the same content at a similar difficulty as the midterm exam, but without being strictly parallel
SPLIT-HALF RELIABILITY
In split-half reliability, the test or measurement instrument is divided into two
halves or subsets of items, and the scores on each half are compared. The
division of items can be done in various ways, such as odd-even split (e.g.,
comparing the scores of odd-numbered items with even-numbered items)
or random split (e.g., randomly assigning items to two halves).
The split-half reliability coefficient is then calculated to determine the
consistency of scores between the two halves. The most common coefficient
used is the Pearson correlation coefficient, which ranges from -1 to +1. A
higher correlation coefficient indicates greater internal consistency or
reliability.
Pearson Product-Moment Correlation:
The Pearson correlation coefficient is first calculated between the scores on the two halves of the scale. It measures the linear relationship between the two sets of scores and ranges from -1 to +1. Because this correlation reflects the reliability of only half of the test, it is then adjusted to estimate the reliability of the full-length test:
Split-half reliability coefficient = 2 * (r / (1 + r))
where
"r" represents the Pearson correlation coefficient between the scores on the two halves of the scale.
Spearman-Brown Prophecy Formula:
This formula is used to estimate the reliability of a full-length scale from the reliability of a shorter version of the scale. It is commonly applied when the scale has been divided into two halves and the correlation between the halves (r) is known.
The formula is as follows:
Split-half reliability coefficient (full scale) = (2 * r) / (1 + r)
where
"r" represents the correlation between the two halves obtained from the split-half analysis.
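A minimal sketch of an odd-even split-half estimate, assuming a small hypothetical item matrix (rows are examinees, columns are items); the correlation between the halves is stepped up with the 2r / (1 + r) correction described above:

```python
import numpy as np

# Hypothetical 0/1 item scores: five examinees, six items
scores = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
])

odd_half = scores[:, 0::2].sum(axis=1)   # totals on items 1, 3, 5
even_half = scores[:, 1::2].sum(axis=1)  # totals on items 2, 4, 6

r_halves = np.corrcoef(odd_half, even_half)[0, 1]
split_half_reliability = 2 * r_halves / (1 + r_halves)  # Spearman-Brown step-up
print(round(split_half_reliability, 3))
```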
THE SPEARMAN-BROWN FORMULA
The Spearman-Brown formula is used to predict the reliability of a test after
changing the length of the test.
The formula is:
Predicted reliability = kr / (1 + (k-1)r)
where:
k: Factor by which the length of the test is changed. For example, if original test
is 10 questions and new test is 15 questions, k = 15/10 = 1.5.
r: Reliability of the original test. We typically use Cronbach’s Alpha for this,
which is a value that ranges from 0 to 1 with higher values indicating higher
reliability.
THE SPEARMAN-BROWN FORMULA
Suppose a company uses a 15-item test to assess employee satisfaction and the
test is known to have a reliability of 0.74.
If the company increases the length of the test to 30 items, what is the predicted
reliability of the new test?
We can use the Spearman-Brown formula to calculate the predicted reliability:
Predicted reliability = kr / (1 + (k-1)r)
Predicted reliability = 2*.74 / (1 + (2-1)*.74)
Predicted reliability = 0.85
The new test has a predicted reliability of 0.85.
Note: We calculated k as 30/15 = 2.
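The same worked example as a small function; the numbers mirror the slide above:

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability after changing test length by a factor of k."""
    return (k * r) / (1 + (k - 1) * r)

# 15-item test with reliability 0.74, lengthened to 30 items (k = 30/15 = 2)
print(round(spearman_brown(0.74, 2), 2))  # 0.85
```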
OTHER METHODS OF ESTIMATING
INTERNAL CONSISTENCY
Inter-item consistency refers to the degree of correlation among all the items on a
scale. A measure of inter-item consistency is calculated from a single administration of
a single form of a test. An index of inter-item consistency, in turn, is useful in
assessing the homogeneity of the test.
The more homogeneous a test is, the more inter-item consistency it can be expected
to have. Because a homogeneous test samples a relatively narrow content area, it can be expected to contain more inter-item consistency than a heterogeneous test. Test
homogeneity is desirable because it allows relatively straightforward test-score
interpretation. Test-takers with the same score on a homogeneous test probably have
similar abilities in the area tested. Test-takers with the same score on a more
heterogeneous test may have quite different abilities.
Tests are said to be homogeneous if they contain items that measure a
single trait. As an adjective used to describe test items, homogeneity
(derived from the Greek words homos, meaning “same,” and genos,
meaning “kind”) is the degree to which a test measures a single factor.
In other words, homogeneity is the extent to which items in a scale are
unifactorial. In contrast to test homogeneity, heterogeneity describes the
degree to which a test measures different factors.
A heterogeneous (or nonhomogeneous) test is composed of items that
measure more than one trait.
A test that assesses knowledge only of color television repair skills could
be expected to be more homogeneous in content than a test of
electronic repair.
KUDER–RICHARDSON FORMULA
Kuder-Richardson formula 20, or KR-20, is so named because it was the twentieth formula developed in a series. Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar. However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).
KR-20 = [k / (k - 1)] * [1 - (Σpq / S²)]
where k is the number of items, p is the proportion of test takers answering an item correctly, q = 1 - p, Σpq is the sum of the pq products over all items, and S² is the variance of the total test score.
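A minimal sketch of the KR-20 computation for dichotomous items; the response matrix is hypothetical, and the total-score variance uses the n-denominator convention so that it matches the item p*q terms:

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 for a matrix of 0/1 item scores (rows = examinees, columns = items)."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                          # proportion passing each item
    q = 1 - p
    total_variance = scores.sum(axis=1).var(ddof=0)  # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_variance)

# Hypothetical right/wrong responses: five examinees, four items
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(round(kr20(responses), 3))
```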
An approximation of KR-20 can be obtained by the use of the twenty-first
formula in the series developed by Kuder and Richardson, a formula known
as—you guessed it—KR-21. The KR-21 formula may be used if there is reason
to assume that all the test items have approximately the same degree of
diffi culty. Let’s add that this assumption is seldom justifi ed. Formula KR-21
has become outdated in an era of calculators and computers. Way back
when, KR-21 was sometimes used to estimate KR-20 only because it
required many fewer calculations.
COEFFICIENT ALPHA
A formula that estimates the internal consistency of tests in which the items are not scored as 0 or 1 (right or wrong).
Coefficient alpha = [k / (k - 1)] * [1 - (Σs² / S²)]
Note: As you may notice, this looks quite similar to the KR20 formula. The only difference is that Σpq has been replaced by Σs², the sum of the individual item variances.
Note: The summation sign (Σ) informs us that we are to sum the individual item variances. S² is the variance of the total test score.
Coefficient alpha is a more general reliability coefficient than KR20 because it can describe the variance of items whether or not they are in a right-wrong format.
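A minimal sketch of coefficient alpha for items scored on any scale; the rating data are hypothetical, and Σpq from KR-20 is simply replaced by the sum of the individual item variances:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for an item-score matrix (rows = examinees, columns = items)."""
    k = scores.shape[1]
    sum_item_variances = scores.var(axis=0, ddof=0).sum()
    total_variance = scores.sum(axis=1).var(ddof=0)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical 1-5 ratings: five examinees, four items
ratings = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 3, 4, 4],
])
print(round(cronbach_alpha(ratings), 3))
```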
Factor analysis deals with or measures several
different characteristics.
MEASURES OF INTER-SCORER
RELIABILITY
Inter-scorer reliability is the degree of agreement or consistency between
two or more scorers (or judges or raters) with regard to a particular
measure. References to levels of inter-scorer reliability for a particular test
may be published in the test’s manual or elsewhere. If the reliability
coefficient is high, the prospective test user knows that test scores can be
derived in a systematic, consistent way by various scorers with sufficient
training. A responsible test developer who is unable to create a test that
can be scored with a reasonable degree of consistency by trained scorers
will go back to the drawing board to discover the reason for this problem.
USING AND INTERPRETING A
COEFFICIENT OF RELIABILITY
There are basically three approaches to the estimation of reliability: (1) test-retest,
(2) alternate or parallel forms, and (3) internal or inter-item consistency. The
method or methods employed will depend on a number of factors, such as the
purpose of obtaining a measure of reliability.
Test-retest – measuring a property that you expect to stay the same over
time.
Alternate or Parallel Forms – using two different tests to measure the same
thing.
Internal or inter-item consistency – using a multi-item test where all the items
are intended to measure the same variable.
As a rule of thumb, it may be useful to think of
reliability coefficients in a way that parallels many
grading systems: In the .90s rates a grade of A
(with a value of .95 or higher for the most important
types of decisions), in the .80s rates a B (with
below .85 being a clear B), and anywhere from .65
through the .70s rates a weak, “barely passing”
grade that borders on failing (and unacceptable).
THE PURPOSE OF THE RELIABILITY
COEFFICIENT
If a specific test of employee performance is
designed for use at various times over the course
of the employment period, it would be
reasonable to expect the test to demonstrate
reliability across time. It would thus be desirable
to have an estimate of the instrument’s test-
retest reliability. For a test designed for a single
administration only, an estimate of internal
consistency would be the reliability measure of
choice. If the purpose of determining reliability is
to break down the error variance into its parts, as
shown in Figure 5–1 , then a number of reliability
coefficients would have to be calculated.
THE NATURE OF THE TEST
Closely related to considerations concerning the purpose and
use of a reliability coefficient are those concerning the nature of
the test itself. Included here are considerations such as whether
(1) the test items are homogeneous or heterogeneous in
nature; (2) the characteristic, ability, or trait being measured
is presumed to be dynamic or static; (3) the range of test
scores is or is not restricted; (4) the test is a speed or a power
test; and (5) the test is or is not criterion-referenced.
CONNECTING SOURCES OF ERROR WITH
RELIABILITY ASSESSMENT METHOD
Standard Errors of Measurement and the Rubber Yardstick
Psychologists working with unreliable tests are like carpenters working with
rubber yardsticks that stretch or contract and misrepresent the true length of a
board. However, just as not all rubber yardsticks are equally inaccurate, not all psychological tests are equally inaccurate.
The standard error of measurement allows us to estimate the degree to which
a test provides inaccurate readings; that is, it tells us how much “rubber” there
is in a measurement.
Note: The larger the standard error of measurement, the less certain we can be
about the accuracy with which an attribute is measured.
Standard error of a score (the name given to the standard error of measurement in other books)
We never know whether an observed score is the “true” score. However, we can
form intervals around an observed score and use statistical procedures to
estimate the probability that the true score falls within a certain interval.
Note: Common intervals used in testing are the 68% interval, the 95% interval,
and the 99% interval. These intervals are created using Z scores.
Tests with more measurement error include more "rubber." In other words, the
larger the standard error of measurement, the larger the confidence interval.
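A minimal sketch, assuming the usual definition SEM = SD * sqrt(1 - reliability) and a normal-theory interval around the observed score; the numbers are purely illustrative:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_interval(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Interval around an observed score (z = 1.96 gives roughly a 95% interval)."""
    margin = z * sem(sd, reliability)
    return observed - margin, observed + margin

# Illustrative values: test SD = 15, reliability = .90, observed score = 110
print(round(sem(15, 0.90), 2))        # 4.74
print(score_interval(110, 15, 0.90))  # roughly (100.7, 119.3)
```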
How reliable is Reliable?
It depends on the type of test used.
Reliability estimates in the range of .70 and .80 are good enough for most purposes in
basic research.
Some people have argued that it would be a waste of time and effort to refine research
instruments beyond a reliability of .90. In fact, it has even been suggested that reliabilities
greater than .95 are not very useful because they suggest that all of the items are testing
essentially the same thing and that the measure could easily be shortened.
In clinical settings, high reliability is extremely important. A test with a reliability of .90
might not be good enough. For a test used to make a decision that affects some
person’s future, evaluators should attempt to find a test with a reliability greater than
.95.
The Standard error of measurement is the most useful index of reliability for the
interpretation of individual scores.
In the domain sampling model, the reliability of a test increases as
the number of items increases. (e.g., a doctor diagnosing a patient
may give a reliable diagnosis using multiple questions rather than
one.)
The prophecy formula for estimating how long a test must be to achieve a desired level of reliability is another application of the general Spearman-Brown method, obtained by algebraic manipulation of the general formula.
What to do about Low Reliability?
Psychometric theory offers some options. Two common approaches
are to increase the length of the test and to throw out items that
reduce reliability. Another procedure is to estimate what the true
correlation would have been if the test did not have measurement error.
Increase the Number of Items
To find the number of items required, multiply the number of items on the current test by N from the prophecy formula.
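A minimal sketch of that calculation, using the standard rearrangement of the Spearman-Brown formula; the current and desired reliabilities are illustrative:

```python
import math

def length_factor(current_r: float, desired_r: float) -> float:
    """N from the prophecy formula: how many times longer the test must become."""
    return (desired_r * (1 - current_r)) / (current_r * (1 - desired_r))

current_items, current_r, desired_r = 20, 0.70, 0.90
n = length_factor(current_r, desired_r)
items_needed = math.ceil(current_items * n)
print(round(n, 2), items_needed)  # about 3.86, so roughly 78 items
```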
Factor and Item Analysis
Two approaches are suggested to ensure that the items
measure the same thing:
Factor analysis: tests are most reliable if they are unidimensional.
The factor should account for considerably more of the variance
than any other factor. Items that do not load on this factor might
be best omitted.
Discriminability Analysis (Form of Item Analysis) examines the
correlation between each item and the total score for the test. If
low, the item drags down reliability and should be excluded.
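A minimal sketch of that item-analysis step with hypothetical data: each item is correlated with the total of the remaining items, and items with low correlations are candidates for removal:

```python
import numpy as np

def item_total_correlations(scores: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation for each item (column)."""
    corrs = []
    for i in range(scores.shape[1]):
        rest_total = np.delete(scores, i, axis=1).sum(axis=1)  # total without item i
        corrs.append(np.corrcoef(scores[:, i], rest_total)[0, 1])
    return np.array(corrs)

# Hypothetical 0/1 responses: five examinees, four items
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])
print(np.round(item_total_correlations(responses), 2))
```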
THANK YOU FOR
LISTENING!
Presented by:
Ishi Faith M. Bautista
Crystal Marian S. Ruiz
PSYCH-3B