Uploaded by ichlasiaainulfitri · 70 slides · Sep 15, 2025
Validity and Reliability
Nipa Rojroongwasinkul, Ph.D.
Institute of Nutrition
Mahidol University

Reliability is fundamental to all aspects of
measurement (the accuracy of the actual
measuring instrument or procedure), because
without it we cannot have confidence in the
data we collect, nor can we draw rational
conclusions from those data.
Reliability

1. Test-retest reliability (Stability)
We administer the same test to the same
sample on two different occasions and see
if they yield the same result.
Types of Reliability
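As a minimal sketch, test-retest reliability for interval-scale scores is often summarized by the Pearson correlation between the two occasions (the scores below are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired scores from two test occasions."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for five subjects tested on two occasions
test1 = [10, 12, 14, 16, 18]
test2 = [11, 12, 15, 15, 19]
print(round(pearson_r(test1, test2), 3))  # prints 0.96
```

A value near 1 indicates stable scores across the two administrations.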

♦Test-retest reliability
-Test-retest Intervals
Because the stability of a response
variable is such a significant factor, the time
interval between tests must be considered
carefully. Intervals should be far enough
apart to avoid fatigue, learning, or memory
effects, but close enough to avoid genuine
changes in the measured variable.

-Carryover and Testing Effects
With two or more measures, reliability
can be influenced by the effect of the first test
on the outcome of the second test.
For example, practice or carryover effects can
occur with repeated measurements, changing
performance on subsequent trials.

2. Rater reliability (Reproducibility)
Many clinical measurements require that
a human observer, or rater, be part of the
measurement system. In some cases, the
rater is the actual measuring instrument.
Types of Reliability (cont.)

-Inter-rater reliability
Used to assess the degree to which
different raters/observers give consistent
estimates of the same phenomenon.
♦Rater reliability

-Intra-rater reliability
The stability of data recorded by one
individual across two or more trials.
In a test-retest situation, intra-rater
reliability and test-retest reliability are
essentially the same estimate.
♦Rater reliability (cont.)

3. Alternate forms (Parallel forms) reliability
Used to assess the consistency of the
results of two tests constructed in the same
way from the same content domain.
Types of Reliability (cont.)

Many measuring instruments exist in two or
more versions, called equivalent, parallel, or
alternate forms.
Interchange of these alternate forms can be
supported only by establishing their parallel
reliability.
Alternate forms reliability testing is often used
as an alternative to test-retest reliability with
paper-and-pencil tests, when the nature of the
test is such that subjects are likely to recall
their responses to test items.
Example: Scholastic Aptitude Test (SAT), Graduate
Record Examination (GRE)

Validity
•The ability of a scale to measure what it is
intended to measure (the study's success at
measuring what the researchers set out to
measure)
•The extent to which a measure reflects the
real meaning of the concept under consideration
•The extent to which a measure reflects the
opinions and behaviors of the population under
investigation
•A measure cannot be valid unless it is also reliable

Validity
•Depends on the purpose of the measure
-E.g., a ruler may be a valid measuring
device for length, but isn’t very valid for
measuring volume
•Measuring what ‘it’ is supposed to measure
•Must be inferred from evidence; cannot be
directly measured

1. Face validity
2. Content validity
3. Pragmatic (criterion-related) validity
A. Concurrent validity
B. Predictive validity
4. Construct validity
A. Convergent validity
B. Discriminant validity
Types of Validity

1. Face Validity
Indicates that an instrument appears to
test what it is supposed to; the weakest form
of measurement validity.
This type of validity relies basically upon
the subjective judgment of the researcher.

Face Validity (cont.)
It asks two questions which the researcher
must finally answer in accordance with best
judgment:
(1) Is the instrument measuring what it is
supposed to measure?
(2) Is the sample being measured
adequate to be representative of the behavior
or trait being measured?

2. Content Validity
Indicates that the items that make up an
instrument adequately sample the universe of
content that defines the variable being
measured. Therefore, the instrument does
contain all the elements that reflect the variable
being studied. Most useful with questionnaires,
examinations, and inventories.
This type of validity is sometimes equated
with face validity.

Content Validity (cont.)
For example, if we are interested in the content
validity of questions asked to elicit familiarity
with a certain area of knowledge, content validity
would be concerned with how accurately the
questions asked tend to elicit the information
sought.
There are no statistical indices that can
assess content validity. Claims for content validation
are made by a panel of “experts” who review the
instrument and determine if the questions satisfy the
content domain. This process often requires several
revisions of the test. When all agree that the content
domain has been sampled adequately, content
validity is supported.

is the most practical and objective approach
to validity testing. It is based on using one test to
predict results obtained on an external criterion. The
test to be validated, called the target test, is compared
with a gold standard or criterion measure that is
already established or assumed to be valid.
For example, we can investigate the validity of heart
rate (the target test) as an indicator of energy cost
during exercise by correlating it with values
obtained in standardized oxygen consumption
studies (the criterion measure).
3. Criterion-Related (Pragmatic) Validity

♦Criterion-related validity is separated into 2
components:
1. Concurrent validity
2. Predictive validity
3. Criterion-Related Validity (cont.)

1. Concurrent validity
is studied when the measurement to be
validated and the criterion measure are taken at
relatively the same time (concurrently), so that they
both reflect the same incident of behavior.
Concurrent validity is also useful in situations
where a new or untested tool is potentially more
efficient, easier to administer, more practical, or safer
than another more established method, and is being
proposed as an alternative.
• e.g., Is a new version of an IQ test more efficient
than past versions?
3. Criterion-Related Validity (cont.)

2. Predictive validity
attempts to establish that a measure will be a
valid predictor of some future criterion score.
To assess predictive validity, a target test is
given at one session and is followed by a period of
time after which the criterion score is obtained.
• e.g., Scholastic Aptitude Test (SAT) scores: Do they
predict college GPA?
3. Criterion-Related Validity (cont.)

4. Construct Validity
reflects the ability of an instrument to
measure an abstract concept, or construct.
A construct is an abstract concept,
such as honesty, which cannot be directly
observed or isolated. Construct validation is
concerned with the degree to which the
construct itself is actually measured.

♦Assessing construct validity:
–Convergent validity
–Discriminant (Divergent) validity
4. Construct Validity (cont.)
The construct validity of a test can be
evaluated in terms of how its measures relate
to other tests of the same and different
constructs. In other words, it is important to
determine what a test does measure as well as
what it does not measure.

♦Convergent validity:
–Measuring the same concept with very
different methods
–If different methods yield the same results, then
convergent validity is supported
–e.g., Different survey items used to measure
decision-making style (closed- and open-ended)
• Code for decision-making style from
open-ended responses
• High score on scale = more compensatory
responses
Convergent Validity

We theorize that all four items reflect the idea of self-esteem (this is why I labeled the
top part of the figure Theory). On the bottom part of the figure (Observation) we see
the intercorrelations of the four scale items. This might be based on giving our scale
out to a sample of respondents. You should readily see that the item intercorrelations
for all item pairings are very high (remember that correlations range from -1.00 to
+1.00). This provides evidence that our theory that all four items are related to the
same construct is supported.

♦Discriminant validity:
indicates that different results, or low
correlations, are expected from measures that
are believed to assess different characteristics.
To establish discriminant validity, you need
to show that measures that should not be
related are in reality not related.
E.g., the results of an intelligence test should
not be expected to correlate with results of a
test of gross motor skill
Discriminant (Divergent) Validity

There are four correlations between measures that reflect different
constructs, and these are shown on the bottom of the figure (Observation).
You should see immediately that these four cross-construct correlations are
very low (i.e., near zero) and certainly much lower than the convergent
correlations in the previous figure.
The correlations do provide evidence that the two sets of measures are
discriminated from each other.

♦Internal
–Controlling for other factors in the design
• Validity of structure, sampling, measures,
procedures
• Claims regarding what happened in the study
♦External
–Looking beyond the design to other cases
• Validity of inferences made from the conclusions
• Claims regarding what happens in the real world
Validity & Research Design

Threats to Validity
Threats to Internal Validity
1. History
is a plausible explanation for experimental
effects when some outside event that affects outcome
measures occurs during the course of a study. In any
study that may involve an evaluation, treatment and
further evaluation, some subjects will undergo events
that may affect the final evaluation.
For example, in a study of the effects of an exercise
programme upon hypertension, some of the patients
might take up additional exercise, such as playing
tennis.

2. Maturation
concerns not events, but the mere
passage of time, and therefore may be of
particularly serious concern in health science
research. Maturation refers to time-
dependent internal changes in subjects.
As children grow older certain skills
appear and develop independent of treatment
or training.

3. Testing
If the same instrument is employed in
pre-treatment and post-treatment evaluations,
subjects are more familiar with the instrument
in the post-treatment situation and usually
score higher. These are sometimes called
practice effects.

4. Instrumentation
If the instrument of measurement is a
human observer, the instrument itself may
exhibit practice effects. As they gain
experience, raters tend to perceive phenomena
differently.
Mechanistic instrumentation such as
measurement scales and even physical
equipment may represent instrumentation
threats to validity either if the device is
differentially sensitive at various levels along
the range of measurement or if the equipment
itself requires (and does not receive)
appropriate adjustment and recalibration.

5. Statistical regression
(Regression to the mean)
This is a special effect that originates from the
unreliability of test measures. Often, clinical research
involves the selection of patients for study who have
achieved particularly low or high scores on one or more
measures, for example the most depressed patients or
those with the highest measured cholesterol levels. If you
test these people again, regardless of what you do to them
in the interim, their scores will tend to level off,
moving in the direction of the “average” score.
Thus high initial scores will tend to drop, low initial scores
will tend to increase, and moderate scores will tend to
change very little. This happens because on the second
measurement, the measurement error tends to be less.

5. Statistical regression(Cont.)
If subjects are chosen on the basis of very good or
very poor performance on an initial assessment, they
are likely to include a number of cases in which
measurement error is quite high, and proportionately
few with small measurement error. In such cases the
regression to the mean phenomenon is a possibility,
as on the second measurement there is likely to be
less extreme measurement error, on average.
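A toy simulation (all numbers hypothetical) makes this concrete: subjects selected for extreme first-test scores tend, on average, to score less extremely on retest, even though nothing was done to them:

```python
import random

random.seed(0)
N = 10_000
true_scores = [random.gauss(100, 10) for _ in range(N)]
# Two independent measurements of each subject, each with random error (SD = 10)
test1 = [t + random.gauss(0, 10) for t in true_scores]
test2 = [t + random.gauss(0, 10) for t in true_scores]

# Select the 500 subjects with the highest first-test scores
top = sorted(range(N), key=lambda i: test1[i], reverse=True)[:500]
mean1 = sum(test1[i] for i in top) / len(top)
mean2 = sum(test2[i] for i in top) / len(top)
print(f"first test: {mean1:.1f}, retest: {mean2:.1f}")  # retest mean is noticeably lower
```

The selected group's first-test mean is inflated by favorable measurement error; on retest that error does not recur, so the group mean drops toward the population average.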

6. Selection or assignment errors
concerns the way in which subjects are
placed into experimental groups. The groups
being compared may, due to bad assignment
or selection procedures, be different at the
outset, rather than as a result of any treatment
effects.
This might well happen if the subjects
were not randomly assigned into treatment
groups. Appropriate sampling procedures help
to ensure against this threat to validity.

7. Mortality
deals with the loss of subjects to the
research over time for various reasons. It is
particularly important for the researcher to
note, whenever possible, reasons for the
defection of subjects and to pay especially
close attention to attrition rates across
experimental conditions.
As the subjects who drop out might be
different from those who stay, the experimental
and the control group no longer remain
equivalent.

8. Selection-maturation
is a threat when the various experimental groups
experience maturational change at different
rates.

9. Selection-history
is particularly a threat when experimental
groups are separated geographically, in different
states or even different hospitals. An event of
local importance may well affect only one of
your groups.

10. Selection-instrumentation
occurs when experimental groups’ average
scores differ on an instrument where the
intervals are not equal across the entire range
of the measure.
An example would be a thermometer that was
more sensitive below 100 degrees than above
200 degrees.

Threats to External Validity
1. Interaction of selection and treatment
A particular threat to external validity is
the use of volunteer subjects. Volunteers
often have characteristics that differ from
those of the general population.

2. Interaction of setting and treatment
involves the extent to which the experimental
setting resembles the setting in which treatment
will be received in post-experimental situations.
This problem is less severe in the health
sciences than in the behavioral and social
sciences, where college students often serve as
subjects for research intended to be generalized to
industrial or other non-academic settings. The
control and similarity of medical care facilities
minimizes this threat for health science
professionals.

3. Interaction of history and treatment
involves differences in the ways that various treatment
groups may be affected by events occurring during the
course of the research.
For example, the treatment group might be enhanced
by an outside event while the control group remains
unaffected.
At this point you may be wondering how researchers
ever manage to design a completely valid study. The truth is
that no one ever does. There is no perfect study, only studies
containing degrees of imperfection. This is why studies must
be replicated in other settings, by other researchers, using
other operations and methods. Good researchers are
constantly aware of the pitfalls of study design presented by
threats to validity and try to avoid as many of them as
possible.

Reliability and Validity
Reliability
•How accurate or consistent is the measure?
•Would two people understand a question in the same way?
•Would the same person give the same answers under similar circumstances?
Validity
•Does the concept measure what it is intended to measure?
•Does the measure actually reflect the concept?
•Do the findings reflect the opinions, attitudes, and behaviors of the target population?

Example
Suppose that you have bathroom weight scales and these
weight scales are broken. The weight scales will represent
the methodology. One person weighs you with these scales
and obtains a result. Then, the weight scales are passed
along to another person. The second person follows the
same procedure, uses the same weight scales and weighs
you. The same broken weight scales are used. The two
people, using the same broken weight scales, come to
similar measures. The results are reliable. The results are
obtained by two (or perhaps more) people using the faulty
scale. Although the results are reliable, they may not be
valid. That is, by using the faulty scales,
the results are not a true indicator of the real weight.

Not Valid and Not Reliable

Some Validity Not Reliable

Not Valid but Reliable

Valid and Reliable

The Reliability and Validity Relationship
•An instrument that is valid is always reliable
•An instrument that is not valid may or may
not be reliable
•An instrument that is reliable may or may
not be valid

Bias (systematic error)
-deviation from truth in one direction
-a dirty dirt
Random error (random variation, chance, noise)
-deviation from truth in both directions, i.e.,
above & below
-mean = population true value
-a clean dirt
-cannot be totally eliminated
-estimated and reduced by statistics
In most situations, there are both bias and chance.
Measured value = True value + Systematic error
+ Random error
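This decomposition can be illustrated with a toy simulation (all numbers hypothetical): averaging many readings washes out random error but leaves systematic error intact:

```python
import random

random.seed(1)
TRUE_VALUE = 70.0   # hypothetical true weight in kg
BIAS = 2.5          # systematic error: the scale reads high
# Each reading = true value + systematic error + random error (SD = 1.0)
readings = [TRUE_VALUE + BIAS + random.gauss(0, 1.0) for _ in range(10_000)]

mean_reading = sum(readings) / len(readings)
# The mean converges to TRUE_VALUE + BIAS, not TRUE_VALUE:
# random error averages out; bias does not
print(f"mean reading: {mean_reading:.2f} (true value {TRUE_VALUE})")
```

This is why bias must be addressed by design (calibration, blinding), while random error can be estimated and reduced by statistics, as the slide notes.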

Sources of Measurement Error
1.Individual taking the measurements
(often called the tester or rater)
2.Measuring instrument
3.Variability of the characteristic being
measured

Sources of Measurement Error (cont.)
Many sources of error can be minimized
through careful planning, training, clear
operational definitions and inspection of
equipment.
Therefore, a testing protocol should
thoroughly describe the method of
measurement, which must be uniformly
performed across trials.
Isolating and defining each element of the
measure reduces the potential for error,
thereby improving reliability.

-Dichotomous (2 × 2 table)
-sensitivity, specificity, accuracy

                 Gold Standard (Truth)
                   +ve        -ve
Test   +ve          a          b
       -ve          c          d
                   a+c        b+d

Sensitivity = a / (a+c)
Specificity = d / (b+d)
Accuracy = (a+d) / (a+b+c+d)
-Continuous
1-sample (paired) t-test
1) Statistical measures of validity:
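The three indices can be computed directly from the 2 × 2 table cells; a minimal sketch with hypothetical counts:

```python
def diagnostic_indices(a, b, c, d):
    """a: test+/gold+, b: test+/gold-, c: test-/gold+, d: test-/gold-."""
    sensitivity = a / (a + c)          # proportion of true positives detected
    specificity = d / (b + d)          # proportion of true negatives detected
    accuracy = (a + d) / (a + b + c + d)
    return sensitivity, specificity, accuracy

# Hypothetical 2 x 2 table
se, sp, acc = diagnostic_indices(a=80, b=10, c=20, d=90)
print(se, sp, acc)  # 0.8 0.9 0.85
```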

Assumptions:
-Subjects are independent
-Observers are independent
-Categories of the (nominal, ordinal) scale are
independent, mutually exclusive, and exhaustive
2) Statistical measures of reliability:

-Nominal data
Dichotomous 2 × 2 table: Kappa
Polychotomous r × r table: Overall Kappa,
Individual Kappa
Degree of agreement:
K > 0.75 : excellent agreement beyond chance
0.40 ≤ K ≤ 0.75 : fair to good
K < 0.40 : poor
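For a dichotomous (2 × 2) agreement table, Cohen's kappa compares observed agreement with the agreement expected by chance; a minimal sketch with hypothetical counts for two raters:

```python
def cohen_kappa(a, b, c, d):
    """Agreement table for two raters: a = both +, d = both -, b/c = disagreements."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from the raters' marginal proportions
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts: observed agreement 0.85, chance agreement 0.50
print(round(cohen_kappa(a=40, b=10, c=5, d=45), 3))  # prints 0.7
```

By the thresholds above, K = 0.7 would indicate fair-to-good agreement beyond chance.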

Kappa
SPSS

SPSS Output: Symmetric Measures

                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement: Kappa    .245          .134                2.498           .013
N of Valid Cases                100

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

-Continuous data: Intraclass correlation (ICC, r_I)
ICC:
1. 2 or more observers
2. Each subject can have a different # of observers
3. Can be applied to ordinal data (when intervals between
categories are assumed to be equivalent)
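As a sketch of one common variant, the one-way random-effects ICC(1,1), assuming the same number k of ratings per subject (the data are hypothetical; other ICC forms differ):

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1); ratings: per-subject lists of k ratings each."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    subj_means = [sum(r) / k for r in ratings]
    # Between-subjects and within-subject mean squares
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((x - m) ** 2 for r, m in zip(ratings, subj_means) for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical: 4 subjects, each rated by 2 observers
data = [[9, 10], [6, 7], [8, 8], [2, 3]]
print(round(icc_oneway(data), 3))  # prints 0.959
```

High values indicate that most score variance comes from real differences between subjects rather than from rater disagreement.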
