LEARNING OBJECTIVES
At the end of the module, the students should be able to:
1. identify the criteria of a good test;
2. explain the meaning of each criterion; and
3. discuss the different statistical formulas common to language tests.
A. Criteria of a good test
In creating a valid and reliable language test for general or specific purposes, qualitative and quantitative test analyses are of the utmost importance.
A complete formal analysis requires thorough psychometric knowledge, whereas informal analyses are less rigid and can be undertaken with relative ease. The analysis described below is feasible for most test developers and will improve the test quality considerably.
Prior to the analysis component, there are a few key concepts which are central to good tests:
Relevance
Representativity
Authenticity
Balance
Validity
Reliability
Relevance – The extent to which it is necessary that students are able to perform task x.
Representativity – The extent to which task x represents a real situation.
Authenticity – The extent to which the situation and the interaction are meaningful and representative in the world of the individual user.
Balance – The extent to which each relevant topic/ability receives an equal amount of attention.
Validity – The extent to which the test effectively measures what it is intended to measure.
Reliability – The consistency and stability with which a test measures performance.
Sub-classifications of validity:
Concurrent validity
Construct validity
Content validity
Convergent validity
Criterion-related validity
Discriminant validity
Face validity
Predictive validity
Concurrent validity – A test is said to have concurrent validity if the scores it gives correlate highly with a recognized external criterion which measures the same area of knowledge or ability.
Construct validity – A test is said to have construct validity if scores can be shown to reflect a theory about the nature of a construct or its relationship to other constructs. It could be predicted, for example, that two valid tests of listening comprehension would rank learners in the same way, but each would have a weaker relationship with scores on a test of grammatical competence.
Content validity – A test is said to have content validity if the items or tasks of which it is made up constitute a representative sample of items or tasks for the area of knowledge or ability to be tested. These are often related to a syllabus or course.
Convergent validity – A test is said to have convergent validity when there is a high correlation between scores achieved on different tests measuring the same construct (irrespective of method). This can be considered an aspect of construct validity.
Criterion-related validity – A test is said to have criterion-related validity if a relationship can be demonstrated between test scores and some external criterion which is believed to be a measure of the same ability. Information on criterion relatedness is also used in determining how well a test predicts future behavior.
Discriminant validity – A test is said to have discriminant validity if its correlation with tests of a different trait is lower than its correlation with tests of the same trait, irrespective of testing method. This can be considered an aspect of construct validity.
Face validity – The extent to which a test appears to candidates, or those choosing it on behalf of candidates, to be an acceptable measure of the ability they wish to measure. This is a subjective judgment rather than one based on any objective analysis of the test, and face validity is often considered not to be a true form of validity. It is sometimes referred to as ‘test appeal’.
Predictive validity – An indication of how well a test predicts future performance in a relevant skill.
Factors that influence validity:
Appropriateness of test items
Directions
Reading vocabulary and sentence structure
Difficulty of items
Construction of test items
Length of the test
Arrangement of items
Patterns of answers
A number of variables influence test reliability:
Specificity – Questions should not be open to different interpretations.
Differentiation – The test discriminates between good and poor students.
Difficulty – The test has an adequate level of difficulty.
Length – The test contains enough items; in multiple choice, at least 40 items are required.
Time – Students should have sufficient time to perform a test/task.
Item construction – A well-constructed question is better than a poor one.
Over the past years, the focus of test construction has shifted from reliability to validity and, more specifically, construct validity. Additionally, tests are increasingly considered part of educational practice.
The more reliable a test is, the less random error it contains. A test which contains systematic error, e.g. bias against a certain group, may be reliable but not valid.
Possible reasons for the inconsistency of an individual’s score in a test:
Scorer’s inconsistency
Limited sampling of behavior
Changes in the individual himself
Factors which influence the reliability of a test:
Objectivity
Difficulty of the test
Length of the test
Adequacy
Testing condition
Test administration procedures
B. Importance of quantitative analysis
PURPOSE Quantitative analysis is meant to give some idea about the reliability of the test. It is not always easy to determine on sight which questions are unclear or problematic in some way. Statistical data can make problematic items more visible.
GETTING STARTED During the development phase of the test development process, a sample group of at least 20 representative end-users is gathered, the test is administered to them, and the results are analyzed using a statistical program. Usually, the help of a statistician is necessary at this point. When this has been done, the descriptive statistics, the correlations, and the item reliability analyses can be checked.
C. Common Statistical Formulas
1. DESCRIPTIVE STATISTICS Descriptive statistics are intended to offer a general idea about the test scores. A review of the following important terms and concepts of descriptive statistics is important:
N indicates the number of tests reviewed.
Minimum signals the lowest score from the population.
Maximum signals the highest score from the population.
Mean refers to the average of the scores.
Std. Deviation (SD) is the mean deviation of the values from their arithmetic mean. A small SD implies that in general the scores do not deviate much from the mean.
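The descriptive statistics above can be sketched in a few lines of Python using only the standard library. The score list below is invented purely for illustration:

```python
# Descriptive statistics for a hypothetical set of test scores.
# The scores are invented for illustration only.
from statistics import mean, pstdev

scores = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]

n = len(scores)        # N: number of tests reviewed
minimum = min(scores)  # lowest score in the sample
maximum = max(scores)  # highest score in the sample
avg = mean(scores)     # mean: average of the scores
sd = pstdev(scores)    # population standard deviation

print(f"N={n}, Min={minimum}, Max={maximum}, Mean={avg:.2f}, SD={sd:.2f}")
```

A small SD here would mean the printed scores cluster tightly around the mean, exactly as described above.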
2. CORRELATIONS Correlations are illustrated by scatter plots, which are similar to line graphs in that they use horizontal and vertical axes to plot data points, but serve a very specific purpose: scatter plots show how much one variable is affected by another. The relationship between two variables is called their correlation.
Correlation indicates the strength and direction of a linear relationship between two random variables. Correlations are always situated on the -1 to 1 spectrum. The closer a correlation is to either end of the spectrum, the stronger the relationship.
A relationship is statistically significant if Sig. < .05.

Figure 1. Perfect Positive Correlation
Scatter plots usually consist of a large body of data. The closer the data points come to forming a straight line when plotted, the higher the correlation between the two variables, or the stronger the relationship. See the figures below.
A perfect positive correlation is given the value of 1. A perfect negative correlation is given the value of -1. If there is absolutely no correlation present, the value given is zero. The closer the number is to 1 or -1, the stronger the correlation, or the stronger the relationship between the variables. The closer the number is to zero, the weaker the correlation.
If the data make a straight line going from the origin out to high x- and y-values, the variables are said to have a positive correlation. If the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation. In language tests, correlations can merely serve as an indicator of reliability, but very low correlations mostly mean that something is wrong.
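The -1 to 1 correlation values discussed above can be computed with a minimal Pearson correlation sketch in Python. The score lists (reading, listening, grammar) are hypothetical data chosen to show a perfect positive and a strong negative correlation:

```python
# A minimal sketch of the Pearson correlation between two score sets.
# All score data below are invented for illustration only.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient: covariance over product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

reading = [10, 12, 14, 16, 18]
listening = [11, 13, 15, 17, 19]  # perfectly linear in reading: r = 1.0
grammar = [18, 15, 16, 12, 10]    # tends to fall as reading rises: r near -1

print(pearson(reading, listening))
print(pearson(reading, grammar))
```

Ranking learners identically (the listening case) yields r = 1, a perfect positive correlation; scores that fall as the other variable rises (the grammar case) yield a value close to -1.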
If the correlations are generally significant at the highest level (99%), except, for example, for the listening test, this may mean that the listening test does not differentiate between the most able and the least able test takers.
3. ITEM RELIABILITY An item reliability analysis indicates the discriminatory potential of a test item (i.e. does it differentiate between the most able and the least able test takers?). As in standard correlations, a very reliable item (with a highly discriminatory capacity) would score close to -1 or 1. Items are considered unreliable if they score between -.3 and .3.
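One common way to obtain such a per-item discrimination figure is to correlate each item's right/wrong score with the test takers' total scores (an item-total correlation). The sketch below assumes this approach; the `answers` matrix is invented, and the .3 cutoff is the one given above:

```python
# A sketch of an item discrimination check via item-total correlation.
# Rows = test takers, columns = items (1 = correct, 0 = incorrect).
# The answer matrix is hypothetical, for illustration only.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient: covariance over product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

answers = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

totals = [sum(row) for row in answers]
for i in range(len(answers[0])):
    item = [row[i] for row in answers]
    r = pearson(item, totals)
    # |r| <= .3 flags the item as unreliable per the rule above
    flag = "ok" if abs(r) > 0.3 else "weak"
    print(f"item {i + 1}: r = {r:.2f} ({flag})")
```

An item that strong test takers tend to get right and weak test takers tend to miss will show a high positive r; an item answered almost at random will fall inside the -.3 to .3 band and deserves a closer qualitative look.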