Module 3, Lesson 3: Improving a Classroom-based Assessment
Learning Targets
- I can list down the different ways for judgmental item-improvement and other empirically-based procedures.
- I can differentiate judgmental item-improvement from empirically-based procedures.
- I can evaluate which type of test item-improvement is appropriate to use.
- I can compute the index of difficulty, index of discrimination, and distracter efficiency.
- I can interpret the computed indices for difficulty and discrimination, and distracter efficiency.
- I can demonstrate knowledge of the procedures for improving a classroom-based assessment.
Introduction
By now, it is assumed that you know how to plan a classroom test: specifying the purpose for constructing it, identifying the instructional outcomes to be assessed, and preparing a test blueprint to guide the construction process. The techniques and strategies for selecting and constructing different item formats to match the intended instructional outcomes make up the second phase of the test development process, which was the content of the preceding lesson.
The process, however, is not complete without ensuring that the classroom instrument is valid for the purpose for which it is intended. Ensuring validity requires reviewing and improving the items, which is the next stage in the process. This lesson offers pre-service teachers practical and necessary ways of improving teacher-developed assessment tools.
Judgmental Item-Improvement
This approach basically makes use of human judgment in reviewing the items. The judges are:
- the teachers themselves, who know exactly what the test is for, the instructional outcomes to be assessed, and the level of difficulty appropriate to their class;
- the teacher's peers or colleagues, who are familiar with the curriculum standards for the target grade level, the subject matter content, and the ability of the learners; and
- the students themselves, who can perceive difficulties based on their past experiences.
Teachers' Own Review
It is always advisable for teachers to take a second look at the assessment tools they have devised for a specific purpose. To presume perfection right after construction may lead to failure to detect shortcomings of the test or assessment tasks.
Popham (2011) gives five suggestions for teachers to follow when exercising judgment:
1. Adherence to item-specific guidelines and general item-writing commandments. The preceding lesson provides specific guidelines for writing the various objective and non-objective constructed-response types and the selected-response types for measuring higher-level thinking skills. Teachers should use these guidelines to check how the items have been planned and written, particularly their alignment with the intended instructional outcomes.
2. Contribution to score-based inference. The teacher examines whether the scores generated by the test contribute to making valid inferences about the learners. Can the scores reveal the amount of learning achieved or show what has been mastered? Can the scores indicate the students' capability to move on to the next instructional level? Or do the scores obtained make no difference at all in describing or differentiating various abilities?
3. Accuracy of content. This review should especially be considered when a test was developed some time ago. Changes that may have occurred due to new discoveries or developments can refine the contents of a summative test. If this happens, the items or the key to correction may need to be revisited.
4. Absence of content gaps. This review criterion is especially useful in strengthening the score-based inference capability of the test. If the current tool misses important content now prescribed by a new curriculum standard, the score will likely not give an accurate description of what is expected to be assessed. The teacher always ensures that the assessment tool matches what is currently required to be learned. This is a way of checking the content validity of the test.
5. Fairness. The discussions on item-writing guidelines repeatedly warn against unintentionally helping the uninformed student obtain a higher score through inadvertent grammatical clues, unattractive distracters, ambiguous problems, and messy test instructions. Sometimes, unfairness arises from the undue advantage enjoyed by a particular group, such as those seated at the front of the classroom or those coming from a particular socio-economic level. Getting rid of faulty and biased items and writing clear instructions definitely add to the fairness of the test.
Peer Review
Some schools encourage peer or collegial review of assessment instruments. Time is provided for this activity, and it has almost always yielded good results for improving tests and performance-based assessment tasks. During these teacher dyad or triad sessions, those teaching the same subject area can openly review together the classroom tests and tasks they have devised against some consensual criteria.
The suggestions given by test experts can actually be used collegially as the basis for a review checklist:
- Do the items follow the specific and general guidelines in writing items, especially on:
  - being aligned to instructional objectives?
  - making the problem clear and unambiguous?
  - providing plausible options?
  - avoiding unintentional clues?
  - having only one correct answer?
- Are the items free from inaccurate content?
- Are the items free from obsolete content?
- Are the test instructions clearly written for students to follow?
- Is the level of difficulty of the test appropriate to the level of the learners?
- Is the test fair to all kinds of students?
Student Review
Engaging students in reviewing items has become a laudable practice for improving classroom tests. The judgment is based on the students' experience in taking the test and their impressions and reactions during the testing event. The process can be carried out efficiently through the use of a review questionnaire. Popham (2011) illustrates a sample questionnaire, shown in the textbox. It is better to conduct the review activity a day after the test is taken, so the students still remember the experience when they see a blank copy of the test.
Item-Improvement Questionnaire for Students
- If any of the items seemed confusing, which ones were they?
- Did any items have more than one correct answer? If so, which ones?
- Did any items have no correct answer? If so, which ones?
- Were there words in any item that confused you? If so, which ones?
- Were the directions for the test, or for particular sub-sections, unclear? If so, which ones?
Another technique for eliciting student judgment for item improvement is going over the test with the students before the results are released. Students usually enjoy this activity since they get feedback on the answers they have written. As each item is tackled, the students can be asked to give their answers, and if there is more than one possible correct answer, the teacher makes notations for item alterations. Having more than one correct answer signals ambiguity either in the stem or in the given options. The teacher may also take the chance to observe sources of confusion, especially when answers vary. During this session, it is important for the teacher to maintain an atmosphere that allows students to question and give suggestions. It also follows that after an item-review session, the teacher should be willing to modify incorrectly keyed answers.
Empirically-based Procedures
Item improvement using empirically-based methods is aimed at improving the quality of an item using the students' responses to the test. Test developers refer to this technical process as item analysis, as it utilizes data obtained separately for each item. An item is considered good when its quality indices, i.e., the difficulty index and the discrimination index, meet certain characteristics.
For a norm-referenced test, these two indices are related since the level of difficulty of an item contributes to its discriminability. An item is good if it can discriminate between those who perform well in the test and those who do not. However, an extremely easy item, one that can be answered correctly by more than 85% of the group, or an extremely difficult item, one that can be answered correctly by only 15%, is not expected to perform well as a "discriminator". The group will appear to be quite homogeneous with items of this kind. They are weak items since they do not contribute to "score-based inference".
The difficulty index, however, takes a different meaning when used in the context of criterion-referenced interpretation or testing for mastery. An item with a high difficulty index is not dismissed as an "easy item" and therefore a weak item, but rather regarded as an item that displays the capability of the learners to perform the expected outcome. It therefore becomes evidence of mastery.
For objective tests, the responses are usually binary in form, i.e., right or wrong, translated into the numerical figures 1 and 0 for obtaining nominal data such as frequencies, percentages, and proportions. The useful data then are:
- the total number of students answering the item (T), and
- the total number of students answering the item right (R).
Difficulty Index
An item is difficult if the majority of students are unable to provide the correct answer; it is easy if the majority of students are able to answer it correctly.
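In symbols (following the usual convention, using the T and R defined above), the difficulty index p of an item is simply the proportion of students answering it correctly: p = R/T. For example, if 6 out of 10 students answer an item right, p = 6/10 = 0.60.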
How do we determine the difficulty and discrimination index of each item?
Test results… (the worked example below assumes a table of ten students' item-by-item responses, 1 = right and 0 = wrong, together with their total scores)
Step 1. Get the total score of each student and arrange the scores from highest to lowest.
Step 2. Obtain the upper and lower 27% of the group. Multiplying 0.27 by the total number of students (10) gives 2.7, which rounds to 3. Get the top three and the bottom three students based on their scores. The top three students are Students 2, 5, and 9; the bottom three are Students 7, 8, and 4. The rest of the students are not included in the item analysis.
Step 3. Obtain the proportion of correct answers for every item in the upper 27% and the lower 27% groups. This is done by summing the correct answers to the item within a group and dividing by the number of students in that group. Examples: 1/3 = 0.33, 2/3 = 0.67, 3/3 = 1.00.
Step 4. The item difficulty is obtained by averaging the two proportions:

p = (pH + pL) / 2

where pH = proportion correct in the upper 27% and pL = proportion correct in the lower 27%.
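To make the arithmetic concrete, here is a minimal Python sketch of Steps 1-4 for a single item. The class scores and item responses are hypothetical (only the resulting ranking of students matches the module's example); p_H and p_L stand for the upper- and lower-group proportions.

```python
# A minimal sketch (hypothetical data, not the module's own) of Steps 1-4 for one item
# in a class of 10 students.

# Each student's total test score and response to Item 1 (1 = right, 0 = wrong).
students = {
    "Student 1": (12, 0), "Student 2": (18, 1), "Student 3": (10, 1),
    "Student 4": (5, 0),  "Student 5": (17, 1), "Student 6": (11, 0),
    "Student 7": (7, 1),  "Student 8": (6, 0),  "Student 9": (16, 1),
    "Student 10": (9, 0),
}

# Step 1: arrange the students by total score, highest to lowest.
ranked = sorted(students.items(), key=lambda kv: kv[1][0], reverse=True)

# Step 2: take the upper and lower 27% (0.27 x 10 = 2.7, rounded to 3 students each).
k = round(0.27 * len(ranked))
upper, lower = ranked[:k], ranked[-k:]

# Step 3: proportion of correct answers to the item in each group.
p_H = sum(resp for _, (_, resp) in upper) / k
p_L = sum(resp for _, (_, resp) in lower) / k

# Step 4: item difficulty is the average of the two proportions.
p = (p_H + p_L) / 2
print(f"pH = {p_H:.2f}, pL = {p_L:.2f}, difficulty p = {p:.2f}")  # pH = 1.00, pL = 0.33, p = 0.67
```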
Discrimination Index
The power of an item to discriminate between informed and uninformed groups, or between more knowledgeable and less knowledgeable learners, is shown by the item-discrimination index (D). This is an item statistic that can reveal useful information for improving an item. An item-discrimination index shows the relationship between a student's performance on an item (i.e., right or wrong) and his or her total performance in the test as represented by the total score. An item-total correlation is usually part of an item-analysis package. High item-total correlations indicate that the items contribute well to the total score, so that responding correctly to these items gives a better chance of obtaining a relatively high total score in the whole test or subtest.
The discrimination index shows whether a difference exists between the performance of those who scored high and those who scored low on the item. As a general rule, the higher the discrimination index (D), the more discriminating the item is. The nature of the difference, however, can take different directions:
- Positively discriminating item – the proportion of the high-scoring group answering correctly is greater than that of the low-scoring group.
- Negatively discriminating item – the proportion of the high-scoring group is less than that of the low-scoring group.
- Non-discriminating item – the proportion of the high-scoring group is equal to that of the low-scoring group.
Computing the discrimination index therefore requires obtaining the difference between the proportion of the high-scoring group answering the item correctly and the proportion of the low-scoring group answering the item correctly, using this simple formula:

D = (RU - RL) / T

Since the totals in the upper and lower 27% groups are equal, the formula can also be written as:

D = pH - pL

To obtain the proportions of the upper and lower groups responding to the item correctly, the teacher follows these steps:
1. Score the test papers using a key to correction to obtain the total score of each student. The maximum score is the total number of objective items.
2. Order the test papers from highest to lowest score.
3. Split the test papers into a high group and a low group. For a class of 50 or fewer students, do a 50-50 split: take the upper half as the HIGH group and the lower half as the LOW group. For a big group of 100 or so, take the upper 25%-27% and the lower 25%-27%. Maintain equal numbers of test papers in the upper and lower groups.
4. Obtain the p-value for the upper group and the p-value for the lower group: pUpper = RU/T and pLower = RL/T, where T is the number of papers in each group.
5. Get the discrimination index (D) by taking the difference between the two p-values.
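The same computation in a short Python sketch, using a hypothetical item (the counts are illustrative, not taken from the module):

```python
def discrimination_index(right_upper: int, right_lower: int, group_size: int) -> float:
    """D = pH - pL, assuming equal-sized upper and lower groups."""
    return right_upper / group_size - right_lower / group_size

# Hypothetical item: 3 of 3 upper-group students and 1 of 3 lower-group students got it right.
D = discrimination_index(right_upper=3, right_lower=1, group_size=3)
print(f"D = {D:.2f}")  # D = 0.67 -> a positively discriminating item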
For purposes of evaluating the discriminating power of items, Popham (2011) offers the guidelines proposed by Ebel and Frisbie (1991) shown below. They guide teachers in selecting the satisfactory items and in deciding what to do to improve the rest.

Discrimination index (D)    Item evaluation
0.40 and above              Very good items
0.30 – 0.39                 Reasonably good items, but possibly subject to improvement
0.20 – 0.29                 Marginal items, usually needing and being subject to improvement
0.19 and below              Poor items, to be rejected or improved by revision
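Based on these guideline bands, a small helper like the sketch below (hypothetical, not part of the module) could label items once D has been computed:

```python
def evaluate_item(d: float) -> str:
    """Label an item's discriminating power using the Ebel and Frisbie (1991) bands."""
    if d >= 0.40:
        return "very good item"
    if d >= 0.30:
        return "reasonably good, but possibly subject to improvement"
    if d >= 0.20:
        return "marginal item, usually needing improvement"
    return "poor item, to be rejected or improved by revision"

print(evaluate_item(0.67))  # very good item
```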
Distracter Analysis
Another empirical procedure for discovering areas for item improvement is an analysis of the distribution of responses across the distracters. Obviously, when the difficulty index and the discrimination index of an item suggest that it is a candidate for revision, distracter analysis becomes a useful follow-up.
Let's have an example: a multiple-choice item whose four options are shown in the table below, with option A as the correct answer.
Each distracter can have its own item-discrimination value, which makes it possible to analyse how the distracters work and ultimately refine the effectiveness of the test item itself. Using the item above as an example, the item-discrimination concept can be applied to assess the effectiveness of each distracter. Consider a class of 100 students, from which we form upper and lower groups of 30 students each.
Assume the following results are observed for the upper and lower groups:
Now, what is the IE of every distracter?

Distracter                Hg    Lg    IE                   Interpretation
A. it rained all day*     20    10    (20-10)/30 = .33     ?
B. he was scolded          3     3    (3-3)/30   = .00     ?
C. he hurt himself         4    16    (4-16)/30  = -.40    ?
D. the weather was hot     3     1    (3-1)/30   = .07     ?
* Correct answer
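The same computation in a brief Python sketch, using the counts from the table above and the module's (Hg - Lg)/group-size formula for IE:

```python
# Per-option discrimination (IE) = (upper-group picks - lower-group picks) / group size,
# with upper and lower groups of 30 students each, as in the table above.
GROUP_SIZE = 30
options = {                              # (Hg, Lg) counts per option
    "A. it rained all day*": (20, 10),   # * keyed (correct) answer
    "B. he was scolded": (3, 3),
    "C. he hurt himself": (4, 16),
    "D. the weather was hot": (3, 1),
}

for option, (hg, lg) in options.items():
    ie = (hg - lg) / GROUP_SIZE
    print(f"{option:<24} IE = {ie:+.2f}")
# A +0.33 (the keyed answer discriminates positively), B +0.00, C -0.40, D +0.07
```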