Examples of estimated assignment parameters (figures in parentheses are standard deviations of the estimated parameter values)
1. Introduction
As Computer Supported Collaborative Learning (CSCL) and other forms of collaborative learning have become popular in recent years, peer assessment, i.e., mutual evaluation among learners, has been attracting interest (for instance, Davies, 1999; Akahori & Kim, 2003). Peer assessment has the following advantages:
1) The learners are more self-reliant and their learning motivation is higher with peer assessment (Weaver & Cotrell, 1986 and Falchikov, 1986).
2) The opinions of other learners are more effective than grade points in inducing the learner’s self-reflection (Weaver & Cotrell, 1986).
3) By evaluating others, the assessor is able to learn from the other’s work, which induces self-reflection (Falchikov, 1986).
4) Feedback from other learners who have similar backgrounds is readily understood (Falchikov, 1986).
5) It reduces the instructor’s workload, and the learner can receive useful feedback even when there is no instructor (Orpen, 1982).
6) Useful feedback which the instructor is unlikely to provide can be obtained; a wide range of feedback can be obtained (Orpen, 1982).
7) When the learners consist of mature adults, evaluation by multiple assessors is more reliable than that by a single instructor (Falchikov, 1986; Orpen, 1982 and Arnold, 1981).
This study is concerned primarily with the advantage 7) above, that is, the use of peer assessment to improve the reliability of evaluations. Falchikov (1986) reports that peer assessments among primary school children were not so reliable, whereas those among junior-high-school students were more reliable. Arnold (1981) introduced peer assessment in a course in medical school, where it was demonstrated that a fair and consistent evaluation took place. Orpen (1982) compared instructor evaluation and peer assessments among university students, and found that not only was there no significant difference between the two when averages were compared, but that peer assessment was in fact more reliable than assessment by a single instructor. The above studies demonstrate that peer assessment is more reliable than an instructor’s evaluation, at least in higher education, but there have been no studies so far on methods to further improve the reliability. Furthermore, certain issues remain in peer assessment, such as:
The assessors may not all share the same assessment criteria,
An assessor may not always be consistent in applying the same assessment criteria,
Treatment of missing data is uncertain.
To resolve these issues, this paper proposes applying Item Response Theory (for instance, Samejima, 1969) to peer assessments, and a method of estimating the parameters. Specifically, we propose a modification of the Graded Item Response model (Samejima, 1969) that incorporates the assessors’ evaluation criterion parameters. This model has the following advantages:
1) A consistent assessment based on a common scale is possible even when the assessors have different evaluation criteria.
2) The reliability of the assessors is taken into account to evaluate the learners, which produces a more reliable evaluation.
3) Model parameters can be readily estimated from incomplete or missing data, and the missing data themselves can be estimated.
4) As a result of 1)-3) above, it is possible to assess the learners’ outcomes with better estimation accuracy.
In addition, we propose introducing two indices to analyze assessors: 1) the strictness of the assessor’s evaluation criteria, and 2) the assessor’s consistency. The proposed method was applied to real data, which demonstrated its validity.
2. Peer Assessment System
The Learning Management System (LMS) “Samurai” (Ueno, 2005), developed by one of the authors, supports a bulletin board system. A learner may set up his/her “room,” as shown in Figure 1, where the learner can post his/her assignment work and other remarks. Students may “visit” the rooms of other students, where they can post a critique of the assignment work, exchange views, and support each other in solving assignment problems (Ueno, 2006). The example in Figure 1 shows a student submitting a weekly report for an undergraduate course on e-learning. The bulletin board, in the lower half of the screenshot, shows the other learners’ critiques and opinions of the report. The learner who submitted the report can take these inputs into consideration and rework his/her assignment or rewrite the report. The five buttons shown at the upper left are used for assessing the assignment work, and consist of -2 (Bad), -1 (Poor), 0 (Fair), 1 (Good), and 2 (Excellent). Each room presents an online listing of the average rating and the number of assessors. After converting the points [-2,-1,0,1,2] to [1,2,3,4,5], respectively, the data for assignment item j (j=1,…,M) of learner i (i=1,…,N) given by assessor r (r=1,…,n), who gave the ranking category x = k, k = 1, 2, …, m (m=5 in the present case), can be obtained as follows:
where
Because not all elements of X can be collected in most cases, X often contains missing entries. The missing data of X are represented as
where
This study proposes applying the item response theory to the data X above.
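The rating data described in this section can be pictured as a three-way array indexed by learner, assignment, and assessor. The following is a minimal sketch under stated assumptions: the array name `X`, the helper `record_rating`, the toy sizes, and the use of NaN to mark a missing rating are all illustrative choices and are not taken from the Samurai system.

```python
import numpy as np

# Hypothetical sizes: N learners, M assignments, n assessors.
N, M, n = 4, 3, 4

# X[i, j, r] holds the category (1..5) that assessor r gave to
# assignment j of learner i; NaN marks a rating that was never given.
X = np.full((N, M, n), np.nan)


def record_rating(X, i, j, r, points):
    """Store a button press in {-2, ..., 2} as a category in {1, ..., 5}."""
    if points not in (-2, -1, 0, 1, 2):
        raise ValueError("points must be one of -2, -1, 0, 1, 2")
    X[i, j, r] = points + 3


# A few illustrative ratings (not data from the study).
record_rating(X, 0, 0, 1, 2)    # "Excellent" becomes category 5
record_rating(X, 0, 0, 2, -1)   # "Poor" becomes category 2
record_rating(X, 1, 2, 0, 0)    # "Fair" becomes category 3

observed = ~np.isnan(X)                 # True where a rating exists
n_missing = int((~observed).sum())      # number of missing entries
mean_rating = np.nanmean(X[0, 0])       # mean over assessors who responded
```

Keeping the missing entries explicit, rather than filling them with a default score, lets later estimation steps use only the ratings that were actually given.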
3. Item Response Theory for Peer Assessment
3.1 Item Response Theory
With the widespread use of computer-based testing, Item Response Theory (Samejima, 1969), a recent test theory based on mathematical models, is being widely employed in areas such as human resource measurement, entrance exams, and certification tests. It has the following advantages:
It is possible to assess ability while minimizing the effects of heterogeneous or aberrant items which have a low estimation accuracy.
The learner’s response to different items can be assessed on the same scale.
Missing data can be readily estimated.
This paper proposes the application of Item Response Theory to data obtained in peer assessments, where the following issues associated with peer assessment can be resolved because of the above advantages.
The assessors may not all share the same evaluation criteria,
An assessor may not be always consistent in applying the same assessment criteria,
Treatment of missing data is uncertain.
While many models have been proposed with regard to Item Response Theory, many peer assessments employ multi-grade Likert Scales (five grades in this study). In this study, therefore, we employ a modification of the Graded Item Response model (Samejima, 1969). This model is used when the assessment of an item can be expressed by points in
where
Examples of response curves for the model with five grade levels (1-5) are shown in Figure 2, where the abscissa is the learner’s ability
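As a rough illustration of how response curves of this kind arise, the sketch below computes category probabilities under the commonly used logistic form of the graded response model. The function name, the parameter names `a` and `b`, and the boundary values are assumptions made for this example and may differ from the notation used in this paper.

```python
import numpy as np


def grm_category_probs(theta, a, b):
    """Category probabilities of the graded response model (logistic form).

    theta : learner ability
    a     : discrimination of the item (a > 0)
    b     : increasing category boundaries b_1 < ... < b_{m-1}
    Returns an array of length m holding P(x = k) for k = 1..m.
    """
    b = np.asarray(b, dtype=float)
    # P(x >= k) for k = 2..m, one value per boundary.
    p_ge = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P(x >= 1) = 1 and P(x >= m + 1) = 0, then difference.
    cum = np.concatenate(([1.0], p_ge, [0.0]))
    return cum[:-1] - cum[1:]


boundaries = [-2.0, -0.5, 0.5, 2.0]   # hypothetical values for five grades
probs = grm_category_probs(theta=0.0, a=1.0, b=boundaries)
```

Evaluating `grm_category_probs` over a range of `theta` values gives one curve per grade, which is how a plot with ability on the abscissa can be drawn.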
3.2 Item Response Theory for peer assessment
In this study, the assessors’ evaluation criterion parameters are incorporated into the Item Response Theory model. We assume that the response, given in
where
For instance, Figure 2 can be viewed as examples of response curves that show assessor
The characteristics of assignment
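One way to picture a graded response model that carries assessor criterion parameters is sketched below. The functional form is an assumption chosen for illustration: each assessor's criterion parameters act as category boundaries that are added to the assignment's difficulty. The function names, argument names, and the exact form of the exponent are hypothetical and should not be read as the model equation of this paper. The criterion values are those listed for assessors 4 and 5 in the table of estimated assessor parameters.

```python
import numpy as np


def peer_category_probs(theta, a_j, b_j, eps_r):
    """Category probabilities under an ASSUMED assessor-shifted graded model.

    theta : learner ability
    a_j   : discriminatory power of assignment j
    b_j   : degree of difficulty of assignment j
    eps_r : increasing criterion parameters of assessor r (length m - 1)
    """
    eps_r = np.asarray(eps_r, dtype=float)
    # Assumed form: P(x >= k) is logistic in a_j * (theta - b_j - eps_rk).
    p_ge = 1.0 / (1.0 + np.exp(-a_j * (theta - b_j - eps_r)))
    cum = np.concatenate(([1.0], p_ge, [0.0]))
    return cum[:-1] - cum[1:]


def expected_category(probs):
    """Mean category (1..m) implied by a probability vector."""
    return float(np.dot(np.arange(1, len(probs) + 1), probs))


strict = [2.29, 3.18, 3.98, 4.87]      # criterion parameters of assessor 4
lax = [-8.29, -3.94, -2.12, -1.22]     # criterion parameters of assessor 5

# Same learner and assignment (discriminatory power 1, difficulty 0).
p_strict = peer_category_probs(0.0, 1.0, 0.0, strict)
p_lax = peer_category_probs(0.0, 1.0, 0.0, lax)
```

Under this assumed form, a learner of average ability is expected to receive a low grade from the first assessor and a high grade from the second, which is the kind of difference in criteria the model is meant to absorb. The sketch expects criterion parameters in increasing order; unordered values would produce negative differences.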
4. Characteristic indices of assessor
We introduce the characteristic indices of assessors which are derived from the estimated assessor parameters.
4.1 Strictness of assessor’s evaluation criteria
Denoting by
When the strictness value is small, all response curves will shift to the left, and the learner requires only a low ability to receive high “grades.”
Table 1. Estimated assignment parameters (standard deviations in parentheses)
Assignment | Discriminatory power | Degree of difficulty | Subject | Content |
# 1 | 1.76 (0.14) | -0.39 (0.04) | The Internet and society | Yesterday, an incident occurred in which a 17-year old youth bashed the head of an infant with a hammer. The arrested youth later testified that seeing various photos of cruelty on the Internet had incited the aggressive impulse. Analyze the causal connection between information access on the Internet and such incidences of violence, and discuss how such incidences may be prevented. |
# 2 | 0.61 (0.02) | -0.17 (0.02) | Computers in our lives | Research the extent of information technology use in the local municipality of your hometown, and discuss some of the issues you find. |
# 3 | 2.48 (0.18) | 0.85 (0.03) | The Internet and privacy | Investigate ways in which private (personal) information can be leaked out via the Internet, and discuss ways of preventing them. |
Table 2. Estimated assessor parameters (standard deviations in parentheses)
Assessor | Consistency | Strictness | Criterion parameter 1 | Criterion parameter 2 | Criterion parameter 3 | Criterion parameter 4 |
1 | 0.91 | -0.90 | -4.02 (1.03) | -3.52 (0.90) | 1.89 (0.45) | 2.05 (0.48) |
2 | 0.71 | -1.55 | -7.47 (0.62) | 1.39 (0.00) | -1.39 (0.00) | 1.24 (0.42) |
3 | 0.96 | 0.39 | -8.47 (2.12) | -5.52 (1.25) | 6.06 (0.36) | 9.56 (0.89) |
4 | 0.99 | 3.58 | 2.29 (1.89) | 3.18 (1.13) | 3.98 (1.13) | 4.87 (0.84) |
5 | 0.94 | -3.87 | -8.29 (0.87) | -3.94 (1.26) | -2.12 (2.08) | -1.22 (0.81) |
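The strictness values reported for the five assessors appear to agree, to within rounding, with the arithmetic mean of each assessor's four criterion parameters. The check below is an inference from the tabulated numbers, not a restatement of the definition used in this paper; the names `criteria`, `reported_strictness`, and `mean_criterion` are illustrative.

```python
# Criterion parameters and reported strictness for the five assessors,
# copied from the table of estimated assessor parameters.
criteria = {
    1: [-4.02, -3.52, 1.89, 2.05],
    2: [-7.47, 1.39, -1.39, 1.24],
    3: [-8.47, -5.52, 6.06, 9.56],
    4: [2.29, 3.18, 3.98, 4.87],
    5: [-8.29, -3.94, -2.12, -1.22],
}
reported_strictness = {1: -0.90, 2: -1.55, 3: 0.39, 4: 3.58, 5: -3.87}


def mean_criterion(eps):
    """Arithmetic mean of an assessor's criterion parameters."""
    return sum(eps) / len(eps)


# Absolute gap between the computed mean and the reported strictness.
gaps = {
    r: abs(mean_criterion(criteria[r]) - reported_strictness[r])
    for r in criteria
}

for r in sorted(criteria):
    print(f"assessor {r}: mean = {mean_criterion(criteria[r]):.2f}, "
          f"reported = {reported_strictness[r]:.2f}")
```

Read this way, a large negative mean (as for assessor 5) corresponds to lax criteria and a large positive mean (as for assessor 4) to strict criteria.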
4.2 Consistency of assessor
It is preferable that the set of parameters
Thus, the consistency
where
5. Application example
5.1 Data
In this section, we describe an application example of the proposed model using real data. The data were collected from an e-learning course offered in 2005 on “Information Society and Information Ethics.” The details are as follows:
Initial enrollment: 97 (of which 21 withdrew in midcourse)
Assignments: submission of 13 papers, one per week.
Number of bulletin board comments: 782
Number of missing data: 384
5.2 Example of estimation of assignment parameters
In this section, we present an example of the estimated assignment parameters. Table 1 shows part of the estimated assignment parameters. From the estimated values of the discriminatory power parameter
Table 3. Correlation matrix of the evaluation values
 | 6-instructor | 6-mean | 6-θ | 13-instructor | 13-mean | 13-θ |
6-instructor | 1 | | | | | |
6-mean | 0.899 | 1 | | | | |
6-θ | 0.890 | 0.782 | 1 | | | |
13-instructor | 0.593 | 0.571 | 0.511 | 1 | | |
13-mean | 0.875 | 0.908 | 0.848 | 0.548 | 1 | |
13-θ | 0.848 | 0.742 | 0.961 | 0.510 | 0.806 | 1 |
5.3 Example of estimation of assessor parameters
In this section, we present an estimation example of assessor parameters as well as values of assessor reliability. Table 2 shows part of the estimated assessor parameters. The response curves of the assessors are shown in Figures 3-7, derived from the parameters under the condition that the assignment parameter for discriminatory power is 1 and that for degree of difficulty is zero. Assessor #1 is highly consistent and has appropriate evaluation criteria, but the response curves indicate that he/she has a tendency to give 2, 3, and 5 as the grade point. Assessor #2 has a low consistency, and employs rather lax evaluation criteria. The response curves show that he/she prefers to give 3, 4, and 5 as grades. His/her low consistency is seen by comparing the parameter values, where
6. Evaluation of estimation accuracy of learner’s ability
The greatest advantage of the proposed method lies in its consideration of assessor characteristics, which can be expected to yield more accurate estimates of the learners’ ability. In this section, we evaluate the prediction accuracy of the learners’ evaluation points when 1) evaluation is done by a single instructor, 2) evaluation is computed from the mean value of peer assessments, and 3) the proposed method is employed. The three types of evaluation values for all 13 submitted papers are compared with those estimated from data on the first six papers. Table 3 shows the resulting correlation matrix, in which “6-instructor” denotes the evaluation of type 1) based on the first six papers, “6-mean” that of type 2), “6-θ” that of type 3), “13-instructor” the evaluation of type 1) based on all 13 papers, “13-mean” that of type 2), and “13-θ” that of type 3). The correlation between the evaluation values for six papers and 13 papers is highest for the proposed method at 0.961, which indicates its high predictive accuracy. It is noteworthy that for evaluation by a single instructor, the correlation between six and 13 papers is much lower, at 0.593. The observation that the mean value of peer assessments provides a better evaluation than instructor evaluation agrees with the findings of previous studies (Falchikov, 1986; Orpen, 1982 and Arnold, 1981). Furthermore, Ikeda (1992) has reported that when the instructor evaluated an identical student paper twice, with a one-week interval in between, the correlation was still at most around 0.7, suggesting that evaluation by a single instructor is problematic in terms of reliability, especially when there are a large number of learners.
In the present case, the instructor’s evaluation of six assignments has a correlation over 0.8 with the other methods, but that of 13 assignments has a correlation of around 0.5 with the other methods, showing that the instructor’s evaluation diverged considerably from the others. That the present method exhibits a higher reliability than the mean value of peer assessment can be explained by the former’s consideration of the heterogeneity existing among the assessors, such as in assessors #2-5 in Table 2. Our findings above show that the evaluation values obtained by the present method have a higher reliability than those obtained from the mean value of the assessors or the instructor’s subjective judgment.
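The comparison described in this section amounts to correlating, for each method, the evaluation values based on the first six papers with those based on all 13. The sketch below shows that computation with `numpy.corrcoef`. The scores are hypothetical values for five learners chosen only for illustration and are not data from the study; the helper name `correlation_table` is likewise an assumption.

```python
import numpy as np


def correlation_table(scores):
    """Return labels and the Pearson correlation matrix of the score lists."""
    labels = list(scores)
    # Each row is one variable, which is the layout np.corrcoef expects.
    matrix = np.corrcoef(np.array([scores[name] for name in labels]))
    return labels, matrix


# Hypothetical evaluation values for five learners (not data from the study).
scores = {
    "6-mean": [3.0, 3.5, 4.0, 2.5, 4.5],
    "13-mean": [3.2, 3.4, 4.1, 2.8, 4.4],
    "6-instructor": [3.0, 4.0, 3.0, 2.0, 5.0],
    "13-instructor": [4.0, 3.0, 3.0, 3.0, 4.0],
}

labels, corr = correlation_table(scores)

# Agreement between the six-paper and 13-paper values for each method.
r_mean = corr[labels.index("6-mean"), labels.index("13-mean")]
r_instructor = corr[labels.index("6-instructor"),
                    labels.index("13-instructor")]
```

A method whose six-paper values correlate strongly with its 13-paper values is the one whose early evaluations predict the final ones well, which is the sense of prediction accuracy used here.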
7. Conclusion
In this paper, we proposed an application of Item Response Theory to peer assessment, and discussed a method for parameter estimation. Specifically, the proposed model is a modified Graded Item Response model which incorporates the assessor’s assessment criterion parameters. The model was applied to real data, which showed that the present method yields evaluation values that have a higher reliability, with greater predictive efficiency, than the instructor’s assessment or the mean assessment value of all assessors. Our results demonstrate that, in large-scale e-learning courses or collaborative learning situations, the application of the present method to peer assessment among the learners yields evaluations that are more reliable than those of a single instructor.
References
1. Davies, P. (1999). Learning through assessment, OLAL on-line assessment and Learning. Proceedings of the 3rd Computer Assisted Assessment Conference, 75-78, ISBN 0-95332-103-7, UK, June 1999, The Flexible Learning Initiative, Loughborough University.
2. Akahori, K., & Kim, S. M. (2003). Peer Evaluation using the Web and some Comparisons Meta-cognition between Experts and Novices. Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications (EDMEDIA), 1484-1487, Honolulu, Hawaii, USA, June 2003, Chesapeake, VA: AACE.
3. Weaver, W., & Cotrell, H. W. (1986). Peer evaluation: a case study. 11, 25-39. ISSN 0742-5627.
4. Falchikov, N. (1986). Product comparisons and process benefits of collaborative peer group and self assessments. Assessment and Evaluation in Higher Education, 11(2), 146-166. ISSN 0260-2938.
5. Orpen, C. (1982). Student versus lecturer assessment of learning. Higher Education, 11(5), 567-572. ISSN 0018-1560.
6. Arnold, L. (1981). Use of peer evaluation in the assessment of medical students. 56(1), 35-42.
7. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, 17.
8. Ueno, M. (2005). Development of LMS “Samurai” and e-learning practice. Proceedings of Annual Conference of Educational Information System, 79-86.
9. Ueno, M., Souma, M., Kinoe, K., & Yamashita, Y. (2006). e-Learning management in Nagaoka University of Technology. Journal of educational technology, 29(3), 217-229.
10. Ikeda, H. (1992). Science of test. Japan culture science publisher.