Open-ended items play an important role in evaluating students' skills, such as analysis, synthesis, and problem solving. When open-ended items are scored, rater effects can arise because standard answers are lacking and because different raters do not share a consistent understanding of the rating rules; as a consequence, the scoring results are affected by rater effects. How to estimate person, item, and rater parameters precisely is therefore an important issue. Some researchers formulated a GRM-based multilevel facets model, called the graded response multilevel facets model (GR-MLFM), to estimate person ability and handle rater effects when tasks are processed successively. They used two simulation studies to examine parameter recovery for the unconditional GR-MLFM (no predictors added to the model). The results showed that the model recovered all parameters well, indicating that the GR-MLFM is useful and reasonable; the results also showed that the random-effects model was more suitable than the fixed-effects model. The purpose of the current study was to examine whether the GR-MLFM remains reasonable when both person and rater predictors are added to the model, the so-called full GR-MLFM. One simulation study and one empirical study were conducted to evaluate the feasibility of the GR-MLFM.
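As a schematic illustration (the notation here is our own and may differ from the original parameterization of the GR-MLFM), the model can be sketched as a graded response model with a rater severity facet at level 1, and regressions of person ability on gender and of rater severity on the four rater predictors at level 2:

```latex
% Level 1: graded response facets model (illustrative notation)
% Y_{pij}: score of person p on item i from rater j; k: score category
\operatorname{logit} P(Y_{pij} \ge k) = \alpha_i \left( \theta_p - \beta_{ik} - \lambda_j \right)

% Level 2: person ability regressed on gender,
% rater severity regressed on four rater predictors
\theta_p = \gamma_0 + \gamma_1\,\mathrm{Gender}_p + \varepsilon_p
\lambda_j = \delta_0 + \delta_1\,\mathrm{Resp}_j + \delta_2\,\mathrm{Emot}_j
          + \delta_3\,\mathrm{Conf}_j + \delta_4\,\mathrm{Exp}_j + u_j
```

Here \(\theta_p\) is person ability, \(\beta_{ik}\) the category threshold, and \(\lambda_j\) the severity of rater \(j\); the predictor names are shorthand for the responsibility, emotional stability, confidence, and rating-experience variables described in the empirical study.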
For the simulation study, a two-level model was formulated: level 1 was an IRT model, and at level 2 student gender and four rater predictors were considered. R was used to generate the person response matrix, and OpenBUGS, which is based on the MCMC algorithm, was then used to estimate the model parameters. Bias, root mean square error (RMSE), and percentage bias (PB) were used to evaluate parameter recovery. The results indicated that all parameter estimates were close to the true values: the absolute differences between estimates and true values were less than .05 for all parameters. Meanwhile, the RMSEs of these estimates were small, ranging from .040 to .132. Furthermore, although seven PB values were larger than 5, most of them can be attributed to small denominators, so almost all parameters showed acceptable recovery. Based on these results, the model fit the data precisely and stably, and it is promising to apply the model to detect rater effects.
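The three recovery criteria are standard and easy to state exactly. The following minimal sketch (the replication estimates and the true value are hypothetical, not taken from the study) computes bias, RMSE, and PB for one parameter across simulation replications; it also makes concrete why a small true value (the denominator) can inflate PB even when absolute bias is tiny:

```python
import numpy as np

def recovery_stats(estimates, true_value):
    """Bias, RMSE, and percentage bias (PB) of a parameter's
    estimates across simulation replications."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    pb = 100.0 * bias / true_value  # small |true_value| inflates PB
    return bias, rmse, pb

# Hypothetical estimates of one parameter over 5 replications, true value 0.50
est = [0.48, 0.52, 0.55, 0.47, 0.51]
bias, rmse, pb = recovery_stats(est, 0.50)

# The same bias against a near-zero true value yields a much larger PB
_, _, pb_small_denom = recovery_stats([x - 0.45 for x in est], 0.05)
```

With these numbers the bias is 0.006 and PB is 1.2, while the shifted example with true value 0.05 has the same bias but a PB ten times larger, illustrating the small-denominator caveat mentioned above.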
For the empirical study, four open-ended items were used to measure students' mathematical problem-solving skills, and 20 raters were recruited to rate the responses of 80 persons who answered these items. Student gender and four rater predictors (responsibility, emotional stability, confidence, and rating experience) were added to the level-2 model to investigate rater effects. The results showed that most of the 20 raters exhibited no substantial rater effect (severity/leniency); only rater 9 displayed significant severity. Furthermore, two rater predictors had significant effects on rater effects: responsibility had a positive effect on severity, and confidence had a positive effect on leniency, while rating experience and emotional stability had no significant effect on the rating results.