Journal of Psychological Science ›› 2022, Vol. 45 ›› Issue (4): 988-997.


Effects of Several Factors on IRT Observed Score Kernel Equating

  

  • Received:2020-04-10 Revised:2020-11-17 Online:2022-07-20 Published:2022-07-20

Factors Influencing Item Response Theory Observed Score Kernel Equating (项目反应理论观察分数核等值的影响因素)

Wang Shaojie (王少杰)1,2, Zhang Minqiang (张敏强)2, Huang Feifei (黄菲菲)2, Huang Lifang (黄丽芳)2, Yuan Qiting (袁琪婷)2

  1. Guangdong University of Education
    2. South China Normal University
  • Corresponding author: Zhang Minqiang

Abstract: Owing to its pre-smoothing and continuization of score distributions, kernel equating has been shown to match or outperform other equating methods, especially traditional ones, in equating accuracy and stability. IRT observed score kernel equating combines kernel equating with IRT observed score equating, but few studies have evaluated its performance systematically. This study therefore investigated how bandwidth selection method, sample size, test length, equating design, and data simulation method affect it. To ensure ecological validity, data from a large-scale assessment served as the sampling pool. Two data simulation methods, the IRT method and the pseudo-tests and pseudo-groups method, were used to avoid simulation preference under both the random Equivalent Groups (EG) design and the Non-Equivalent groups with Anchor Test (NEAT) design. The bandwidth selection methods were the Penalty method, Silverman’s rule of thumb, and the Double smoothing method. Sample sizes were 1000, 2000, and 5000, and tests contained 30 or 45 items. Both local and universal criteria were computed: the local criteria were Percent Relative Error (PRE) and Standard Error of Equating (SEE), and the universal criteria were Averaged Percent Relative Error (APRE) and Averaged Standard Error of Equating (ASEE). In EG, regarding the local criteria, PRE increased with the order of the central moment, meaning that the score distributions before and after equating differed more at higher-order moments. Nonetheless, because PRE is the original difference multiplied by 100, the bandwidth selection methods can be regarded as performing alike. PRE was significantly reduced by increasing sample size and by lengthening the test, especially the latter.
As with PRE, the bandwidth selection methods did not differ in SEE. Larger samples yielded less random error, whereas longer tests had the opposite effect. The SEE curves were “high at left but low at right” for the pseudo-tests and pseudo-groups method and “low at left but high at right” for the IRT simulation method. As for the universal criteria, APRE was similarly small across the bandwidth selection methods, and the effects of sample size and test length matched those observed for the local criteria. ASEE did not differ significantly between the two data simulation methods. In NEAT, regarding the local criteria, PRE again increased with the order of the central moment. The Penalty method and Silverman’s rule of thumb gave coinciding results and outperformed the Double smoothing method, and this advantage was more evident for the shorter test. As in EG, PRE was significantly reduced by lengthening the test, but not by increasing sample size. Notably, PRE for the Double smoothing method was most affected by sample size when the test had 30 items and the IRT simulation method was used, which indicates interactions among these factors. For SEE, the bandwidth selection methods performed alike and differed only at extreme scores. Increasing sample size and lengthening the test both reduced random error. The distribution of SEE was more stable for the pseudo-tests and pseudo-groups method than for the IRT method. As for the universal criteria, the trends in APRE and ASEE matched those in the local criteria. In summary, the bandwidth selection methods performed similarly in EG, but the Penalty method and Silverman’s rule of thumb prevailed in NEAT. Bandwidth selection, sample size, and test length jointly affected IRT observed score kernel equating.
A preference associated with the data simulation method was observed, suggesting that researchers comparing equating methods should use multiple simulation methods and designs before drawing final conclusions. Future studies should focus more on the systematic evaluation of equating.
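
For readers unfamiliar with the quantities named in the abstract, the following is a minimal Python sketch of Gaussian kernel continuization, a bandwidth chosen by a rule of thumb, and PRE. It is an illustration under stated assumptions, not the authors' implementation. The bandwidth uses the generic Silverman form h = 0.9 · SD · n^(-1/5), which may differ from the variant used in the study. PRE is computed from raw moments, as in the usual kernel equating definition, which is also an assumption here. All function names are hypothetical.

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mean_var(scores, probs):
    # Mean and variance of a discrete score distribution.
    mean = sum(x * p for x, p in zip(scores, probs))
    var = sum((x - mean) ** 2 * p for x, p in zip(scores, probs))
    return mean, var

def silverman_bandwidth(scores, probs, n):
    # Generic rule of thumb (assumption): h = 0.9 * SD * n^(-1/5).
    _, var = mean_var(scores, probs)
    return 0.9 * math.sqrt(var) * n ** (-0.2)

def continuized_cdf(x, scores, probs, h):
    # Gaussian kernel continuization that preserves mean and variance:
    # F_h(x) = sum_j r_j * Phi((x - a*x_j - (1 - a)*mu) / (a*h)),
    # with a = sqrt(var / (var + h^2)).
    mean, var = mean_var(scores, probs)
    a = math.sqrt(var / (var + h * h))
    return sum(
        p * norm_cdf((x - a * xj - (1.0 - a) * mean) / (a * h))
        for xj, p in zip(scores, probs)
    )

def equate(x, x_scores, x_probs, hx, y_scores, y_probs, hy):
    # Equipercentile link on the continuized CDFs: e_Y(x) = G^{-1}(F(x)).
    # G is inverted by bisection because it is strictly increasing.
    target = continuized_cdf(x, x_scores, x_probs, hx)
    _, var_y = mean_var(y_scores, y_probs)
    pad = 10.0 * math.sqrt(var_y)
    lo, hi = min(y_scores) - pad, max(y_scores) + pad
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if continuized_cdf(mid, y_scores, y_probs, hy) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pre(p, equated, x_probs, y_scores, y_probs):
    # Percent Relative Error in the p-th raw moment (assumed definition):
    # 100 * (mu_p(e_Y(X)) - mu_p(Y)) / mu_p(Y).
    mu_eq = sum(e ** p * r for e, r in zip(equated, x_probs))
    mu_y = sum(y ** p * s for y, s in zip(y_scores, y_probs))
    return 100.0 * (mu_eq - mu_y) / mu_y

# Toy example: form Y is form X shifted by two score points.
x_scores = [0, 1, 2, 3, 4]
x_probs = [0.1, 0.2, 0.4, 0.2, 0.1]
y_scores = [x + 2 for x in x_scores]
y_probs = list(x_probs)
hx = silverman_bandwidth(x_scores, x_probs, n=1000)
hy = silverman_bandwidth(y_scores, y_probs, n=1000)
equated = [equate(x, x_scores, x_probs, hx, y_scores, y_probs, hy)
           for x in x_scores]
pre_values = [pre(p, equated, x_probs, y_scores, y_probs) for p in (1, 2, 3)]
```

In this toy case the two forms differ only by a constant shift, so each equated score should sit two points above the original and PRE should be close to zero at every moment. With real data the two distributions differ in shape, and PRE at higher moments indicates how well the equated distribution matches the target.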

Key words: IRT observed score kernel equating, bandwidth selection methods, equating design, data simulation methods

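Chinese abstract (translated): This study examined how bandwidth selection method, sample size, number of items, equating design, and data simulation method affect IRT observed score kernel equating. Data were generated with two simulation methods, and local and universal evaluation criteria were computed. In the random groups design, the bandwidth selection methods performed similarly, and examinee sample size and number of items had little effect. In the non-equivalent groups design, the Penalty method and Silverman's rule of thumb performed best; increasing the number of items reduced both percent relative error and random error, whereas increasing sample size raised percent relative error and reduced random error. The data simulation method can affect the evaluation of equating. Future work should focus on the systematic evaluation of equating.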

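Chinese key words (translated): IRT observed score kernel equating, bandwidth selection methods, equating design, data simulation methods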