Journal of Psychological Science ›› 2023, Vol. 46 ›› Issue (4): 960-970.DOI: 10.16719/j.cnki.1671-6981.202304025

• Psychological statistics, Psychometrics & Methods •

Influence Factors of Cross-Test-Cycles Linking: A Modified Single Group Design

Chen Ping, Li Xiao, Ren He, Xin Tao   

  1. Collaborative Innovation Center of Assessment for Basic Education Quality, Beijing Normal University, Beijing, 100875
  • Online: 2023-07-20 Published: 2023-08-14

Factors Influencing Cross-Year Equating under a Modified Single-Group Design*

Chen Ping, Li Xiao**, Ren He, Xin Tao   

  1. Collaborative Innovation Center of Assessment for Basic Education Quality, Beijing Normal University, Beijing, 100875
  • Corresponding author: ** Li Xiao, E-mail: lixiao19871117.student@sina.com
  • Funding:
    *This research was supported by the Chinese Testing International research fund project "A Study of Cross-Year Equating Based on Large-Scale Assessment Programs in China" (CTI2018B02) and the basic education quality monitoring research fund of the Collaborative Innovation Center of Assessment for Basic Education Quality (2019-01-082-BZK01, 2019-01-082-BZK02)

Abstract: Cross-test-cycles linking (CTCL) makes the test scores of successive test cycles longitudinally comparable and thus allows the developmental trend of examinees' ability to be characterized. The linking design, an important part of CTCL, must be secure and efficient if the CTCL scheme is to be scientifically sound. International large-scale assessments (ILSAs) such as PISA, TIMSS, and PIRLS all employ the non-equivalent groups anchor test (NEAT) design for CTCL. However, the NEAT design carries item-exposure risks, making it unsuitable for large-scale assessments in China that require a high level of test security.
To this end, this study proposed a new CTCL design (a modified single-group design) suited to the conditions in China. The new design collects linking data by having a sample of anchor examinees answer a set of anchor items: a linking sample is randomly selected from the examinees who took the new form and administered an anchor test composed of items drawn from the old form. Under the new design, the equating method, the size of the linking sample, the length and item format of the anchor test, and the heterogeneity of examinees' ability distributions across test cycles all affect the equating precision of CTCL. Before applying the design in practice, this study therefore examined the influence of these five factors on the equating precision of CTCL.
To achieve this, a series of simulation studies was conducted by manipulating the five factors. Specifically, four equating methods (fixed-parameter calibration [FPC], separate calibration & scale transformation [SC&ST], FPC&ST, and concurrent calibration & ST [CC&ST]), three linking-sample sizes (1500, 8000, and 18000), two anchor-test lengths (20 and 30), two anchor-test item formats (a mixed test consisting of multiple-choice and constructed-response items vs. multiple-choice items only), and two mean differences between the examinee ability distributions of the new and old forms (0.01 and 0.25) were considered.
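The fully crossed design described above can be enumerated as a quick sanity check. The following sketch uses illustrative factor labels, not code from the study:

```python
from itertools import product

# The five manipulated factors of the simulation design (labels are illustrative)
methods = ["FPC", "SC&ST", "FPC&ST", "CC&ST"]   # equating methods
sample_sizes = [1500, 8000, 18000]              # linking-sample sizes
anchor_lengths = [20, 30]                       # anchor-test lengths (items)
item_formats = ["mixed", "MC-only"]             # anchor-test item formats
mean_diffs = [0.01, 0.25]                       # ability-distribution mean differences

# Fully crossing the factors yields 4 * 3 * 2 * 2 * 2 = 96 simulation conditions
conditions = list(product(methods, sample_sizes, anchor_lengths,
                          item_formats, mean_diffs))
print(len(conditions))
```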
The results showed that: (1) FPC&ST and CC&ST outperformed the other equating methods in that they yielded smaller equating errors and provided accurate and stable equating results even when the linking sample was relatively small (i.e., 1500); (2) both the length and the item format of the anchor test affected equating precision, but the direction and magnitude of the effect varied with the equating method; (3) the larger the difference between the examinee ability distributions, the lower the equating precision; and (4) increasing the length of the mixed-format anchor test and the size of the linking sample could compensate for the equating error caused by a large difference in the examinee ability distributions.
Findings suggested that the FPC&ST and CC&ST methods are preferred under the modified single-group design. For these two methods, the longer the anchor test, the smaller the RMSEs and the better the performance; it is recommended, however, that the anchor test be at least 50 percent as long as the old form. Moreover, using a mixed-format anchor test may improve the performance of the two methods. Further research could conduct empirical studies, apply other IRT models, and consider multiple linking scenarios.
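As background to the scale-transformation (ST) step in the recommended methods, one classical way to place new-form item parameters onto the old-form scale is mean/sigma linking. The abstract does not specify which ST procedure the study used, so the sketch below is illustrative, with toy difficulty values:

```python
import numpy as np

def mean_sigma_transform(b_new, b_old):
    """Mean/sigma linking constants from anchor-item difficulty estimates.

    b_new, b_old: difficulty estimates of the same anchor items on the
    new-form and old-form scales. Returns (A, B) such that a new-form
    difficulty b maps to the old-form scale as A * b + B.
    """
    A = np.std(b_old, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_old) - A * np.mean(b_new)
    return A, B

# Toy anchor difficulties (illustrative, not from the study): the old-form
# values are an exact affine transform of the new-form values, so mean/sigma
# linking should recover A = 1.1 and B = 0.25.
b_new = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
b_old = 1.1 * b_new + 0.25
A, B = mean_sigma_transform(b_new, b_old)
print(A, B)
```

In practice A and B are estimated from calibrated anchor-item parameters, and competing ST procedures (e.g., mean/mean, Haebara, Stocking-Lord) differ in how they weight the anchor items.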

Key words: large-scale assessments, linking plan, equating design, IRT equating method

Abstract (Chinese): To meet the high security requirements of assessment programs in China, this study proposes a new cross-year equating design that combines anchor examinees with anchor items, and uses simulation studies based on empirical data to investigate how the equating method, the number of anchor examinees, the anchor-test assembly format, and the difference in examinee ability across test cycles affect equating precision. The results show that all of these factors affect equating precision, with the equating method having the most prominent effect. Recommendations: (1) when anchor examinees are few, use equating methods that involve scale transformation; (2) the anchor-test assembly format should match the computational characteristics of the equating method; (3) when the ability difference across cycles is large, increase the number of anchor examinees or adjust the anchor-test assembly scheme as appropriate.

Key words (Chinese): large-scale assessment programs, cross-year equating plan, equating design, item response theory (IRT) equating methods