Journal of Psychological Science (心理科学), 2021, Issue (2): 330-339.

• Development and Education •

Psychological Factors and Academic Performance: A Machine Learning Classification Prediction Model

Xin-Fang DING (丁欣放)1, Jing NIE (聂晶)2, Bin ZHANG (张斌)1

  1. Capital Medical University
  2. Peking University
  • Received: 2020-05-16 Revised: 2020-12-17 Published: 2021-03-20 Released: 2021-03-20
  • Corresponding author: Xin-Fang DING (丁欣放)

Using Demographic Information, Psychological Assessment Data and Machine Learning to Predict Students’ Academic Performance

  • Received: 2020-05-16 Revised: 2020-12-17 Online: 2021-03-20 Published: 2021-03-20
  • Contact: Xin-Fang DING

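Summary: As higher education expands, poor academic performance has become a problem that cannot be ignored. Predicting which students are likely to perform poorly and intervening early can lower dropout rates and reduce the loss of educational resources. Because poor academic performance has many contributing factors with complex interrelations, traditional research methods based on correlation analysis struggle to produce early prediction models that can be applied in practice. This study used machine learning algorithms to mine the data and build a model that predicts academic performance. Data on mental health, coping style, personality, locus of control and related demographic information were collected from 653 first-year university students, and their academic results were collected one year later. Classification models were built with Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT) and Naïve Bayes (NB) algorithms. The random forest algorithm performed best at identifying students with poor academic performance, with precision of 95.86%, recall of 91.83% and an F1 score of 93.80%. Feature importance analysis showed that the 10 features contributing most to the model were age, gender, only-child status, locus of control, neuroticism, positive coping, agreeableness, the general symptomatic index, openness and anxiety level. To guard against overfitting, the model was validated on a sample of 166 first-year students collected one year later; it generalized well to the new dataset, with an F1 score of 90.90%, precision of 92.60% and recall of 89.26%. The study suggests that, given demographic and psychological assessment information, machine learning algorithms can help identify students with poor academic performance early and can inform early intervention.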

Keywords: academic performance, machine learning, prediction, psychological factors, classification prediction model

Abstract: Tracking college students’ academic performance and predicting which students are likely to fail courses are important for providing early intervention and increasing retention rates. Previous studies have found that many psychological factors are correlated with academic marks, including personality, coping styles, mental health, and academic and social motivational constructs. However, the traditional approach of studying correlated factors often fails to yield an early prediction model, because the mechanisms underlying poor academic performance are generally complicated and the patterns are sometimes implicit. Machine learning detects implicit patterns in large datasets via algorithms and statistical models; it can strengthen exploratory analysis through internal cross-validation and is more robust to outliers. The present study used a machine learning approach, with demographic information and psychological assessment results as input, to distinguish students who failed courses in their first year at college from those who did not. Six hundred and fifty-three participants from five universities in northern China were recruited. They completed a demographic information survey, the Symptom Checklist 90, the Rotter Internal-External Locus of Control Scale, the Trait Coping Style Questionnaire and the Big-Five Personality Inventory-10. These questionnaires measured mental health, generalized control expectations (internal-external locus of control), coping styles and personality, respectively. Academic performance information was collected one year later. Low-performing students were defined as those who failed at least one course in their first year at college.
Five machine learning algorithms, Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Naïve Bayes (NB) and Decision Tree (DT), were trained to build a binary classification model that detects low-performing students. The RF algorithm achieved the highest F1 score, with accuracy = 99.00%, precision = 95.86%, recall = 91.83% and F1 score = 93.80%. Feature importance analysis revealed that features extracted from both the demographic information and the psychological assessment questionnaires were important in predicting a college student’s academic performance. The 10 most important features in the RF model were age, gender, only-child status, internal-external locus of control, neuroticism, positive coping, agreeableness, general symptomatic index, openness and anxiety level. Overfitting occurs when a model fits the peculiarities of the training dataset too closely and fails to find a general predictive rule. To check for it, a new dataset (n = 166) was collected one year later and used to test the generalization performance of the model. The model generalized well to the new dataset, with F1 score = 90.90%, accuracy = 97.84%, precision = 92.60% and recall = 89.26%. The study shows the potential of machine learning approaches for predicting which students are likely to fail courses from demographic and psychological assessment information. The results demonstrate that the RF algorithm can be used effectively to build a classification model that identifies low-performing students, pointing to future applications in which early intervention for these students becomes possible.

Key words: Academic performance, Machine learning, Prediction, Psychological factors, Classification prediction model
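
The precision, recall and F1 figures reported in the abstract can be cross-checked, because the F1 score is the harmonic mean of precision and recall. The Python sketch below runs that check for the training and validation results. It also shows, separately, how accuracy can exceed both precision and recall when low-performing students are a minority. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; the source does not report a confusion matrix.

```python
# Values reported in the abstract, expressed as fractions.
REPORTED = {
    "training sample (n=653)": {"precision": 0.9586, "recall": 0.9183, "f1": 0.9380},
    "validation sample (n=166)": {"precision": 0.9260, "recall": 0.8926, "f1": 0.9090},
}


def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    The positive class is "low-performing student".
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


if __name__ == "__main__":
    # Compare each reported F1 with the value implied by precision and recall.
    for name, row in REPORTED.items():
        recomputed = f1_from(row["precision"], row["recall"])
        print(f"{name}: reported F1 = {row['f1']:.4f}, recomputed = {recomputed:.4f}")

    # HYPOTHETICAL counts (not from the study): 100 low-performing students
    # among 1000, of whom 90 are detected, plus 5 false alarms.
    example = metrics_from_counts(tp=90, fp=5, fn=10, tn=895)
    for key, value in example.items():
        print(f"{key}: {value:.4f}")
```

Under these definitions, both reported F1 scores agree with their reported precision and recall to within rounding. In the hypothetical example, the large number of correctly classified non-failing students pushes accuracy above precision and recall, which is one reason to look at F1 and not at accuracy alone when the positive class is small.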