Journal of Psychological Science ›› 2021, Vol. ›› Issue (2): 274-281.

Previous Articles     Next Articles

The Role of Machine Learning in Solving Overfitting

  

  • Received:2019-04-11 Revised:2020-03-02 Online:2021-03-20 Published:2021-03-20

机器学习在解决过拟合现象中的作用

董波1,陈艾睿1,张明2   

  1. 1. 苏州科技大学
    2. 苏州大学教育学院
  • 通讯作者: 张明

Abstract: Overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. There are modal overfitting and procedure overfitting in current psychology. An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variances (i.e. the noise) as if that variances represented underlying model structure. One of the ways to avoid overfitting is machine learning. Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. Machine learning algorithms are used in email filtering, detection of network intruders, and computer vision. Here, we summarized the values and ways of machine learning in prediction. First of all, we describe the current state of psychology that explanation without prediction by introducing modal overfitting and procedural overfitting. In modal overfitting, if the researcher only pursues the high degree of goodness of fit of the model, it will lead to a decrease of prediction of the model in other samples. The practice of flexibly selecting analytical procedures based in part on the quality of the results they produce has come to be known as p-hacking or data-contingent analysis. This is mainly because the bias (not variance) of sample is incorrectly fitted to it. Secondly, we show the essential reason of overfitting: the viewpoint of ‘A model could interpret well does not mean that it could reliably predict human behavior’. Although explanation and prediction are relatively close in philosophy, they differ greatly from a statistical and practical point of view. From a statistical point of view, goodness of fit is an important indicator to measure how well the model fits the data generation process. The higher the goodness of fit of the model, the more variances can be explained, the closer it is to the real process of data generation. However, the variances contain meaningful variances caused by independent variables and meaningless variances caused by sampling. Theoretically, only meaningful variance needs to fit, but the existing tools cannot distinguish between each other in practice. This limitation directly leads to overfitting. Overfitting models have a high degree of explanation for sample data, but the prediction is poor in other samples. From a practical point of view, it is difficult to construct a model with good interpretation and prediction at the same time in reality. Thirdly, the logic of modeling and key technologies of machine learning were introduced. The two principles are: 1) Decomposing errors into bias and variances, we just fit variances to avoid the phenomenon of high variance of modeling; 2) Trading off biases and variances to minimize prediction errors. The three core technologies are cross-validation, regularization and big data. All of them guarantee the prediction of the model rather than explanation. At last, we use sample data and MATLAB code to illustrate the specific application process of machine learning in model fitting. We believe that psychology researchers should try to solving practical problems, using the method(s) of machine learning and building more accurate and robust prediction models.

Key words: overfitting, machine learning, logics of modeling, key technologies, application example

摘要: 过拟合现象是心理学走向预测科学的重要阻碍。文章综述了机器学习在解决过拟合现象中的价值和实现途径:(1)介绍了过拟合的两种表现形式和现状;(2)分析过拟合的根因,即“高解释力≠高预测力”;(3)厘清机器学习的建模逻辑与核心技术在解决过拟合中的作用;(4)利用样例数据和代码说明机器学习统计思想在模型拟合中的具体应用过程。文章指出心理学应从解决实际问题的角度出发,借鉴机器学习的分析思想,避免过拟合,进而提供更准确更稳定的结论和预测模型。

关键词: 过拟合, 机器学习, 建模逻辑, 核心技术, 应用举例