A Comparative Evaluation of Embedding Techniques from Language Models for Automatic Grading of Short Answer Questions

Authors

  • V Odumuyiwa Department of Computer Sciences, University of Lagos, Nigeria
  • O Adewoyin Department of Computer Sciences, University of Lagos, Nigeria
  • A Fagoroye Department of Systems Engineering, University of Lagos, Nigeria
  • E Fasina Department of Computer Sciences, University of Lagos, Nigeria
  • B Sawyerr Department of Computer Sciences, University of Lagos, Nigeria
  • O Sennaike Department of Computer Sciences, University of Lagos, Nigeria

Keywords:

Auto-grading, transformer models, BERT, OpenAI, natural language processing (NLP)

Abstract

An automatic grading system for short answer questions on an e-learning platform can reduce stress, save
time, increase instructors' productivity, and deliver feedback to students in record time. However, the
success of automatic grading of short answer questions (open-ended questions) depends on the ability of the
computer to adequately capture the semantic similarity between students’ answers and the reference answer
provided by the examiner. This paper presents a comparative study of some embedding techniques from
language models for automatic grading of short answer questions in order to address the longstanding challenge
of automating the assessment of students' responses to open-ended questions. It evaluates five embedding
techniques, namely Word2vec, Bi-LSTM, BERT, SBERT, and OpenAI embeddings, on four datasets (SemEval,
Texas, ASAG, and MIT) to identify the best-performing method for Automatic Short Answer Grading (ASAG).
Experiments include regression tasks and classification tasks using Mean Squared Error (MSE), Pearson
Correlation, and accuracy as evaluation metrics. The results indicate that fine-tuned BERT achieved the
highest accuracy of 75% on the SemEval dataset in the classification tasks, while OpenAI embeddings performed
best in the regression tasks, with an MSE of 0.57 on the Texas dataset. The research highlights automated
grading as a means of reducing instructors' workload while enhancing the quality of feedback provided to
learners. Future studies will extend the experiments to both domain-specific and non-domain-specific datasets.
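The grading mechanism the abstract describes, scoring a student's answer by the similarity between its vector representation and that of the examiner's reference answer, can be sketched as follows. This is a minimal illustration only: it uses a toy bag-of-words vector where the study uses learned embeddings (Word2vec, Bi-LSTM, BERT, SBERT, or OpenAI models), and the function names and the 5-point scale are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; in the study, a language model
    # (e.g. BERT or SBERT) would produce a dense vector instead.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def grade(student: str, reference: str, max_score: float = 5.0) -> float:
    # Scale similarity to a score, as a regression-style grader would.
    return round(cosine_similarity(embed(student), embed(reference)) * max_score, 2)

reference = "photosynthesis converts light energy into chemical energy"
print(grade("photosynthesis turns light energy into chemical energy", reference))
```

Swapping `embed` for a sentence-level model is what distinguishes the compared techniques: richer embeddings capture paraphrases ("converts" vs. "turns") that word-overlap measures miss.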

Published

2025-03-07