A Machine Learning Framework for Classifying Haemoglobin Levels in Sickle Cell Anaemia Patients

Authors

  • O. B. Olajide, Department of Computer Science, University of Ibadan, Nigeria
  • A. B. Sakpere, Department of Computer Science, University of Ibadan, Nigeria
  • A. B. Adeyemo, Department of Computer Science, University of Ibadan, Nigeria
  • G. I. Ogbole, Department of Radiology, University of Ibadan College of Medicine, Ibadan, Nigeria
  • S. A. Arekete, Department of Computer Science, Redeemer’s University, Ede, Nigeria
  • S. B. Aribisala, Department of Computer Science, Lagos State University, Nigeria

Keywords:

Haemoglobin level classification, Logistic Regression, Machine learning models, Sickle cell anaemia, Support Vector Machine

Abstract

Sickle Cell Anaemia (SCA) significantly impacts haemoglobin (HGB) levels, leading to severe health complications with high mortality rates. In Nigeria, about 2% of newborns, approximately 150,000 annually, are diagnosed with SCA. Accurate HGB monitoring is essential for effective disease management, yet traditional methods are labour-intensive and prone to errors. This calls for automated and reliable diagnostic techniques, such as machine learning (ML), to improve SCA management. This study classifies HGB levels in SCA patients using clinical records and ML techniques. A dataset of 364 records (203 from female patients) was obtained from Kaggle, a public data repository. It contains eleven (11) features: age, sex, red blood cell (RBC) count, packed cell volume (PCV), mean corpuscular volume (MCV), mean corpuscular haemoglobin (MCH), mean corpuscular haemoglobin concentration (MCHC), red cell distribution width (RDW), total leukocyte count (TLC), platelets per cubic millimeter (PLT/mm³), and haemoglobin (HGB). Two ML models, Logistic Regression (LR) and Support Vector Machine (SVM), were evaluated under two feature configurations: all features and selected features. The latter identified age, RBC, PCV, MCV, and HGB as key predictors. Continuous HGB values were categorized into (1) low, (2) normal, and (3) high using standard medical metrics. The Synthetic Minority Over-sampling Technique (SMOTE) was applied to mitigate class imbalance. With the selected features, SVM with a Radial Basis Function (RBF) kernel achieved 84.90% accuracy and an AUC-ROC of 93.40%, while LR underperformed with 79.50% accuracy and an AUC-ROC of 90.90%. With all features, SVM improved to 91.80% accuracy and an AUC-ROC of 98.20%, while LR achieved 93.20% accuracy and an AUC-ROC of 98.90%. Both models demonstrated high accuracy: LR performed best with all features, whereas SVM outperformed LR with selected features.
Future work will use primary datasets, explore additional feature selection techniques and ML algorithms, and incorporate haemoglobin variants to provide further insight into SCA progression and, in turn, support personalized treatment.
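
The two preprocessing steps named in the abstract, categorizing continuous HGB values and oversampling minority classes, can be illustrated with a minimal sketch. This is not the authors' implementation. The cut-off values in `HGB_NORMAL_RANGE` are an assumption (commonly quoted adult reference ranges in g/dL, since the abstract does not state the thresholds used), and the function names `categorize_hgb` and `smote_oversample` are hypothetical. The oversampler is a simplified, pure-Python version of the SMOTE interpolation idea rather than a library call.

```python
import math
import random

# ASSUMPTION: illustrative adult reference ranges in g/dL. The abstract only
# says "standard medical metrics" and does not give the actual cut-offs.
HGB_NORMAL_RANGE = {"F": (12.0, 15.5), "M": (13.5, 17.5)}


def categorize_hgb(hgb, sex):
    """Map a continuous HGB value to 1 (low), 2 (normal) or 3 (high)."""
    low, high = HGB_NORMAL_RANGE[sex]
    if hgb < low:
        return 1
    if hgb > high:
        return 3
    return 2


def smote_oversample(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: build n_new synthetic samples, each lying on the
    line segment between a minority sample and one of its k nearest
    minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical records: (HGB in g/dL, sex)
records = [(8.2, "F"), (14.0, "M"), (18.1, "M")]
labels = [categorize_hgb(h, s) for h, s in records]  # [1, 2, 3]

# Hypothetical minority-class feature vectors, e.g. (RBC, PCV)
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]
extra = smote_oversample(minority, n_new=5)
```

In a setup like the one described, oversampling would normally be applied to the training split only, so that synthetic samples do not leak into the evaluation data.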

Published

2025-12-23