Spam Detection In Email Communication Using Ensemble Learning

Authors

  • Ayofe Azeez Nureni, Department of Computer Sciences, Faculty of Science, University of Lagos, Nigeria.
  • Dennis Tonye Alfred, Department of Computer Sciences, Faculty of Science, University of Lagos, Nigeria.
  • Chinyere Chioma Isiekwene, Department of Computer Sciences, Faculty of Science, University of Lagos, Nigeria.

Keywords:

Spam detection models, Cybersecurity, Ensemble techniques

Abstract

Spam detection remains a critical challenge in cybersecurity due to the increasing sophistication of unsolicited and malicious communications. These messages, often containing phishing links, fraudulent offers, and malware, pose significant risks to users and information systems. This project addresses the challenge by implementing a robust spam detection system that uses ensemble learning techniques to enhance the security of email and SMS communications. Using diverse datasets, namely the UCI ML Corpus, Spam Assassin Dataset, Ling Phishing Dataset, Nigerian Fraud Dataset, and Enron Phishing Dataset, the study applied rigorous data preprocessing and feature extraction, transforming raw text into numerical vectors with Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. Various machine learning algorithms were implemented, including Support Vector Machine, Logistic Regression, Naïve Bayes, Decision Trees, K-Nearest Neighbors (KNN), and Extra Trees, along with a range of ensemble learning algorithms, including Random Forest, AdaBoost, Gradient Boosting, and XGBoost, and their performance was recorded. The project then combines several of these algorithms and compares two primary ensemble models, the Stacking and Voting Classifiers, with the Voting Classifier emerging as the more effective. By aggregating the strengths of multiple models, the Voting Classifier, which combines models such as the Support Vector Classifier (SVC), Random Forest (RF), Extra Trees Classifier (ETC), and Naïve Bayes (NB), demonstrated superior accuracy and reliability, reporting accuracy and precision scores of around 98% and 99% for datasets 1 and 2, 97% and 97% for dataset 3, and 99% and 99% for dataset 5, respectively. This project underscores the potential of ensemble methods in enhancing spam detection systems and sets the stage for future research on integrating deep learning models and real-time detection systems to further secure digital communications.
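
The pipeline described in the abstract (TF-IDF vectorization feeding a Voting Classifier built from SVC, RF, ETC, and NB) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy messages, the hyperparameters, the choice of hard (majority) voting, and the helper name build_voting_pipeline are all assumptions made for the example.

```python
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy messages for illustration only (1 = spam, 0 = ham).
# These are NOT drawn from any of the datasets named in the abstract.
texts = [
    "win a free prize now click this link",
    "urgent claim your cash reward today",
    "cheap loans approved instantly apply now",
    "congratulations you won a lottery ticket",
    "are we still meeting for lunch tomorrow",
    "please review the attached project report",
    "can you send me the lecture notes",
    "the team call is moved to friday morning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_voting_pipeline(seed=0):
    """Hypothetical helper: TF-IDF features followed by a majority-vote ensemble."""
    voter = VotingClassifier(
        estimators=[
            ("svc", SVC(kernel="linear")),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
            ("etc", ExtraTreesClassifier(n_estimators=50, random_state=seed)),
            ("nb", MultinomialNB()),
        ],
        # Assumption: hard voting, i.e. each model casts one vote per message.
        voting="hard",
    )
    # TfidfVectorizer turns raw text into sparse numerical TF-IDF vectors.
    return make_pipeline(TfidfVectorizer(stop_words="english"), voter)


model = build_voting_pipeline()
model.fit(texts, labels)

new_messages = ["claim your free prize now", "see you at the meeting tomorrow"]
predictions = model.predict(new_messages)
for message, label in zip(new_messages, predictions):
    print(f"{message!r} -> {'spam' if label == 1 else 'ham'}")
```

Hard voting is used here because it needs only class labels from each model. Soft voting, which averages predicted probabilities, would additionally require every estimator to expose probability estimates (for SVC, that means setting probability=True).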

Published

2025-12-20