Improved data sampling schemes for alleviating class imbalance problem
Keywords:
Imbalanced dataset, receiver operating characteristic, ensemble learning, cost-sensitive learning
Abstract
The class imbalance problem arises when standard classifiers, trained on data with a skewed class distribution, become
biased towards the majority class and largely ignore the minority class. Existing classifiers tend to maximise overall
prediction accuracy and minimise error at the expense of the minority class. However, studies have shown that the
misclassification cost of the minority class is higher and that this class should not be ignored, since it is the class
of interest. This paper presents improved data sampling schemes that can enhance the
classification performance on imbalanced datasets and increase the recall of the minority class. It also
evaluates the performance of the improved and existing schemes using the Receiver Operating
Characteristic (ROC) and the recall of the minority class, with the Friedman test for statistical analysis. The study was conducted
using seven different base classifiers on three datasets from different domains to compare the existing sampling techniques
with the improved ones. The improved sampling schemes often outperform the existing schemes and are recommended
for pre-processing imbalanced datasets before classification, so as to improve classification performance and increase
the recall of the minority class.