Assessment of Selected Data Mining Classification Algorithms for Analysis and Prediction of Certain Diseases
Keywords: Classification algorithm, Data mining, Decision tree, Naïve Bayes, k-nearest neighbour
Medical science generates large volumes of data that are stored in medical repositories and can be mined for hidden information essential to disease diagnosis and prognosis. In recent times, the application of data mining to knowledge discovery has shown impressive results in disease analysis and prediction. This study investigates the performance of three data mining classification algorithms, namely decision tree, Naïve Bayes, and k-nearest neighbour (k-NN), in predicting the likelihood of occurrence of chronic kidney disease, breast cancer, diabetes, and hypothyroidism. The datasets, obtained from the UCI Machine Learning Repository, were evaluated under two splits: 60% for training and 40% for testing, and 70% for training and 30% for testing. The performance measures considered were classification accuracy, error rate, execution time, confusion matrix, and area under the curve. The algorithms were implemented in the Waikato Environment for Knowledge Analysis (WEKA). The results showed that decision tree recorded the highest prediction accuracy, followed by Naïve Bayes and then k-NN, while k-NN recorded the shortest execution time on the four datasets. However, k-NN also recorded the largest average percentage error across the datasets. The findings therefore suggest that the performance of these classification algorithms could be influenced by the type and size of the datasets.
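The evaluation protocol described above (two train/test splits, then accuracy, error rate, confusion matrix, and execution time per classifier) can be illustrated with a minimal sketch. It is not the study's WEKA workflow: it uses only the Python standard library, a hand-written k-NN with an assumed k = 3, and synthetic two-class data in place of the UCI datasets. The function names `split`, `knn_predict`, and `evaluate` are hypothetical and introduced only for this illustration.

```python
import random
import time
from collections import Counter


def split(rows, train_fraction, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train, features, k=3):
    """Majority vote among the k training rows nearest to features."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def evaluate(train, test, k=3):
    """Return confusion counts, accuracy, error rate, and elapsed seconds."""
    start = time.perf_counter()
    confusion = Counter()  # keys are (actual, predicted) pairs
    for features, actual in test:
        confusion[(actual, knn_predict(train, features, k))] += 1
    elapsed = time.perf_counter() - start
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    accuracy = correct / len(test)
    return {
        "confusion": dict(confusion),
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "seconds": elapsed,
    }


# Synthetic toy data (an assumption for illustration, not a UCI dataset).
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "negative") for _ in range(50)]
data += [((rng.gauss(4, 1), rng.gauss(4, 1)), "positive") for _ in range(50)]

# The two splits used in the study: 60/40 and 70/30.
for fraction in (0.6, 0.7):
    train, test = split(data, fraction)
    result = evaluate(train, test)
    print(
        f"train={len(train)} test={len(test)} "
        f"accuracy={result['accuracy']:.3f} "
        f"error_rate={result['error_rate']:.3f} "
        f"seconds={result['seconds']:.4f}"
    )
```

Area under the curve is omitted here because it requires class probability scores, which this bare majority-vote k-NN does not produce.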