A Review on Biclustering Algorithms for Data Mining Analysis of Gene Expression Data
Keywords:
data mining, data, gene expression, biclustering, gene expression analysis
Abstract
Data mining techniques have proven useful for extracting novel and insightful discoveries from gene expression data. Over the past few decades, these approaches have been valuable for disease diagnosis, drug discovery, and understanding gene function. Well-known examples include classification, dimension reduction, association rules, clustering, and biclustering. In recent years, biclustering has emerged as a state-of-the-art data mining method and has demonstrated its efficacy for studying gene expression data. Various studies in the existing literature have attempted to classify biclustering methods into different categories. This paper presents an extensive survey and classification of the biclustering methods proposed in the last ten years. These methods were grouped into six categories, namely probabilistic models, iterative greedy search, nature-inspired models, linear algebra models, and hybrid approaches. Hybrid nature-inspired models were found to be particularly well suited to complex, nonlinear, and high-dimensional problems such as biclustering when compared with other methods. Nature-inspired methods can solve difficult problems from seemingly simple initial rules and conditions despite having little or essentially no knowledge of the search space. However, they are known to have deficiencies that can prevent them from finding optimal solutions. These deficiencies can be curtailed by hybridizing them with another search method. The reviewed studies were also grouped according to the intra- and inter-bicluster evaluation functions they used: the former measure coherence within biclusters, and the latter measure how accurately an algorithm recovers real implanted biclusters in a matrix.
Most of the studies that used evaluation functions employed the mean squared residue (MSR) and the Jaccard index as their intra- and inter-bicluster evaluation functions, respectively. The review also showed that most studies focused on yeast expression data and a few other gene expression data sets. This study therefore proposes that more attention be given to other expression data sets in order to improve disease diagnosis, prognosis, and prevention.
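To make the two evaluation functions named above concrete, the following is a minimal sketch in plain Python, not code from any of the reviewed studies. It assumes the standard Cheng and Church definition of the MSR (the mean of squared residues a_ij - a_iJ - a_Ij + a_IJ over the bicluster) and, as one common convention, computes the Jaccard index over the sets of matrix cells covered by two biclusters. The function names and the (rows, cols) representation of a bicluster are illustrative assumptions.

```python
def mean_squared_residue(matrix, rows, cols):
    """MSR of the bicluster given by row indices `rows` and column indices `cols`.

    Lower values indicate a more coherent bicluster; a perfectly additive
    pattern gives 0.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


def jaccard_index(bicluster_a, bicluster_b):
    """Jaccard index of two biclusters, each a (rows, cols) pair.

    Assumption: overlap is measured on the (row, column) cells each
    bicluster covers, e.g. a found bicluster versus an implanted one.
    """
    cells_a = {(i, j) for i in bicluster_a[0] for j in bicluster_a[1]}
    cells_b = {(i, j) for i in bicluster_b[0] for j in bicluster_b[1]}
    union = cells_a | cells_b
    if not union:
        return 0.0
    return len(cells_a & cells_b) / len(union)


# Toy data (made up for illustration): rows differ only by a constant shift,
# so the whole matrix is a perfectly additive bicluster.
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
]
msr = mean_squared_residue(expression, [0, 1, 2], [0, 1, 2])
overlap = jaccard_index(([0, 1], [0, 1]), ([1, 2], [0, 1]))
```

In this sketch the MSR scores a single bicluster on its own, while the Jaccard index needs a reference bicluster to compare against, which is why it suits experiments with implanted biclusters.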