Keywords:
Self-Organizing Map, Protein sequences, Alignment-free algorithm, Amino acid content ratio, Amino acid position ratio
Abstract
Advances in biotechnology have increased the rate at which biological data such as RNA, DNA and proteins are sequenced. The primary structures of proteins carry features that machine learning tools can use for classification. In this study, a clustering model for protein sequences is designed by integrating an alignment-free encoding technique (Amino-acid Content Ratio (ACR) + Amino-acid Position Ratio (APR)) with the Self-Organizing Map (SOM) algorithm. The encoding technique generates a 40-dimensional feature vector for each protein sequence, which the SOM algorithm uses to perform the clustering task. The SOM nodes are initialized randomly from the sample space, which makes the ordering of the nodes faster. The model was implemented in the Java programming language and evaluated on a data set of 500 sequences, made up of five classes of proteins (100 sequences each), collected from the UniProt Knowledgebase. The data set was clustered using learning rates from 0.1 to 0.9. The model was also compared against the use of the ACR encoding technique alone. The results showed that the model is valid and consistent in discovering quality protein clusters, with a low standard error of 0.2% for the sensitivity test and a low standard error of between 0.05% and 0.1% for the specificity test. They also showed that the (ACR+APR) encoding technique is more sensitive and specific than the ACR technique alone.
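The pipeline summarized above (a 40-dimensional alignment-free encoding followed by a SOM whose nodes start as randomly chosen samples) can be illustrated with a minimal Python sketch. It is not the authors' Java implementation, and several details are assumptions made only for illustration. ACR is taken to be the count of each of the 20 standard residues divided by the sequence length. APR is taken to be the sum of the 1-based positions of each residue divided by the sum of all positions. The SOM is a one-dimensional map with a linearly decaying learning rate and neighbourhood radius. The function names `encode`, `best_matching_unit` and `train_som`, the parameter defaults, and the toy sequences are hypothetical and are not real UniProt entries.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode(sequence):
    """Return a 40-dimensional vector: 20 content ratios, then 20 position ratios."""
    seq = [c for c in sequence.upper() if c in AMINO_ACIDS]
    n = len(seq)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    position_total = n * (n + 1) / 2  # sum of the 1-based positions 1..n
    acr, apr = [], []
    for aa in AMINO_ACIDS:
        positions = [i + 1 for i, c in enumerate(seq) if c == aa]
        acr.append(len(positions) / n)               # assumed ACR definition
        apr.append(sum(positions) / position_total)  # assumed APR definition
    return acr + apr


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(nodes, vector):
    """Index of the node whose weight vector is closest to the input vector."""
    return min(range(len(nodes)), key=lambda j: squared_distance(nodes[j], vector))


def train_som(vectors, n_nodes=5, learning_rate=0.5, epochs=50, seed=0):
    """Train a one-dimensional SOM whose nodes are initialized from random samples."""
    rng = random.Random(seed)
    # Sample-based initialization: each node starts as a copy of a data vector.
    nodes = [list(v) for v in rng.sample(vectors, n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                 # decays linearly from 1 towards 0
        rate = learning_rate * frac                 # assumed decay schedule
        radius = int(round((n_nodes // 2) * frac))  # shrinking neighbourhood
        for v in rng.sample(vectors, len(vectors)):  # shuffled pass over the data
            bmu = best_matching_unit(nodes, v)
            low, high = max(0, bmu - radius), min(n_nodes, bmu + radius + 1)
            for j in range(low, high):
                # Move the winner and its neighbours towards the input vector.
                nodes[j] = [w + rate * (x - w) for w, x in zip(nodes[j], v)]
    return nodes


# Toy usage: made-up sequences, clustered onto a two-node map.
sequences = ["MKVLAAGIVGL", "MKVLAAGIVAL", "GSSGSSGSSGS", "GSSGSSGSTGS"]
vectors = [encode(s) for s in sequences]
nodes = train_som(vectors, n_nodes=2)
print([best_matching_unit(nodes, v) for v in vectors])
```

After training, each sequence is assigned to the cluster of its best-matching node. Rerunning `train_som` with `learning_rate` set to values from 0.1 to 0.9 would mirror the range of learning rates reported in the study, although the decay schedule used here is only one plausible choice.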