Dr. Sumeet Dua

Max P. & Robbie L. Watson Eminent Scholar Chair


Ravi Kanth Meka (2005)


Fast Protein Structure Classification Using Spatial Aggregation of Orthonormal Coefficients; MS-CS Thesis; Student: Ravi Kanth Meka (2005)

The drastic increase in the size of protein sequence and structure databases has necessitated the design and development of novel computational frameworks for the rapid and accurate evaluation of the similarity of protein structures. This necessity has led to the development of protein structural comparison and classification methods, which use inherent spatial properties of a polypeptide chain as the criteria for calculating the similarity of 3D protein structures. Most previous methods in this area have demonstrated good performance, but at the cost of high computational time and sometimes with a lack of coherence between the computed results and the known biological correlations, leaving a need for further innovation.
In this thesis, we propose a unique computational paradigm for the feature extraction of protein structure sequences, and then employ a fast orthogonal transformation to compare pairs of large protein structural sequences. We then employ SR-trees, a spatial data structure, to index the protein database and enable a similarity search without an increase in the degree of false dismissals. By employing a pairwise feature representation, we also address the ‘curse of dimensionality’ in these databases, without significantly compromising accuracy.
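As a rough illustration of the transform-then-index idea, the following minimal sketch uses an orthonormal DCT-II as the orthogonal transformation and a linear scan in place of the SR-tree. Both choices are assumptions made for illustration: the abstract does not name the transform, and the function names (`dct_orthonormal`, `feature_vector`, `nearest_neighbor_label`) and the numeric input sequences are hypothetical.

```python
import math


def dct_orthonormal(signal):
    """Orthonormal DCT-II of a real-valued sequence (assumed transform)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        # Scaling that makes the transform matrix orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        coeffs.append(scale * total)
    return coeffs


def feature_vector(signal, keep):
    """Keep only the first `keep` coefficients as a low-dimensional feature."""
    return dct_orthonormal(signal)[:keep]


def nearest_neighbor_label(query, indexed):
    """Return the label of the closest indexed feature.

    `indexed` is a list of (feature, label) pairs. A linear scan stands in
    for the SR-tree lookup here.
    """
    _, label = min(indexed, key=lambda item: math.dist(query, item[0]))
    return label
```

Because the transform is orthonormal, it preserves Euclidean distance, so the distance between two truncated feature vectors never exceeds the distance between the full sequences. That lower-bound property is the usual reason a search in the reduced space can avoid false dismissals; whether the thesis relies on exactly this argument is not stated in the abstract.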
Our computational framework is applied to three different datasets for performance evaluation. The first dataset consists of five randomly chosen families, with 16 proteins in each family; the second consists of proteins from five selected families that are frequently employed in the literature for similarity calibration in structural datasets. A confusion matrix is obtained to identify the families that are commonly confused with each other, using their degrees of misclassification. Finally, to show that our method can work on large databases with much reduced computational time and without compromising accuracy, we employ a database consisting of 183 classes with 10 proteins each. Feature extraction is performed on this database, and the feature space is then indexed using an SR-tree. Nearest neighbor classification is performed to assign proteins to their respective families. Various validation procedures, including ROC plots, confusion matrices, kappa coefficients, degree of false dismissals, and degree of false alarms, are employed to demonstrate the accuracy of the proposed data mining framework.
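Two of the validation measures named above, the confusion matrix and the kappa coefficient, can be sketched as follows. This is a minimal sketch that assumes Cohen's kappa is the coefficient meant; the function names and the family labels in the usage example are hypothetical and not taken from the thesis.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true families, columns are predicted families."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def cohen_kappa(matrix):
    """Cohen's kappa: agreement corrected for chance agreement."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement from the row (true) and column (predicted) marginals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four proteins in two families.
matrix = confusion_matrix(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
kappa = cohen_kappa(matrix)
```

Off-diagonal cells of the matrix show which pairs of families are confused with each other, and kappa summarizes overall agreement in one number, with 1 indicating perfect classification.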
