Dr. Sumeet Dua

Max P. and Robbie L. Watson Eminent Scholar Chair

Pradeep Chowriappa (2008)

Integrated Mining of Feature Spaces for Bioinformatics Domain Discovery

One of the grand challenges of bioinformatics is the elucidation of protein folding and the functional annotation of proteins. The factors that govern protein folding include the chemical, physical, and environmental conditions of the protein's surroundings, which can be measured and exploited for computational discovery. These conditions enable the protein to transform from a sequence of amino acids into a globular three-dimensional structure. Information about the folded state of a protein has great potential for explaining biochemical pathways and their involvement in disorders and diseases. This in turn affects the characterization and treatment of genetic diseases and the creation of designer drugs.

With the exponential growth of protein databases and the limitations of experimental protein structure determination, sophisticated computational methods have been developed and applied to detect, search for, and compare protein homology. Most computational tools developed for protein structure prediction are based primarily on sequence similarity searches. These approaches have improved prediction accuracy for proteins with high sequence similarity, but they perform poorly on proteins with low sequence similarity. Data mining offers algorithmic approaches that have been widely used in the development of automatic protein structure classification and prediction.

In this dissertation, we present a novel approach for integrating physico-chemical properties with effective feature extraction techniques for the classification of proteins. Our approach overcomes one of the major obstacles to data mining in protein databases: encapsulating different residue hydrophobicity properties into a much-reduced feature space that retains high specificity and sensitivity in protein structure classification. To this end, we propose a computational framework for coherent feature extraction on selected scales of hydrophobicity for a protein sequence. Because proteins differ in length (the problem of unequal cardinality), our proposed integration scheme is designed to handle proteins of varied sizes effectively.
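
As a rough illustration of the idea, and not the dissertation's actual method, the sketch below maps a sequence onto one hydrophobicity scale and reduces it to a fixed-length feature vector so that proteins of different lengths become comparable. The choice of the Kyte-Doolittle scale, the segment-averaging scheme, and the function names `hydrophobicity_profile` and `fixed_length_features` are all assumptions made for this example.

```python
from statistics import mean

# Kyte-Doolittle hydropathy index, used here as one example scale.
# The abstract does not say which scales the dissertation selects.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence, scale=KYTE_DOOLITTLE):
    """Map each residue to its value on the given scale.

    Raises KeyError for non-standard residue codes such as 'X'.
    """
    return [scale[residue] for residue in sequence.upper()]


def fixed_length_features(profile, n_bins=8):
    """Average the profile over n_bins contiguous segments.

    Hypothetical reduction step: every protein, whatever its length,
    ends up with exactly n_bins features.
    """
    n = len(profile)
    if n < n_bins:
        raise ValueError("profile is shorter than the number of bins")
    features = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        features.append(mean(profile[start:end]))
    return features


short_vec = fixed_length_features(hydrophobicity_profile("MKTAYIAKQRQISFVK"), 4)
long_vec = fixed_length_features(hydrophobicity_profile("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"), 4)
```

Both `short_vec` and `long_vec` have four entries even though the input sequences differ in length. Segment averaging is only one way to equalize lengths, and it discards fine positional detail.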

We also detail a two-fold contribution to protein annotation. First, we describe an algorithm that provides a means to integrate multiple physico-chemical properties in the form of a multi-layered abstract feature space, with each layer corresponding to one physico-chemical property. Second, we discuss a wavelet-based segmentation approach that efficiently detects regions of property conservation across all layers of the created feature space.
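
To make the layered feature space and wavelet segmentation more concrete, here is a minimal sketch under strong assumptions. It represents each physico-chemical property as one list of per-residue values (a layer), computes finest-scale undecimated Haar detail coefficients for each layer, and reports index ranges where no layer shows a sharp change. Treating "conservation" as low variation along the sequence, the threshold rule, and the names `haar_details` and `conserved_segments` are assumptions for illustration; the dissertation's wavelet and segmentation criteria are not given in this abstract.

```python
import math


def haar_details(signal):
    """Finest-scale undecimated Haar detail coefficients.

    Coefficient i measures the change between positions i and i + 1.
    """
    return [
        (signal[i] - signal[i + 1]) / math.sqrt(2)
        for i in range(len(signal) - 1)
    ]


def conserved_segments(layers, threshold, min_length=2):
    """Return half-open (start, end) ranges that are smooth in every layer.

    A boundary is placed between positions i and i + 1 whenever any
    layer's detail coefficient there exceeds the threshold in magnitude.
    """
    length = len(layers[0])
    if any(len(layer) != length for layer in layers):
        raise ValueError("all layers must have the same length")
    details = [haar_details(layer) for layer in layers]
    segments, start = [], 0
    for i in range(length - 1):
        if any(abs(d[i]) > threshold for d in details):
            if i + 1 - start >= min_length:
                segments.append((start, i + 1))
            start = i + 1
    if length - start >= min_length:
        segments.append((start, length))
    return segments


# Two hypothetical property layers over the same six residues.
layers = [
    [1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
    [0.0, 0.0, 0.0, 0.0, 2.0, 2.0],
]
regions = conserved_segments(layers, threshold=1.0)
```

A fuller treatment would use several decomposition levels so that gradual changes are also detected; the single-scale version above reacts only to jumps between adjacent residues.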

Finally, we present a graph-theory-based algorithmic framework for identifying conserved hydrophobic residue interaction patterns using the identified scales of hydrophobicity. We report that these discriminatory features, which consist of conserved hydrophobic residues, are specific to a family of proteins and can be used for structural classification. We also present our stringently tested validation schemes, whose accuracy supports the claim that homologous proteins exhibit conservation of physico-chemical properties along the protein backbone. We conclude by summarizing our results and contributions and by outlining directions for future research.