Physico-chemical Property Analysis Tool for Assessment of Protein Domain Conservation

Pradeep Chowriappa¹ and Sumeet Dua¹,²

¹Data Mining Research Laboratory, Louisiana Tech University, Ruston, LA 71270, USA. ²LSU Eye Center, LSU Health Sciences Center, New Orleans, LA 70112-2234, USA.
{pradeep, sdua}@latech.edu

Abstract. Proteins are not rigid bodies; they are flexible and constantly change shape and form to perform their biological roles. While we are intuitively aware of their constantly changing nature, we have little understanding of how the inter-residue interactions are encoded in the protein sequence and, therefore, have little understanding as to how they drive structural flexibility. To address this knowledge gap, we propose a tool to predict and analyze those regions over the primary sequence of the protein that are of functional importance using a myriad of physico-chemical properties. We hypothesize that there is a correlated characteristic among the physico-chemical properties of a protein that can be used to predict functionally significant regions using machine-learning approaches. The proposed tool contributes to our understanding of the sequence and physico-chemical relationship and paves the way for us to identify local sequence property modulations that impact protein function without changing the protein structure.

Keywords: Protein domain, physico-chemical properties, structure prediction.

1 Introduction

Many protein regions (subsequences) are intrinsically conserved across related proteins. These conserved regions are crucial to the function of many proteins, especially those involved in signaling, recognition, and regulation [1]. This study was motivated by several empirical observations relevant to protein structure, function, and evolution, and by previous studies that addressed the relationship between the impairment of protein function and the resulting disease [2], [3]. There is also weak evidence that impairment of protein function and disease sensitivity are correlated with the degree to which a substitution violates the physico-chemical constraints imposed by evolution [2]. The goal of this work is to develop a (cyber)tool that overcomes the challenges inherent in identifying these impairment and disease-sensitivity characteristics, and the functions of conserved regions within a protein, by analyzing the protein across a range of physico-chemical properties.
Our tool-based analysis assumes (a) that evolutionary variation among orthologs at the affected position is a sample of the physico-chemical properties that are tolerated at that position and (b) that correlated mutations of physico-chemical interactions between residues reveal evolutionary residue conservation patterns that reflect conserved structural domains. Using these two assumptions as premises, we develop the tool described in the following sections.

2 Proposed Methodology


Fig 1. The proposed model [1].

2.1 Alignment of Sequences

We first build a multiple alignment of orthologs or closely related paralogs; distant paralogs are excluded to avoid including evolutionary variations that specify functional differences. The evolutionary relationships among the sequences are inferred by standard likelihood analysis [4], which also yields the branch lengths of the tree in substitutions per site. Based on the topology and branch lengths of the tree, a weight is calculated for each sequence. These weights control for phylogenetic correlation among the sequences.
We then multiply the weights by the fraction of sequences carrying each amino acid to obtain the alignment summary (matrix summary), which we interpret using a matrix of physico-chemical property scales. The result is an estimate of the physico-chemical constraints on each position, based on the mean and variance of the property distributions observed in its alignment column. These statistics are biologically meaningful: for a hydrophobicity scale, for example, the mean measures hydrophobic character, while the variance measures the strength of the constraint. The deviation of each variant from its alignment column is obtained by calculating the variant's property difference from the mean and dividing it by the square root of the variance. We interpret this statistic as a signed measure of “constraint deviation”.
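The column statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the function names `column_constraint` and `constraint_deviation` are hypothetical, the sequence weights are made-up numbers standing in for tree-derived weights, and the Kyte-Doolittle hydropathy values serve only as a stand-in for one property scale.

```python
from collections import defaultdict
from math import sqrt


def column_constraint(column, weights, scale):
    """Weighted mean and variance of one property over one alignment column."""
    total = sum(weights)
    # Weighted fraction of sequences carrying each amino acid.
    freq = defaultdict(float)
    for aa, w in zip(column, weights):
        freq[aa] += w / total
    mean = sum(f * scale[aa] for aa, f in freq.items())
    var = sum(f * (scale[aa] - mean) ** 2 for aa, f in freq.items())
    return mean, var


def constraint_deviation(variant, mean, var, scale):
    """Signed deviation of a variant from the column, in standard deviations."""
    return (scale[variant] - mean) / sqrt(var)


# Kyte-Doolittle hydropathy values, used here only as a stand-in scale.
HYDROPATHY = {"L": 3.8, "I": 4.5, "V": 4.2, "D": -3.5}

column = "LLIV"                 # residues observed at one alignment position
weights = [0.4, 0.2, 0.2, 0.2]  # hypothetical tree-derived sequence weights
mean, var = column_constraint(column, weights, HYDROPATHY)
dev_D = constraint_deviation("D", mean, var, HYDROPATHY)
```

In this toy column all observed residues are hydrophobic, so the variance is small and an aspartate variant receives a large negative deviation. A fully conserved column has zero variance, so a practical implementation would need a variance floor or pseudocount before dividing.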
To compute a single score measuring the deviation of constraint across all properties, we first de-correlate the properties using Principal Component Analysis (PCA). This transformation yields a new feature space in which each axis is a principal component, and the distance from the origin to any amino acid is that amino acid's impact score.
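A minimal sketch of this de-correlation step is given below, assuming NumPy is available. The names `pca_axes` and `impact_score` and the two-column toy matrix are hypothetical. The sketch also assumes that each principal axis is rescaled to unit variance (whitening) before the distance is taken; the text does not state this, but without it a pure rotation would leave the distance unchanged.

```python
import numpy as np


def pca_axes(properties):
    """Principal axes and variances of a (samples x properties) matrix."""
    X = np.asarray(properties, dtype=float)
    cov = np.cov(X, rowvar=False)           # np.cov centers the columns itself
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues, orthonormal columns
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    return eigvals[order], eigvecs[:, order]


def impact_score(deviation, eigvals, eigvecs, eps=1e-12):
    """Distance from the origin after rotating a deviation vector onto the
    principal axes and rescaling each axis to unit variance (an assumption)."""
    coords = eigvecs.T @ np.asarray(deviation, dtype=float)
    keep = eigvals > eps                    # drop degenerate axes
    return float(np.linalg.norm(coords[keep] / np.sqrt(eigvals[keep])))


# Toy matrix with two strongly correlated property columns (made-up values).
SCALES = np.array([[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]])
vals, vecs = pca_axes(SCALES)

along = impact_score([1.0, 1.0], vals, vecs)    # follows the shared trend
across = impact_score([1.0, -1.0], vals, vecs)  # violates the shared trend
```

Because the two toy properties vary together, a deviation that breaks their shared trend (`across`) scores higher than one of the same raw magnitude that follows it (`along`). With whitening, the score coincides with the Mahalanobis distance under the property covariance.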
Based on the principles of sequence alignment, the most common starting point for generating a model, we aim to capture those domains that reflect structural conservation between homologous proteins. We hypothesize that physico-chemical information along with sequence information can direct the construction of acceptable models. We also show that regions surrounding insertions and deletions are much less conserved than the core, and we discuss the implications of this observation for modeling APoE proteins.

2.2 Amino Acid Descriptors

We use the quantitative descriptors for the 20 amino acids proposed by Venkatarajan and Braun [5]. Using multi-dimensional scaling, Venkatarajan and Braun summarized information from 237 known physico-chemical properties in the hope of providing useful information for the identification of protein homologues on the basis of property-based motifs. As per [5], the components (scales) E1 to E3 describe the hydrophobicity, size, and helical propensity of a protein sequence, while E4 describes partial specific volumes, the relative abundance of amino acids, and the number of codons. The β-strand-forming propensity is the dominant factor for E5. We propose to use these five components to create a protein map for a given protein.
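One way to picture such a map is as five parallel layers, one per component, each holding that component's value for every residue. The sketch below makes that assumption explicit; `build_protein_map` is a hypothetical name, and the descriptor table contains placeholder numbers, not the published E1 to E5 values, which would have to be taken from [5].

```python
def build_protein_map(sequence, descriptors, n_components=5):
    """Return one layer per component; layer k holds component k for every residue."""
    missing = sorted(set(sequence) - set(descriptors))
    if missing:
        raise KeyError(f"no descriptor for residues: {missing}")
    return [[descriptors[aa][k] for aa in sequence] for k in range(n_components)]


# PLACEHOLDER values for three residues only; these are NOT the values from [5].
PLACEHOLDER = {
    "A": (0.1, 0.2, 0.3, 0.4, 0.5),
    "G": (-0.1, -0.2, -0.3, -0.4, -0.5),
    "L": (1.0, 0.0, -1.0, 0.0, 1.0),
}

protein_map = build_protein_map("GALA", PLACEHOLDER)
first_layer = protein_map[0]  # first component along the sequence
```

Keeping the descriptor table as an argument means the same function works unchanged once the real 20-residue table is supplied.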

2.3 Significance of Conserved Regions

We analyze all conserved residues and compare their structural environment with that of amino acids in the naturally occurring proteins in the dataset, using packing density, hydrogen bonding, and solvent accessibility. The following parameters are determined using the methods described in [7]: a) Packing Density (Ooi Number), b) Hydrogen Bond Information, and c) Solvent Accessibility.
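As an illustration of the first parameter, the Ooi number of a residue is commonly taken as the count of Cα atoms lying within a fixed radius of that residue's Cα atom. The sketch below assumes a radius of 8 Å and excludes the residue itself; both choices, along with the name `ooi_numbers`, are assumptions and may differ from the settings used in [7].

```python
from math import dist


def ooi_numbers(ca_coords, radius=8.0):
    """Count, for each residue, the other C-alpha atoms within `radius` angstroms."""
    return [
        sum(1 for j, q in enumerate(ca_coords) if j != i and dist(p, q) <= radius)
        for i, p in enumerate(ca_coords)
    ]


# Four C-alpha atoms on a straight line, 3.8 angstroms apart (toy coordinates).
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
counts = ooi_numbers(toy_chain)
```

On real structures, buried residues in the protein core have many neighbours and a high count, while exposed surface residues have few. The pairwise loop is quadratic in chain length, which is adequate for single proteins.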

2.4 Creation of Protein Maps

Fig. 2 shows the creation of a protein map using physico-chemical properties as descriptors. For more information about the methodology, refer to [7].


Fig 2. Creation of protein maps.

3 Results

We adopt a hierarchical clustering-based approach to identify clusters of protein map segments that exhibit similar characteristics. The approximation coefficients of each segment serve as time-frequency descriptors for grouping the segments of a protein map layer. We use Euclidean distance to measure the similarity between the approximation coefficients of segments. We rank the silhouette scores of each cluster in the hierarchy and choose the segments that constitute the highest-ranking cluster. Each segment of the fA corresponds to the correlated mutation scores of the sequence windows, so it is simple to trace each segment back to the corresponding regions of the given protein. Fig. 5 provides an overview of the resultant hierarchical clustering of segments and the resulting frequency aggregates of a single layer of the protein map of protein 1AAQ.
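The clustering procedure above can be sketched in plain Python as follows. Several choices here are assumptions that the text does not specify: a one-level Haar transform stands in for the wavelet decomposition, average linkage is used for merging, the hierarchy is cut at a fixed number of clusters `k` instead of ranking every level, and all function names and segment values are hypothetical.

```python
from math import dist, sqrt


def haar_approx(signal, levels=1):
    """Haar approximation coefficients: scaled pairwise sums, repeated `levels` times."""
    coeffs = list(signal)
    for _ in range(levels):
        coeffs = [(coeffs[i] + coeffs[i + 1]) / sqrt(2) for i in range(0, len(coeffs) - 1, 2)]
    return coeffs


def agglomerate(points, k):
    """Average-linkage agglomerative clustering down to k clusters of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(b)  # b > a, so index a is unaffected by the pop
        clusters[a].extend(merged)
    return clusters


def cluster_silhouettes(points, clusters):
    """Mean silhouette score of each cluster (a singleton member scores 0)."""
    scores = []
    for ci, members in enumerate(clusters):
        total = 0.0
        for i in members:
            if len(members) == 1:
                continue
            a = sum(dist(points[i], points[j]) for j in members if j != i) / (len(members) - 1)
            b = min(
                sum(dist(points[i], points[j]) for j in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            total += (b - a) / max(a, b)
        scores.append(total / len(members))
    return scores


# Toy segments of one protein map layer (made-up values).
segments = [
    [1.0, 1.0, 1.0, 1.0],
    [1.2, 1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 5.0],
    [5.2, 4.8, 5.1, 5.0],
    [3.0, 3.0, 2.0, 2.0],
]
features = [haar_approx(s) for s in segments]
clusters = agglomerate(features, k=2)
scores = cluster_silhouettes(features, clusters)
best = sorted(clusters[scores.index(max(scores))])  # indices of the top-ranked cluster
```

The indices in `best` point back to the original segments, which is what allows the selected cluster to be traced back to sequence windows.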


Fig 3. Layer of protein map for protein 1AAQ.


Fig 4. Structure of protein 1AAQ.


Fig 5. Results of analysis of the structural environment of conserved residues.

Download

The P3Maps tutorial and executable file are available for download. Download Tutorial EXE.

References

  1. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195-197 (1981).
  2. Stone, E.A., Sidow, A.: Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome Res. 15, 978-986 (2005).
  3. Lau, A.Y., Chasman, D.I.: Functional classification of proteins and protein variants. PNAS 101, 6576-6581 (2004).
  4. Friedman, N., Ninio, M., Pe'er, I., Pupko, T.: A structural EM algorithm for phylogenetic inference. J. Comput. Biol. 9, 331-353 (2002).
  5. Venkatarajan, M.S., Braun, W.: New Quantitative Descriptors of Amino Acids Based on Multi Dimensional Scaling of a Large Number of Physical-chemical Properties. J. Mol. Model. 7, 445-453 (2001).
  6. Chowriappa, P., Dua, S., Kanno, J., Thompson, H.W.: Protein Structure Classification Based on Conserved Hydrophobic Residues. IEEE/ACM TCBB 99, 5555 (2008).
  7. Dua, S., Chowriappa, P.: Protein Maps: Physico-chemical Properties Integration for Functional Annotation of Proteins. In: 7th Asia-Pacific Bioinformatics Conference (APBC) (2009).
This site is maintained by the Data Mining Research Laboratory. Webmaster: Alan E. Alex & Image Master: Pradeep Chowriappa