Research Thesis & Practicum
Biomedical Informatics is the science of managing, mining, and interpreting information from data originating in biological and biomedical domains. The problem of heterogeneous data mining deals with the computational challenges of searching multimedia data in a unified computational framework that can answer the similarity queries of data mining accurately and efficiently. Advances in data collection methodologies have generated large data warehouses in an assortment of application domains, including, but not limited to, biomedical and multimedia databases. Heterogeneous data indexing has proven to be a valuable tool for complex data mining in large data domains that are inherently semi-structured. We propose a solution that integrates the feature vectors of image and text by cooperatively representing them in a multidimensional spatial data structure, which has previously exhibited superior search performance in image database domains. We have evaluated the results of content-based similarity queries on the indexing schema independently in the image and textual domains, and have studied the effect of the choice of similarity metric on those queries. We then propose an indexing schema that integrates the feature vectors of text and images to answer integrated queries on the unified heterogeneous data space. An added advantage of the proposed methodology is that a textual feature vector can query a heterogeneous database to retrieve both text and images as query results. This avoids the time wasted in querying each data domain separately and in sequentially scanning the integrated database for similarity results. The proposed methodology is time and space efficient, and is capable of answering complex heterogeneous data mining queries, with sound applications in biomedical and clinical domains.
2. Fast Web Usage Mining for Automatic Web Personalization; MS-CS Thesis; Student: Suyang Zhang (2004)
The wealth of information on the World Wide Web (WWW) has spurred tremendous interest in the areas of knowledge discovery and data mining. Most web structures are large and complicated, so web users often miss the goal of their inquiry and suffer from information overload due to constantly updated and rapidly evolving data spaces. On the other hand, the capacity of an individual to access knowledge and digest information is mostly fixed. To help web users access information efficiently and accurately, it is now necessary to anticipate their needs. Large websites increasingly recommend personalized information to particular users by extracting models of navigational behavior, a practice known as web personalization. Because it aims to offer users a personalized view of web services, web mining has gained great momentum in both research and commercial areas. Web mining, particularly web usage mining, is considered a main component of an efficacious web personalization system.
In this thesis, we describe an automatic and effective personalization system that uses web log files for data preparation and clustering, mines the data using an offline computational methodology, and then extracts a model to provide dynamic, real-time online recommendations. The data-preparation task processes web access log files using heuristic methods. We describe effective data mining techniques based on a distance-based similarity measure and hypergraph-based clustering to obtain a uniform representation for transaction results. The developed recommendation engine computes a recommendation set for the current user session. It then returns the personalized pages, using an embedded Microsoft Component Object Model (COM) component to dynamically calculate the matching score.
3. Efficient and Flexible Update of Association Rules in Growing Databases; MS-CS Thesis; Student: Yifei Long (2004)
Efficient association rule discovery in large databases has attracted a great deal of interest, especially in recent years. Although a significant amount of research effort has been evidenced in the development of novel methodologies for the discovery of association rules from large databases, the importance of updating these rules in growing databases has not been sufficiently discussed in the literature. The problem is complex because database growth can introduce new rules and invalidate some that have been previously discovered. It is not computationally efficient to run the rule discovery algorithm from scratch on the new database for even some minuscule changes in the old database, but a strategy to dynamically update the rule set with database growth is useful in real-time and high-dimensional data mining applications.
In this thesis, we have proposed a novel and efficient algorithm called Dynamic-Update for the incremental update of association rules when new transactions are added to a large database. The proposed algorithm scans the old database exactly once in contrast to some of the previous results in this area. Additionally, the minimum support required for association rule discovery can be flexible. We also propose a new methodology called ExtremePrune to prune the candidate itemsets as early as possible during the course of the incremental rule discovery process. The algorithm is also demonstrated to have significantly improved time performance relative to the previously reported results in this area.
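The abstract does not give the details of Dynamic-Update or ExtremePrune, so the following is only a minimal sketch of the general idea behind incremental updating: keep itemset counts from the old database and fold in counts from the new transactions instead of rescanning everything. The class name `IncrementalSupport`, the restriction to itemsets of one or two items, and the sample transactions are all assumptions made for illustration. The sketch stops at frequent itemsets and does not derive rules or prune candidates.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, max_size=2):
    """Count every itemset of up to max_size items across the transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return counts


class IncrementalSupport:
    """Hypothetical helper: stores counts so the old data is scanned only once."""

    def __init__(self, transactions, max_size=2):
        self.max_size = max_size
        self.total = len(transactions)
        self.counts = count_itemsets(transactions, max_size)

    def add(self, new_transactions):
        # Only the increment is scanned; stored counts already cover the old data.
        self.counts.update(count_itemsets(new_transactions, self.max_size))
        self.total += len(new_transactions)

    def frequent(self, min_support):
        # min_support is a fraction of all transactions and may differ per call.
        return {s for s, c in self.counts.items() if c / self.total >= min_support}


# Made-up transactions for illustration.
old_db = [["bread", "milk"], ["bread", "eggs"], ["milk", "eggs"], ["bread", "milk"]]
tracker = IncrementalSupport(old_db)
before = tracker.frequent(0.5)
tracker.add([["eggs", "jam"], ["eggs", "jam"], ["eggs"], ["eggs", "jam"]])
after = tracker.frequent(0.5)
```

In this toy run the pair `("bread", "milk")` is frequent before the update and no longer frequent afterwards, which shows how database growth can invalidate previously discovered patterns. Storing every count is affordable only at this scale; a practical method has to prune candidates early.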
4. A Visual Data Mining Framework for Similarity Search in Large Sequential Databases; MS-CS Practicum; Student: Sunil Gokak (2004)
Time-dependent signals occur very commonly in our day-to-day lives and surroundings. Such signals may represent acoustic information, stock market data, or biological and clinical data sets. Although sampling and harmonic analysis can enable efficient signal analysis in the frequency domain, similarity search for data mining in signal analysis can address issues like the prediction of values, the classification of items, piecewise correlation estimation, and the unsupervised clustering of time-dependent data sets. Efficient indexing techniques and algorithms are developed in this domain to address the curse of dimensionality evident in this data. This work has built a value-added Visual Data Mining Framework that enables users, through a web-based interface, to analyze time series data by conveniently interacting with efficient data mining algorithms. The application primarily aims to address descriptive data mining, which is useful for understanding the mechanics governing long-term and short-term fluctuations in large time-series data. An efficient web crawler is also designed to access online time-series data through an interactive, user-controlled graphical interface. Additionally, the application demonstrates the fusion of Java technology with Matlab for real-time data interoperability between the two programming tools. The application can assist a non-expert in data mining in employing efficient similarity search algorithms for applications in areas including inventory planning and material management, sales forecasting, demand forecasting, market research and business conditions, biomedical signal analysis and classification, protein data mining, and functional classification of genes.
5. Computational Identification of Tumor Gene Markers Using Novel Dimensionality Reduction and Unsupervised Classification Techniques; MS-CS Thesis; Student: Kaustubh Sabnis (2004)
The successful treatment of cancer depends on how early and how accurately it is detected. Molecular diagnosis has great potential to be more precise than clinical diagnosis. Biologists do not have complete information about the molecular markers that are responsible for causing most solid tumors. Because of the very large number of genes present in Homo sapiens, it is practically impossible to find the genes responsible for each cancer class by carrying out experiments in a wet laboratory. We propose to find highly informative genes responsible for multiple cancer types, using discrete wavelet transformations as a dimensionality reduction technique.
Discrete wavelet transformations are applied to preprocessed GCM cancer gene expression data, giving orthonormal wavelet coefficients for each sample. These coefficients are passed through two filters: the first selects the coefficients with the highest energy levels, and the second chooses a common set of coefficients among samples belonging to the same cancer class. The inverse DWT is applied to these filtered coefficients, yielding marker genes per cancer class. We cross-validate these results against a biologically significant database of cancer genes. A total of 21 genes spanning 7 cancer classes are found in common with the cancer gene database. With the exception of two cancer classes (breast and bladder), we identify 41.67% more cancer-causing genes than the method used by Ramaswamy et al.
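The transform, energy filter, and inverse transform described above can be sketched as follows. The abstract does not name the wavelet family, so the Haar wavelet is an assumption here, as are the function names and the made-up expression values. Only the first (top-energy) filter is shown; the second, class-level filter and the mapping from reconstructed signals to marker genes are not reproduced.

```python
import math


def haar_forward(signal):
    """Full orthonormal Haar DWT; len(signal) must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail  # approximations first, then details
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward by rebuilding each level from coarse to fine."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        out[:2 * n] = merged
        n *= 2
    return out


def keep_top_energy(coeffs, k):
    """Zero all but the k coefficients with the largest squared magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: coeffs[i] ** 2, reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


expression = [2.0, 2.1, 8.0, 7.9, 1.0, 1.2, 0.9, 1.1]  # made-up sample values
coeffs = haar_forward(expression)
filtered = keep_top_energy(coeffs, 3)
reconstructed = haar_inverse(filtered)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared input values, which is what makes ranking coefficients by energy meaningful.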
6. Relevant Feature Extraction Using Gene Ontology for Cancer Classification; MS-CS Thesis; Student: Vijay Raj Kukkala (2005)
Microarray gene expression data analysis has been one of the major research areas in the field of Bioinformatics. Among the several uses of microarray data, the functional categorization of genes and the classification of cancer samples into classes for early diagnosis have been of much interest. In our thesis, we have addressed the latter issue. Many methods like hierarchical clustering, self organizing maps, and support vector machines have been published in literature, which try to identify genes useful for classification using mathematical models. However, not many methods delve into categorization through functional aspects. The motivation behind our work is to explore and use the functional relationship in addition to the mathematical relationship of the genes, in classifying disease specific samples.
We propose to find functional relationships between genes using the Gene Ontology hierarchy. After preprocessing the expression data, we use the Apriori algorithm to find association rules that reflect all possible gene-pair relationships under a chosen criterion. The resulting set of gene pairs is examined for functional similarity using the Gene Ontology hierarchy. The functionally similar genes obtained in this way are then used in the unsupervised classification of the samples. Using our relevant feature extraction method, we have been able to classify more than 82% of the samples accurately.
7. Fast Protein Structure Classification Using Spatial Aggregation of Orthonormal Coefficients; MS-CS Thesis; Student: Ravi Kanth Meka (2005)
The drastic increase in the size of protein sequence and structure databases has necessitated the design and development of novel computational frameworks for the rapid and accurate evaluation of the similarity of protein structures. This necessity has led to the development of protein structural comparison and classification methods, which use inherent spatial properties of a polypeptide chain as the criteria for similarity calculation of 3D protein structures. Most of the previous methods in this area have demonstrated performance, but at the cost of high computational time and sometimes with lack of coherence between computationally calibrated results and the known biological correlations, hence leaving a need for further innovation.
In this thesis, we propose a unique computational paradigm for the feature extraction of protein structure sequences, and then employ fast orthogonal transformation for the comparison analysis of pairs of large protein structural sequences. We then employ SR-trees, a spatial data structure, to index the protein database to enable a similitude search without an increase in the degree of false dismissals. By employing a pairwise feature representation, we also address the ‘curse of dimensionality’ in these databases, without significantly compromising accuracy.
Our computational framework is applied to three different datasets for performance evaluation. The first dataset comprises five randomly chosen families with 16 proteins in each family, and the second comprises proteins from five selected families that are frequently employed in the literature for similarity calibration in structural datasets. The confusion matrix is obtained to identify the families that are commonly confused with each other, using their degrees of misclassification. Finally, to show that our method can work on large databases with much reduced computational time and without compromising accuracy, we employed a database consisting of 183 classes with 10 proteins each. Feature extraction is performed on this database, and the feature space is then indexed using an SR-tree. Nearest neighbor classification is performed to classify proteins into their respective families. Various validation procedures, including ROC plots, confusion matrices, kappa coefficients, degree of false dismissals, and degree of false alarms, are employed to demonstrate the accuracy of the proposed data mining framework.
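Two of the validation measures named above, percentage accuracy and the kappa coefficient, can both be read off a confusion matrix. The sketch below uses the standard Cohen's kappa definition. The function name and the three-family matrix are made up for illustration and are not results from the thesis.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of items of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    # Undefined when expected == 1 (all items in a single class); not handled here.
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical matrix for three families of 16 proteins each.
confusion = [
    [14, 2, 0],
    [1, 15, 0],
    [0, 3, 13],
]
accuracy, kappa = accuracy_and_kappa(confusion)
```

For this matrix the accuracy is 0.875 and kappa is approximately 0.8125. Kappa is lower because it discounts the one-third agreement expected by chance among three equally sized classes.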
8. Dihedral Angle Based Dimensionality Reduction and Accurate Classification of Protein Structures; MS-CS Thesis; Student: Naveen Kandiraju (2005)
The objective of this study is to develop a unique data mining framework for the structural comparison and classification of proteins, by representing each of them in terms of a pair of secondary structure geometric descriptor distributions. In this thesis, we propose novel geometric parameters based on a comparison protocol that uses a previously unexplored pair of dihedral angles for the similarity search. The similarity analysis is performed on the pairwise dihedral distribution representation of the protein structure. As part of the similarity calibration procedure, a frequency transformation is applied to the two-dimensional distribution for feature extraction and selective feature filtering, and the result is represented in an indexing schema later used for similarity calibration. The proposed similarity measure captures the structural similarity among proteins with low sequence identity, and its ability to classify proteins is evaluated by conducting experiments across four datasets of varied sizes, containing protein structures from different families randomly selected from the Alpha, Beta, Alpha and Beta (alpha/beta), and Multi-domain (alpha and beta) protein classes. The results demonstrate the success of this dimensionality reduction based similarity measure in performing a rapid and length-independent similarity analysis of the protein structures.
9. Web-Based Online Appointment Manager with Data Mining Capabilities; MS-CS Practicum; Student: Venkat Praveen Medikonda (2005)
Precise time management and scheduling of appointments are two major challenges faced today. This practicum involves the development of an intelligent appointment manager tool that allows the creation of multi-user appointments and calendar events, and that improves scheduling by suggesting user availability patterns. The scheduling tool sends instant email notifications for all new appointments. The web-based online appointment manager empowers professors and students to use time efficiently by letting them manage and schedule appointments with great ease. Users can access and self-schedule appointments at any time from any location. The appointment manager uses data mining techniques to personalize appointments by finding frequent patterns, which help the user make precise appointments.
10. Enhancement of Instructional Technology by Using Feedback Support System for Access Grid Framework; MS-CS Thesis; Student: Shraddha Pathak (2005)
The objective of this study is to develop a unique feedback support system for the enhancement of instructional technology over the access grid computational framework. The feedback system can add functionality to the existing collaborative research and conferencing environment that uses an access grid network. In this thesis, we propose a novel image feature based data mining approach for autonomously identifying a student, determining the attention level of that student, and discovering association relationships between the student’s attention levels and the instructor’s positional behaviors. The framework is composed mainly of training and testing phases, which together yield the attention level of the student and the relation between the student’s and the instructor’s behavior over a period of time. Students are registered in the training phase, and their feature vectors are discovered and stored. The feature vectors are employed for student identification in the testing phase with a very low degree of false alarms and false dismissals. To determine the attention level of a student, a video of the student’s behavior (taken at 30 frames/sec) is used, and a classifier is built using the time-varying feature information. Based on the attention levels, each student is reported as attentive or non-attentive. The training phase for describing behavior between student and instructor is then performed, which results in the classifier definition for the instructor’s positional behavior. Finally, association rule discovery is performed between the student’s attention class and the instructor’s positional behavior class. This framework addresses the difficulty of obtaining online feedback from the audience in a collaborative research and conferencing environment, adding value to the access grid research project and its deployment in the collaborative education and research community.
11. Discovery of Active Metabolic Paths Using Association Rules; MS-CS Practicum; Student: Sree Harsha Pothireddy (2006)
Due to the many high-throughput experiments currently conducted in molecular biology and enabled by high computational abilities, a wide variety of gene-expression and proteomic data is available for research and analysis. Analysis of multiple data types for a single purpose is an effective methodology for novel biological discovery. While the challenge of comparison and integration is amplified by the complexity and heterogeneity of these data sources, the flexibility of data mining techniques offers promise for rapid analysis of such disparate data. The determination of active metabolic paths from a large network of paths is one significant area of such research. Our work proposes a novel data mining methodology designed to determine active and biologically significant metabolic pathways, and patterns within a given metabolic pathway, by combining pathway information and gene-expression data using association rule theory. In most of the publicly available metabolic pathway databases, the metabolic processes are represented diagrammatically; however, most of these diagrams do not indicate whether or not the presented paths are biologically active. By combining gene-expression data and metabolic pathway information, we are able to confirm long-range, biologically significant findings in various well-known metabolic pathways: the Pentose-Phosphate, Oxidative-Phosphorylation, Riboflavin Metabolism, and Purine Metabolism pathways.
We compare our methodology with two major works previously performed in the field of determining active pathways. Because we studied a wider range of paths and also ranked our pathways according to the number of active sub-paths, our results provide greater depth and a more accurate understanding of metabolic pathways. Rather than claiming that an entire pathway is active, we present evidence in this work that only certain sub-paths within it are active. We have confirmed our results against KEGG and other reliable (and federally funded) database sources, further supporting our claim that the pathways we have obtained are biologically significant and active. In short, our method yields long-term, biologically significant facts through the simple and effective implementation of an association rule-based technique that combines disparate data.
12. A Computational Framework for Structural Classification of Proteins Using Orthogonal Transformation and Class-Association Rules; MS-CS Thesis; Student: Praveen C. Kidambi (2006)
Protein structure classification and comparison has become a central area in the field of bioinformatics. The rapid increase in the size of protein databases has prompted the development of rapid, automated methods to classify unknown protein structures. Protein structural databases commonly suffer from the ‘curse of dimensionality,’ necessitating the development of novel dimensionality reduction methods for protein structural information prior to classification. Moreover, the design and development of efficient manual or semi-automated classification techniques have not kept pace with the growth in such databases. In this paper, we propose a novel, automated computational framework for the three-dimensional (3D) structure-based classification of proteins, which applies an orthogonal transformation to geometric shape descriptors derived from protein structures and then employs an association rule-based, supervised clustering approach to classify proteins. This research incorporates two previously proposed structural descriptors, dihedral angle and bond length, to represent the 3D protein structure. The distributions of these descriptors over a sequence are then orthogonally transformed into corresponding signals in the frequency domain using the DCT, followed by selective feature filtering. Associations between the coefficients produced by the DCT process are used to derive classes that represent a particular protein structure. Class-association rule discovery is used to identify such associations in a group of proteins that belong to a structural class. To demonstrate the sensitivity and specificity of the approach, we apply our method to two different datasets. The first, balanced dataset consisted of 400 proteins from 10 families. The 3D protein structure information was extracted from the PDB files, referenced family-wise from the SCOP database. We experimented with 1D and 2D DCT and found that higher classification accuracy (over 85%) was attained for 2D DCT.
In our second experiment, we implemented our framework on a dataset of 600 proteins from 15 folds. Our method demonstrated an overall accuracy of better than 83%. Thus, the proposed novel computational framework demonstrates the applicability of rule discovery-based classification of structural descriptors for protein fold classification with improved sensitivity.
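The transformation of a descriptor distribution into frequency-domain features can be sketched with a one-dimensional orthonormal DCT-II. Several things here are assumptions rather than details from the abstract: the bin count, the made-up dihedral angles, and the choice to keep the lowest-order coefficients as the filtering step (the abstract does not state its filtering criterion). The abstract reports higher accuracy with the 2D DCT, which this 1D sketch does not cover.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalized histogram of values over [lo, hi) with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def dct2_orthonormal(x):
    """Orthonormal DCT-II, computed directly from its definition."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(scale * total)
    return out


# Made-up dihedral angles in degrees for one hypothetical protein.
angles = [-60.0, -65.0, -57.0, -120.0, -135.0, -63.0, 60.0, -58.0]
distribution = histogram(angles, bins=8, lo=-180.0, hi=180.0)
spectrum = dct2_orthonormal(distribution)
features = spectrum[:4]  # assumed filter: keep the four lowest-order coefficients
```

The feature vector has a fixed length regardless of how many residues the protein has, since the histogram normalizes away sequence length before the transform is applied.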
13. Fractal-based Method for Dimensionality Reduction of Gene Expression Data; MS-CS Thesis; Student: Sridhar Reddy Alluri (2006)
Gene expression analysis based on microarray data has been one of the emerging areas of research in the field of bioinformatics. One particular application of microarray data is to uncover the molecular variation among cancers. One feature of microarray data is the relatively small number of samples collected compared to the number of genes per sample. Many dimensionality reduction techniques, like principal component analysis and other regression analysis techniques, have been published. However, these techniques do not take into consideration the data set’s intrinsic distribution characteristics, which, if properly used along with a good clustering technique, could provide promising accuracy and performance. The main idea of our technique is to take the data set’s intrinsic distribution into consideration for dimensionality reduction.
We applied a fractal-based clustering analysis tool to the problem of dimensionality reduction in microarray data. We tried to find a critical-sized subset based on fractal analysis that would preserve the intrinsic dimensionality of the data and could be helpful in revealing biologically important information. Our results showed a 97% dimensionality reduction. Additionally, we were able to calculate the intrinsic dimension of the data and measure the data distribution. Another observable advantage was the characterization of the spread of the data, which can be used to aid different data mining tasks. We checked the accuracy of our method by clustering the original and reduced datasets using hierarchical clustering. The observation revealed that most of the class information was retained in the reduced critical-sized subset. The original dataset provided a clustering accuracy of 82%, while the reduced dataset offered 75% accuracy in retaining the samples in their respective classes. Our framework not only provided us with a good dimensionality reduction technique but could also be useful to biologists in revealing important biological information with the intrinsic fractal dimension.
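The notion of an intrinsic dimension that is lower than the number of attributes can be illustrated with a box-counting estimate. The abstract does not say which fractal measure the thesis uses, so box counting is an assumption here, and the function name and synthetic points are for illustration only. The estimate is the least-squares slope of the log of occupied boxes against the log of boxes per axis.

```python
import math


def box_counting_dimension(points, levels):
    """Estimate dimension of points in [0, 1)^d from grids of 2**level boxes per axis."""
    xs, ys = [], []
    for level in levels:
        cells = 2 ** level
        # Each point falls into exactly one box; count the distinct boxes hit.
        occupied = {tuple(int(c * cells) for c in p) for p in points}
        xs.append(math.log(cells))
        ys.append(math.log(len(occupied)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    return slope


# Synthetic data: three attributes that are perfectly correlated (a line in 3D).
line = [(i / 1024, i / 1024, i / 1024) for i in range(1024)]
dimension = box_counting_dimension(line, levels=range(1, 7))
```

Although each point has three attributes, the estimate for this data is 1, which suggests that a single attribute carries all of the variation. Real expression data is noisy, and the estimate depends on the range of grid sizes chosen.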
14. A Computational Framework for Autonomous Comparison of Protein Classification Schemas; MS-CS Practicum; Student: Sireesha Krishna Guntaka (2006)
The completed research assesses Orthoprot and Dihedprot, two separately proposed, novel protein structural comparison and classification techniques. A computational framework has been developed to compare the performance of these two techniques against one another, and also against the Pride2 classification method. The objective of creating this computational framework (the protein mining engine) is to allow proteomic researchers to compare and analyze the strength of the various protein classification techniques currently employed, and it will be developed further in the near future. To achieve this, the three classification techniques (Orthoprot, Dihedprot, and Pride2) have been ported to the World Wide Web using the Matlab webserver.
The Orthoprot classifier uses the secondary geometric descriptors of the phi dihedral angle and bond distance to represent a protein’s structure. The Dihedprot classifier employs two dihedral angles to represent a protein’s secondary structure. While the Orthoprot classifier performs a wavelet analysis using only a specified number of coefficients to represent each protein, a two dimensional Fast Fourier Transform is employed by the Dihedprot classifier to represent each protein using a specified number of coefficients. The first test of the protein mining engine used a 45 protein dataset to compare the strength of the Orthoprot and Dihedprot classifiers against that of Pride2. First, the classification results of Orthoprot and Dihedprot classifiers and the distance matrix of Pride2 were obtained from the computational framework. Then the standalone program was used to compute the dendrogram, percentage accuracy, and the kappa statistic of the Pride2 distance matrix. The second test was performed using an 80 protein dataset in order to compare the performance of the Orthoprot and Dihedprot classifiers to each other. To accomplish this, a dendrogram, a confusion matrix, the percentage accuracy, the kappa statistic, the false alarm, and the ROC plots for both classifiers were analyzed. In summary, the protein mining engine serves as a tool to compare and analyze the results of various protein structural comparison and classification techniques with those of the Orthoprot and the Dihedprot classifiers.
15. Optimized Greedy Algorithm Based Sensor Placement for Distributed Sensor Network; MS-CS Thesis; Student: Ankur Rajopadhye (2006)
Efficient sensor deployment is a critical issue which directly influences the cost and quality of any sensor network. Challenges in efficient sensor placement include power efficiency, maximum network life expectancy, pervasive coverage, connectivity optimization, and, taking all of these factors into account, cost optimization. In this work, a sensor placement algorithm for optimizing the number and locations of sensors to completely cover a sensor field is proposed. The framework assumes a probabilistic sensor detection model, considering the inherent uncertainty involved in sensor detections. Preferential coverage of vulnerable regions in the sensor field is desired for certain mission critical applications like battlefield surveillance. Such preferential regions are modeled as multiple arbitrary shaped regions in the sensor field. Obstacles like buildings, trees, and hills which can be present anywhere in the sensor field, including preferential regions, are modeled as multiple irregular shaped regions. Preferential coverage of n regions in the sensor field with an n-tier probability criterion proves the effectiveness of the proposed algorithm.
An Optimized Greedy Algorithm for efficient deployment of sensors in a sensor field containing multiple obstacles and preferential regions is proposed. The approach is based on placing sensors at pre-calculated distances in horizontal and vertical directions if the sensor location does not fall within preferential regions. The greedy algorithm with an optimization phase (pruning phase) is then used to find the best sensor locations to cover the leftover uncovered region.
The results prove that the proposed algorithm uses fewer sensors than MAX_AVG_COV, MAX_MIN_COV [13], and random placement algorithms.
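The greedy stage can be illustrated with a generic sketch: repeatedly add the candidate sensor that brings the most still-uncovered grid points up to the required joint detection probability. This is not the thesis's Optimized Greedy Algorithm. The exponential detection model, the parameter values, and the grid with obstacle cells are assumptions, and the pre-calculated spacing stage, the pruning phase, the n-tier preferential thresholds, and line-of-sight blocking are all omitted.

```python
import math


def detection(sensor, point, alpha=0.2):
    """Assumed model: detection probability decays exponentially with distance."""
    return math.exp(-alpha * math.dist(sensor, point))


def greedy_placement(points, candidates, threshold=0.7, alpha=0.2):
    """Add sensors until every point has joint detection probability >= threshold."""
    miss = {p: 1.0 for p in points}  # probability that no placed sensor detects p
    limit = 1.0 - threshold
    sensors = []
    while any(m > limit for m in miss.values()):
        def gain(c):
            # Number of uncovered points that candidate c would newly cover.
            return sum(
                1 for p, m in miss.items()
                if m > limit and m * (1.0 - detection(c, p, alpha)) <= limit
            )
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # no single candidate completes any point; stop rather than loop
        sensors.append(best)
        for p in miss:
            miss[p] *= 1.0 - detection(best, p, alpha)
    return sensors, miss


obstacles = {(2, 2), (2, 3)}  # made-up obstacle cells: no sensor, no coverage needed
grid = [(x, y) for x in range(6) for y in range(6) if (x, y) not in obstacles]
sensors, miss = greedy_placement(grid, grid)
```

Coverage from several sensors is combined as one minus the product of the individual miss probabilities, so a point can be covered jointly by sensors that are each too far away on their own.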
16. Quad-tree Based Approach for Bi-clustering of Gene Expression Data; MS-CS Thesis; Student: Padma P. Korimilli (2006)
Advances in Bioinformatics can be mainly attributed to the ability of microarrays to rapidly and accurately monitor transcriptional behavior over a whole genome under different conditions. Over the years, clustering techniques have played a major role in microarray data analysis by discovering groups of genes that share similar transcriptional behavior over the conditions in microarray experiments. The limitations of results obtained by clustering the genes alone or the samples alone have led to the development of new methods that cluster the genes and the samples simultaneously. This approach, also called biclustering, co-clustering, or two-way clustering, searches for sub-matrices that exhibit high coherence. This thesis presents a unique biclustering schema that identifies highly coherent genes and conditions in microarray images. Quadtree decomposition, together with wavelet transformation analysis, is carried out at various thresholds of multiresolution frequencies to decompose a microarray image. The resultant quadtrees at different thresholds are superimposed to identify the overlapping nodes. These nodes represent the potential biclusters of interest.
In this study, biclusters with high coherency were retrieved. The biclusters were validated and compared with two well-known methods in this field, and a significant number of genes were recovered in the biclusters found.
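Quadtree decomposition at two thresholds can be sketched as follows. A square block is split into four quadrants whenever the spread of its values exceeds the threshold. The homogeneity test (max minus min), the toy matrix, and the reading of "superimposed" as keeping the multi-cell blocks that appear at both thresholds are all assumptions. The wavelet analysis step is omitted.

```python
def quadtree(matrix, row, col, size, threshold, leaves=None):
    """Split a square block until max - min <= threshold; return (row, col, size) leaves."""
    if leaves is None:
        leaves = []
    values = [matrix[r][c] for r in range(row, row + size) for c in range(col, col + size)]
    if size == 1 or max(values) - min(values) <= threshold:
        leaves.append((row, col, size))
        return leaves
    half = size // 2
    for dr in (0, half):
        for dc in (0, half):
            quadtree(matrix, row + dr, col + dc, half, threshold, leaves)
    return leaves


# Toy 4x4 "image"; the side length must be a power of two.
image = [
    [5, 5, 1, 4],
    [5, 5, 3, 2],
    [2, 2, 9, 9],
    [2, 2, 9, 9],
]
fine = set(quadtree(image, 0, 0, 4, threshold=1))
coarse = set(quadtree(image, 0, 0, 4, threshold=3))
# Assumed overlap rule: blocks larger than one cell that are leaves at both thresholds.
stable = {leaf for leaf in fine & coarse if leaf[2] > 1}
```

In this toy example three 2x2 blocks are leaves at both thresholds, while the noisy top-right quadrant is a single leaf only at the looser threshold and so is not retained.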
17. A Framework or Studying the Efficacy of Parameters Accounting towards Solutions in Data Mining; MS-CS Thesis, Student: Manish K. Gupta (2006).
Classification and prediction are the primary goals of data mining. Classification is hampered primarily by missing data, feature redundancy, and high data dimensionality. These problems can cause large inaccuracies and have therefore received considerable attention in the data mining and machine learning communities. In this research, we study two effects on classifier performance: that of removing the Gaussian attributes from the dataset, as a step toward data reduction, and that of Independent Component Analysis (ICA), as a step toward feature extraction.
We run the classification experiments using a set of selected classifiers on datasets chosen from the UCI data archive. First, we run the chosen classification algorithms on the original datasets. We then perform statistical tests to identify the Gaussian attributes and study how removing them, as a step toward data reduction, affects the performance of the classifiers. Next, we perform independent component analysis to extract components that are as “independent” as possible, as a step toward feature extraction, run all the classification algorithms on these independent components, and study the effect on classification performance. Finally, treating the JPSO dataset as a special case, we classify it with all the studied classifiers and examine the effects of removing Gaussian attributes and of applying ICA on its classification models.
We evaluated our framework by running seven different classification algorithms on six datasets chosen from the UCI data archive. Neither removing Gaussian attributes (data reduction) nor performing Independent Component Analysis (feature extraction) boosts the classification performance of the classifiers. Classification accuracy decreases by an average of 3.34% after removing the Gaussian attributes and by an average of 5% after performing ICA, though better accuracies than previous benchmarks can be obtained after successfully reducing the data and extracting features as independently as possible.
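The abstract does not name the statistical test used to flag Gaussian attributes. As a rough illustration only, the sketch below assumes a Jarque-Bera test (skewness and kurtosis based) at the 5% level; the function names, the threshold constant, and the toy columns are hypothetical and are not taken from the thesis. The ICA step is not shown.

```python
from statistics import NormalDist


def jarque_bera(values):
    """Return the Jarque-Bera statistic for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return float("inf")  # constant column: treat as non-Gaussian
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Critical value of the chi-square distribution with 2 degrees of freedom
# at alpha = 0.05 (assumed significance level, not stated in the abstract).
CHI2_2DF_5PCT = 5.991


def split_gaussian_attributes(columns, critical=CHI2_2DF_5PCT):
    """Partition named columns into (gaussian, non_gaussian) name lists."""
    gaussian, non_gaussian = [], []
    for name, values in columns.items():
        target = gaussian if jarque_bera(values) < critical else non_gaussian
        target.append(name)
    return gaussian, non_gaussian


# Toy data: one column built from normal quantiles, one evenly spaced column.
n = 200
columns = {
    "normal_like": [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)],
    "uniform_like": [i / n for i in range(n)],
}
gaussian, non_gaussian = split_gaussian_attributes(columns)
print("Gaussian attributes (candidates for removal):", gaussian)
print("Attributes kept:", non_gaussian)
```

In the framework described above, the attributes flagged as Gaussian would be dropped before rerunning the classifiers, so that accuracy can be compared with and without them.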
18. Protein Structural Classification Using Mining of Frequent Patterns in Concave Protein Surfaces; MS-CS Thesis, Student: Shirin A. Lakhani (2007).
Protein structural classification is an overriding problem in the field of Bioinformatics, and specifically in the in-silico functional annotation of proteins. Classifying proteins based on sequential and structural features using conventional methods is known to be arduous and inaccurate, partly because of the weak representation of the protein subunits that provide its discriminatory behavior. The availability of high-dimensional sequence and structure databases has ignited the demand for computational methods that proficiently evaluate the similarity of protein structures and accurately classify them into their respective classes. In recent years, there has been growing interest in classifying proteins using the surface information of a protein. Protein surface regions, specifically concave surfaces, provide specialized regions of biological activity. Well-formed concave surface regions are therefore examined to identify any similarity relationship that might be directly related to protein function.
In this thesis, we propose a new association-rule-based technique that uses the concave residues and residue parameters of proteins to find the frequent spatial arrangements of residues that are unique to a particular family of proteins. Association rules that satisfy minimum support and minimum confidence constraints are discovered for all classes of proteins for class-level rule discovery and appraisal. Classification Based Association (CBA) rule mining is used to discover frequent patterns that are present on the concave protein surfaces, with the aim of discovering a small set of rules satisfying minimum support and minimum confidence.
Empirically, association rules yielded better results than the other traditional techniques reviewed. We have also discovered and used the item-sets (attribute aggregates of protein surface) or residue parameters that are frequent for a class. Rules that satisfy minimum thresholds are extracted and employed for classification purposes. A query protein is subjected to the same extraction method so that it can be compared with the rules generated during the training phase. The protein is classified into the structural class whose rules best match its features, with enhanced specificity and sensitivity of protein structural classification.
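The rule-mining and classification steps described above can be sketched roughly as follows. This is a simplified, brute-force illustration of class association rules filtered by minimum support and minimum confidence, not the thesis implementation; the item names, class labels, thresholds, and function names are all hypothetical.

```python
from itertools import combinations


def mine_class_rules(records, min_support, min_confidence, max_len=2):
    """Mine rules of the form itemset -> class label.

    records is a list of (set_of_items, class_label) pairs.
    Returns (itemset, label, support, confidence) tuples, best rule first.
    """
    n = len(records)
    items = sorted({item for features, _ in records for item in features})
    rules = []
    for size in range(1, max_len + 1):
        for itemset in combinations(items, size):
            wanted = set(itemset)
            covered = [label for features, label in records if wanted <= features]
            if not covered:
                continue
            for label in sorted(set(covered)):
                hits = covered.count(label)
                support = hits / n                # rule holds in this share of all records
                confidence = hits / len(covered)  # share of matching records with this label
                if support >= min_support and confidence >= min_confidence:
                    rules.append((itemset, label, support, confidence))
    # CBA-style precedence: higher confidence, then higher support, then shorter rule.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def classify(features, rules, default=None):
    """Assign the label of the first (highest-precedence) rule that matches."""
    for itemset, label, _, _ in rules:
        if set(itemset) <= features:
            return label
    return default


# Hypothetical surface descriptors for six training proteins.
training = [
    ({"depth=deep", "res=HIS"}, "family_A"),
    ({"depth=deep", "res=HIS"}, "family_A"),
    ({"depth=deep", "res=GLY"}, "family_A"),
    ({"depth=shallow", "res=GLY"}, "family_B"),
    ({"depth=shallow", "res=HIS"}, "family_B"),
    ({"depth=shallow", "res=GLY"}, "family_B"),
]
rules = mine_class_rules(training, min_support=0.3, min_confidence=0.8)
query_label = classify({"depth=deep", "res=GLY"}, rules)
print(rules)
print(query_label)
```

A production miner would prune candidate itemsets Apriori-style rather than enumerate every combination; the exhaustive loop here is only workable for a handful of items.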
19. A Simplistic Approach to Face Detection over the Access Grid Medium; MS-CS Thesis, Student: Robert W. Clowers (2007).
Access Grid (AG) frameworks are a relatively new medium for enhanced remote instructional delivery, collaborative education, and group learning endeavors. The declining cost of audio/visual and computing equipment, coupled with the increasing level of interest and support for virtual classrooms, has led to an unprecedented growth of AG nodes in academia and beyond. While it would be presumptuous to assume that Access Grids are an adequate replacement for classroom teaching, the advances in the computing technologies deployed, augmented by novel algorithms developed to analyze the AG media, hold promise for an enhanced learning experience and subsequently for further functional utilization of the AG capacity. This thesis has attempted to make a unique contribution in this area of algorithmic, technology-driven enrichment of the instructional experience.
We have presented a unique algorithmic framework to model the problem of online feedback from remote attendees as a challenging image processing and mining problem. A unique face detection algorithm has been proposed with tunable parametric controls to manage the heterogeneity of AG environments. While a typical AG room will have fewer than 30 attendees, we have rigorously tested the face detection accuracy of our algorithm on a variety of complex images, reaching an accuracy of 90% on a complex image comprising 62 faces. Superior accuracy rates have been achieved for the less complex images typically encountered in an AG environment. The processing time grows linearly with the image size. While the seamless integration of digital imaging capabilities in live AG environments is far from complete, we believe that this thorough proof-of-principle study will lay the groundwork for future advances in semi-automated facial recognition and attendance management in AG mediums, leading to more frequent and reliable use of such mediums in the future.
20. Unsupervised Feature Selection Filter Method Based on Information Gain; MS-CS Thesis, Student: Feifei Xu (2007).
Rapidly maturing microarray technologies have allowed scientists to monitor and measure the expression levels of thousands of genes in a single experiment. However, the high dimensionality of microarray data has become a challenge for discrimination analysis. We need to find ways to reduce the dimensionality while keeping the characteristics of the dataset. To this end, different feature selection methods have been developed. Most of those methods, however, can only remove the irrelevant features and cannot remove the redundant features. As a result, the redundant features reduce the accuracy of the prediction.
We propose a novel unsupervised filter method, information gain based measurement (IGM), to select features. Redundancy is reduced in the feature selection process while more information from the original dataset is kept. Improvements are observed when we use the K-means method to test the selected features. We also obtain a high accuracy by using the features selected via our method. Extensive experiments demonstrate the effectiveness of our method compared with existing methods.
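The abstract does not define how IGM is computed, so the following is not the IGM method. It is a generic, hedged sketch of an unsupervised, entropy-based filter that keeps informative features and skips redundant ones, assuming expression values have already been discretized. The greedy strategy, the redundancy threshold, and all names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def select_features(columns, k, max_redundancy=0.9):
    """Greedily keep up to k high-entropy features that are not redundant.

    A candidate is treated as redundant when its mutual information with an
    already selected feature, divided by the smaller of the two entropies,
    exceeds max_redundancy.
    """
    ranked = sorted(columns, key=lambda name: -entropy(columns[name]))
    selected = []
    for name in ranked:
        h = entropy(columns[name])
        if h == 0:
            continue  # a constant feature carries no information
        redundant = any(
            mutual_information(columns[name], columns[s])
            / min(h, entropy(columns[s])) > max_redundancy
            for s in selected
        )
        if not redundant:
            selected.append(name)
        if len(selected) == k:
            break
    return selected


# Hypothetical discretized expression levels for four genes over nine samples.
genes = {
    "g1": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "g2": [1, 2, 0, 1, 2, 0, 1, 2, 0],  # a relabeling of g1, fully redundant
    "g3": [0, 0, 0, 1, 1, 1, 2, 2, 2],  # independent of g1
    "g4": [0, 0, 0, 0, 0, 0, 0, 0, 0],  # constant
}
selected = select_features(genes, k=2)
print(selected)
```

The selected features could then be passed to a clustering method such as K-means to check whether the reduced set preserves the structure of the data, in the spirit of the evaluation described above.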
21. Wavelet Based Approach for Detecting Cognitive States in fMRI Images; MS-Biomedical Engineering, Student: Priti Srinivasan (2008).
Functional Magnetic Resonance Imaging (fMRI) has evolved into a major tool for analyzing brain activity. Over the past few decades, it has been used to detect the cognitive states of human subjects. Our main aim is to develop an algorithm that provides automated assessments of a patient’s cognitive state and provides decision support to clinicians in treatment planning for patients with various brain disorders. The fMRI data is high-dimensional in nature, and hence dimensionality reduction and feature extraction are two important steps in representing cognitive states for decision support analysis. The cognitive states that we are interested in classifying in this paper are ‘a person reading a sentence’ and ‘a person reading a picture.’ In this paper, we describe a unique slice-based approach in which feature extraction is done by converting the data into the frequency domain using the Discrete Wavelet Transform (DWT) to obtain novel features that represent a cognitive state. We believe that the frequency-based approach captures the distinct trend associated with a particular cognitive state. Dimensionality reduction in the frequency domain is done using Principal Component Analysis (PCA). The feature vector thus constructed is tested using different machine learning classifiers. Our results show good performance for multi-subject classification with much reduced dimensionality compared to most of the voxel-based approaches.
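As a minimal sketch of the wavelet feature-extraction step, the code below applies a one-level 2D transform to a single slice and keeps the approximation subband as the feature vector. The abstract does not state which wavelet family or decomposition level was used; the Haar wavelet, the single level, the function names, and the 4x4 toy slice are assumptions. The subsequent PCA step is omitted.

```python
import math


def haar_1d(signal):
    """One-level orthonormal Haar transform of an even-length sequence.

    Returns (approximation, detail) coefficient lists.
    """
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_2d_approx(slice_2d):
    """Approximation (low-low) subband of a one-level 2D Haar transform.

    The 1D transform is applied along the rows, then along the columns.
    Both dimensions of the slice must be even.
    """
    row_pass = [haar_1d(row)[0] for row in slice_2d]
    col_pass = [haar_1d(list(col))[0] for col in zip(*row_pass)]
    return [list(row) for row in zip(*col_pass)]  # transpose back to row-major


def slice_features(slice_2d):
    """Flatten the approximation subband into one feature vector."""
    return [value for row in haar_2d_approx(slice_2d) for value in row]


# Toy 4x4 "slice" of intensities (hypothetical values, not real fMRI data).
toy_slice = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 4.0, 4.0],
    [3.0, 3.0, 4.0, 4.0],
]
features = slice_features(toy_slice)
print(features)
```

Each approximation coefficient summarizes a 2x2 block of the slice, so a single level already reduces the number of values per slice by a factor of four before any PCA is applied.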
22. Gene Ontology Based Gene Expression Mining; MS-CS Practicum, Student: Kameshwari Palepu (2008).
Over recent years, DNA microarrays have become the key tool in functional genomics and have become the advanced standard for understanding the underlying regulatory mechanisms of a cell. In order to extract the genes’ biological information and to gain a better understanding of the dataset for better analysis, it is vital to incorporate external biological information about genes. Hence, we have proposed a framework that performs the gene expression analysis by first incorporating biological knowledge and then clustering the obtained features. The biological knowledge we incorporate is obtained from the Gene Ontology (GO) databases. A considerable amount of research has been done in the area of GO-based gene expression analysis, but most of it shares a common problem: it depends on the gene annotation’s statistics to calculate the similarity of the genes. This method of calculation would result in variation among the biological similarity values between the genes.
The goal of our proposed study is to utilize the GO annotations together with gene expression data in order to measure the functional similarity among the genes. The gene expression data of yeast cells was used in our study. The Optimistic Genealogy Measure was used to measure the similarities between the GO terms, which considers both the genes’ statistics and its demographic location to calculate the genes’ functional similarity. An empirical comparison of our results with another similarity technique proposed by Wang et al. was performed in order to validate our similarity matrix and to demonstrate the superiority of our similarity measure over the existing ones. The results produced by our framework were more accurate, in that our framework grouped a set of genes showing similar biological significance into clusters.
Hence, this approach can lead to more biologically meaningful clusters.