Introduction
The purpose of this page is to give the reader a basic introduction to the fields that surround Bioinformatics. A simple definition of Bioinformatics is: any field in which biological data is stored or manipulated using computers. However, the rapid growth in technology and science over the last few years has increased the complexity of the relationship between computer science and biological research (particularly in the field of genetics). By breaking the focus of Bioinformatics into three relatively specialized fields of study (Bioinformatics, Data Mining, and Biomedical Informatics), we elucidate the wide range of applications and purposes that computational methods serve in a biological world. This introduction will also answer a few of the most common questions we receive about our research here at DMRL, which is involved in all three fields.

Welcome to Our Highly-Dimensional World
Over the past decade, clinical and research-oriented medical disciplines have become increasingly data-intensive. Advances in automated data collection technologies in these domains have led to an unprecedented growth in the size, complexity, and quantity of collected data, a large proportion of which is currently inaccessible for analysis by computational scientists. The information generated from the human genome project alone is approaching a petabyte, an amount of raw data approximately equal to the information contained in 102 Libraries of Congress [1]. Furthermore, in direct health-care applications the computerized medical record is on the verge of becoming practical, and soon after will be a business necessity for health-care providers, further accelerating data growth. This growth will renew demand for novel technologies that organize and mine data to enhance computing and biomedical research. The data that results from such biological and clinical endeavors is both heterogeneous and highly dimensional.
By heterogeneous, we mean that the data may be in such diverse formats as text documents, confocal microscope images, fluorescing arrays, or complex tree-like data structures. What’s more, any one of these media may have several important dimensions by which it can be measured and classified. For instance, a picture of a person’s face may be classified according to color, texture, or geometric or statistical characteristics, or better yet, by a single complex descriptor that includes all of these features. Needless to say, the combination of heterogeneity and dimensionality in biological and clinical data poses formidable challenges in the areas of data acquisition, storage, organization, analysis, and visualization. The fields of Bioinformatics, Data Mining, and Biomedical Informatics seek to address these overlapping issues of organization, interpretation, prediction, and presentation of biological data.

Bio-“What???”
Organize - Interpret - Predict - Present
As mentioned above, the sheer volume of data currently generated makes manual interpretation impossible. When faced with this influx of data, one of the first questions to answer is: where do we put it? Storing such vital data is not a mere matter of throwing it in some data dungeon to languish; it requires skilled management with an eye toward reliable and efficient retrieval and use. Originally, Bioinformatics referred broadly to any use of computers to handle biologically-derived data. However, due to the escalating size and complexity of the issues surrounding computational biology today, Bioinformatics is increasingly considered the field of research that focuses on content (data recording, annotation, storage, analysis, searching/retrieval). Thus, the primary focus of Bioinformatics is on database-like activities involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time (e.g., database creation, data management, data warehousing, etc.). To perform this function efficiently, dimensionality reduction must often be applied to simplify the representation of the data.
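As a rough illustration of the dimensionality reduction mentioned above, the Python sketch below uses principal component analysis (PCA), one common technique among many; this page does not say which methods are used at DMRL. The function names and the toy measurements are hypothetical, and the power-iteration approach is chosen only because it fits in a few lines of standard-library code.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (mean, unit direction) of the top principal component of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector converges to the dominant eigenvector.
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, direction):
    """Reduce each row to a single coordinate along the given direction."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]


# Hypothetical 3-dimensional measurements that vary mostly along one line.
samples = [
    [1.0, 2.0, 0.1],
    [2.0, 4.1, 0.0],
    [3.0, 5.9, 0.1],
    [4.0, 8.0, 0.0],
    [5.0, 10.1, 0.1],
]
mean, direction = first_principal_component(samples)
reduced = project(samples, mean, direction)  # one number per sample
```

Here each three-value record is replaced by one coordinate that preserves most of the variation in the data, which is the kind of simplified representation that makes storage and retrieval cheaper.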
Data Mining
Organize - Interpret - Predict - Present
While Bioinformatics is a generally retrospective practice that focuses primarily on content, Data Mining is a prospective science that focuses on the discovery of previously unknown relationships among existing data. In order to predict future trends and behaviors, allowing proactive, knowledge-driven decisions, computer scientists seek to extract patterns from the data using techniques such as classification, regression, link analysis, segmentation, and deviation detection. Think of it as the practice of unearthing valuable material from data warehouses that are data rich, but information poor, just as we mine the earth to recover precious metals. This process of seeking new patterns, associations, and correlations is also called “knowledge discovery.” Data Mining concepts can also be applied to images (this is referred to as image mining). In image mining, algorithms are designed to automatically and autonomously locate and/or isolate characteristic features in stored images. This type of mining can be helpful for identifying molecular structures in microscope images, detecting and recognizing faces in photos, or for real-time tracking applications. In the final analysis, the overarching goal of data mining is to educate ourselves about the nature of the world around us in order to better equip us for future action and research.
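To make one of the techniques listed above concrete, here is a minimal Python sketch of classification using a k-nearest-neighbour vote. This is an assumption made for brevity, not a description of the algorithms used at DMRL; the function name, feature values, and labels are all invented for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs, where features is a tuple of floats.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented two-feature measurements with known class labels.
training_data = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.2), "B"),
    ((5.1, 4.9), "B"),
    ((4.8, 5.0), "B"),
]

predicted = knn_classify(training_data, (1.1, 0.9))  # -> "A"
```

The same pattern, learning from labelled records and predicting the label of a new one, underlies far more sophisticated classifiers.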
Biomedical Informatics
organize - interpret - Predict - Present
After data has been carefully collected, organized, and managed (Bioinformatics), and after trends, correlations, and associations have been discovered (Data Mining), the resulting information and knowledge must be put to use for any benefit to be gleaned from the harvest. Biomedical Informatics focuses on the integration of data sources and the development of tools that can make use of data to aid the clinical decision-making process. These tools include the structures and algorithms necessary to improve the communication, understanding, and management of medical information. The largest factor that separates Biomedical Informatics from other disciplines is its focus on manipulating and presenting information in an enabling format, rather than strictly on characterizing or mining content [2].
Convergence – A United Effort
Organize - Interpret - Predict - Present
For the purpose of clarity, the scope of the above descriptions has been limited and clearly defined. However, in real-world applications, these disciplines overlap and collaborate to work as a united front in problem solving and knowledge discovery. The following example, borrowed from Dr. Maynard Olson (Department of Molecular Biology, University of Washington), provides insight into both the delineations and the unity of the effort to increase our knowledge:
“Suppose an alien from another galaxy arrived on earth and encountered a computer. In an attempt to understand what the computer was and how it functioned, the alien might measure its size, weight and color. He might take it apart, note how the components were connected together, and perform lab tests to determine its chemical composition, discovering, for example, that silicon was a major constituent of certain tiny chip-like components. Perhaps he would discover that the computer was powered by electricity, and would identify a particular component as a power supply. But all these observations would be somewhat irrelevant, as they would not help the alien understand the purpose of the computer, or how it accomplished that purpose. If the alien somehow realized that the purpose of the computer was to process information, his attention would immediately focus on the really interesting questions about the computer: how does it represent, store, process and transmit information?
“In the 1950s and 1960s the basic mechanisms by which living cells process information were discovered. From that time forward the fields of molecular biology, cell biology and genetics have focused on the way cells represent and transform information, and how this information bears on heredity and on the chemical processes of the cell.” [3]
A gene is a sequence of nucleotides within a strand of DNA that stores information about heredity. Genes also carry the instructions that cells throughout the body need to function properly. For instance, genes code for the proteins that are essential to every aspect of our physiology. The nucleotide sequence of the gene itself holds the key to this “code.” If the code of a gene can be captured, it will help to reveal the structure of the protein it will build. The structure, in turn, is a major factor in defining the protein’s function. Ultimately, if the parent gene of each protein can be discerned, we may be able to treat diseases and improve human health at the molecular level. The disciplines involved in this process can be compared to the investigations of Dr. Olson’s alien.
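The idea that a gene's sequence is a "code" for a protein can be sketched in a few lines of Python. The example below is a simplification under stated assumptions: it reads a DNA coding sequence three letters at a time from the first position, uses only a small subset of the standard genetic code, and ignores real-world details such as finding the reading frame. The function name and the sample sequence are hypothetical.

```python
# A small subset of the standard genetic code (DNA codon -> one-letter
# amino acid). "*" marks a stop codon. The full table has 64 entries.
CODON_TABLE = {
    "ATG": "M",                                      # methionine (start)
    "TTT": "F", "TTC": "F",                          # phenylalanine
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "GAA": "E", "GAG": "E",                          # glutamate
    "TGG": "W",                                      # tryptophan
    "TAA": "*", "TAG": "*", "TGA": "*",              # stop
}


def translate(dna):
    """Translate a DNA coding sequence into a protein string.

    Codons missing from the subset above are reported as "X".
    Translation stops at the first stop codon.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        amino_acid = CODON_TABLE.get(codon, "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up sequence: ATG GCC AAA TGG TAA
print(translate("ATGGCCAAATGGTAA"))  # prints "MAKW"
```

Going from the resulting amino acid sequence to the protein's three-dimensional structure and function is a much harder problem, and it is where the data mining methods described above come in.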
After the computer had been slowly and carefully broken down into its smallest components, the alien would catalogue the make-up of each item and store his findings in an accessible manner. This work is similar to the role of the Bioinformatician, who receives all of the information concerning the discovered gene sequences and organizes, stores, and manages the data so that it can be easily retrieved and accessed. When the alien begins to sift through his data and notices constructions similar to items he has seen before, which allow him to make educated, data- and knowledge-driven proposals about their functions, he is performing the essential role of the Data Miner. In the case of genomic research, Data Mining employs numerical methods to find correlations and associations between newly discovered gene sequences and similar genes whose function is already known. Finally, the alien will be able to present his findings in a fashion that might allow a new, functioning computer to be built, or deficiencies in a broken computer to be located and fixed. The Biomedical Informaticist fills a similar role when he or she manipulates genetic information so that suggested gene functions might be tested in a wet-lab, or when mined information is presented to a clinician in a decision-enabling format. Overall, these disciplines form a collaborative effort to master biological information, including the discovery of underlying rules, relationships, and meanings, through the use of inquisitive human intelligence and intuition, leveraged by computer-based tools [4]. The integration of this discovered knowledge for successful modeling and simulation from genome to cell, and ultimately to entire organisms, is an exciting and promising field of research.

Our Research at DMRL
Here at Louisiana Tech University, Dr. Sumeet Dua and his team of graduate and master’s students are involved in every aspect of this research, from efficient data storage solutions to penetrating data mining discoveries and decision-enabling results presentation. Take a look at our other pages, particularly our Research Projects page, to read more about some of the ground-breaking work currently under way. Our key research areas include genetics (sequencing, gene expression, etc.), protein structural mining, biomedical image mining, and data integration.

DMRL and You
Does this research appeal to you? Do you desire to be on the cusp of a burgeoning field? If so, you may have questions. What do you need to enter the fields surrounding Bioinformatics? Two of the most common questions that we encounter are: “How much biology is involved in Bioinformatics?” and “What computing background is necessary to take part in Bioinformatics research?”

To answer the first question: though these fields focus on handling biological data, very little knowledge of biology is actually necessary to perform the research. These fields are primarily concerned with computer science. However, any domain knowledge will help to provide insight into the trends and correlations you might discover, and will also help in the construction of novel, effective data mining algorithms.

Computers and programming, on the other hand, are used extensively in this research. Since most of the work involves the design of algorithms, at least a basic knowledge of programming is required. Specifically, the most common languages currently in use are MATLAB, Perl, and R. Since many languages are essentially similar, knowledge of any one language will make learning new languages quicker and easier. The most important requirement of this research, however, is a willingness to learn, to actively apply your knowledge, and to work hard.
One of the other common questions we receive is “What can I do with this stuff once I graduate?” With an estimated $1.4 billion U.S. market expected to increase to a $3 billion market by 2010 [5], there is plenty of room for the field of bioinformatics to grow. Until recently, most sponsorship for Bioinformatics research came from drug companies hoping to discover new, highly specialized therapeutic drugs. However, the market is expanding to include everything from waste management and agriculture to personalized medicine, veterinary science, and comparative studies [6]. These opportunities include possible employment at research centers and academic institutions, as well as pharmtech and biotech firms. Outside of research, many opportunities are also arising in software engineering, data management, and the sophisticated computer and network systems that support research, development, and production activities. In short, because bioinformatics is a relatively new science, we are only beginning to understand its widespread applications and potential. One thing is certain: society relies more heavily on computers, and therefore on computer scientists, every day. The programming and software skills that you would develop as a Bioinformatician can readily transfer to any number of other markets and endeavors.

Interested?
If you are interested in DMRL, feel free to peruse the other information we have made available on this website. If you have any further questions, or would like to know how you can get involved, please contact us using the information below.