Molecular & Cellular Proteomics 1:763-780, 2002.
© 2002 by The American Society for Biochemistry and Molecular Biology, Inc.
a College of Pharmacy, University of Michigan, Ann Arbor, MI 48109-1065
b U.S. Environmental Protection Agency, Research Triangle Park, NC 27711
c The Rockefeller University, New York, NY 10021
d Keck Graduate Institute of Applied Life Sciences, Claremont, CA 91711
e Department of Molecular and Cellular Biology, University of California-Berkeley, Berkeley, CA 94720-3206
f Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY 10461-1975
g MitoKor, San Diego, CA 92121
h Department of Chemistry and Biochemistry, Rosenstiel Basic Medical Sciences Research Center, Brandeis University, Waltham, MA 02254-9110
i W.M. Keck Institute for Cellular Visualization, Rosenstiel Basic Medical Sciences Research Center, Brandeis University, Waltham, MA 02254-9110
j Howard Hughes Medical Institute, Chevy Chase, MD 20815-6789
k Program Officer, Board on International Scientific Organizations, Policy and Global Affairs, The National Academies, Washington, D.C. 20418
| ABSTRACT |
|---|
The planning committee selected speakers (see Table I) and designed the symposium in the hope that one of the outcomes of the meeting would be helping to set the field on as wise a path as possible for the future. After the presentations attendees were involved in individual breakout sessions on a variety of topics, including
| PROTEOMICS |
|---|
Historically one can point back to meetings and articles from over 20 years ago, when scientists began to think about mapping the entire set of human proteins (see, for example, B. F. C. Clark, "Towards a Total Human Protein Map" (1)). Indeed, Congress was considering a project called the "Human Protein Index" long before the Human Genome Project had been conceived. The Human Protein Index project was developed in the late 1970s by Norman G. Anderson and N. Leigh Anderson at the Department of Energy's Argonne National Laboratory (2). Its objective was to enumerate the human proteins (what would now be called the human proteome) by separation on 2-D gels and thus define their genes from the protein end, the only approach available in those days before large-scale DNA sequencing was possible. But this effort was perhaps ahead of its time given the lack of suitable technologies and shifting political sands. Instead, the rise of genomics took center stage. An Australian postdoctoral student, Marc Wilkins, is often credited with coining the term "proteomics" in 1994 (3) at a time when only one proteomics company existed (Large Scale Biology Corporation).
Today many proteomics initiatives are underway in industry and elsewhere, such as the Human Proteomics Initiative (HPI), an effort begun in 2000 by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. The goal of the HPI is to annotate each known protein, providing information that includes the description of protein function, domain structure, subcellular location, post-translational modifications, splice variants, and similarities to other mammalian proteins (4). Another major proteomics effort is led by the Human Proteome Organization (HUPO), a worldwide organization that engages in scientific and educational activities to encourage the spread of proteomics technologies and to disseminate knowledge pertaining to the human proteome and that of model organisms (5).
On which goals should these national and international efforts focus? Should they be limited to human proteomics or, like the Human Genome Project, include key model organisms? Perhaps the proteomes of the human pathogens should be included as well (e.g., the malaria parasite and other infectious microorganisms), and if so, in what order of priority? Should development of more efficient instrumentation (e.g., mass spectrometers, X-ray diffractometers, nuclear magnetic resonance spectrometers) and improved computational methodologies (e.g., high-speed computers and software useful in bioinformatics) be emphasized? What should be the role of major federal funding agencies (e.g., the National Institutes of Health, the National Science Foundation, the U.S. Environmental Protection Agency, and the U.S. Department of Agriculture)? What should be the role of academic laboratories? Should projects be supported mostly by individual research grants or program project (group effort) grants? What should be the role of the private sector, particularly those companies large and small that have a major stake in exploiting the results of the various genome projects and proteomics initiatives? How can all of these stakeholders cooperate most effectively while still maintaining proprietary information where appropriate? Should the overall goal be to understand the structure and function of all known proteins or should only those known to be involved in diseases be emphasized? After all, one must first understand function if one is to fully understand dysfunction. Is enough emphasis being given to the functional aspects of proteomics? Are studies on post-translational modifications of proteins and subsequent functional aspects included in "proteomics"? Hence the interest in organizing the one-day symposium reported herein.
| DISCUSSION OF GENERAL TOPICS COVERED AT SYMPOSIUM |
|---|
Somewhat limited operational definitions of proteomics were offered by some of the speakers. For instance, "In one sense it makes no difference at all: why should you call something proteomics or call it something else?" Dr. Cassman continued, "What we call things often conditions how we organize our thinking and our efforts." He explained that genome-driven target selection coupled to high-throughput technologies is what he believes structural genomics means. "It means you are using the genomes as the primary source for target selection." However, structural proteomics uses these features "plus the additive feature of full coverage of protein space, that is, completeness," stated Dr. Cassman. The goal of completeness is not meant to suggest, however, that smaller scale experiments, even including high-throughput analysis of specific tissues or subsets of proteins, would not be considered part of proteomics.
Of course there are many "-omics" along with proteomics including genomics, metabolomics, transcriptomics, interactomics and so on, which are collectively involved in the mandate of defining proteomics. However, we will restrain ourselves from commenting on other "-omics." Functional genomics and functional proteomics (which can encompass other omics as mentioned) are closely juxtaposed on a continuum along the path of discovering the detailed secrets of life and life processes.
The general topics covered at the symposium included
Dr. Cassman defined proteomics as a set of related options: "the analysis of complete complements of proteins present in defined cell or tissue environments (i.e., context-dependent) and their variation in space and time" (with credit given to Stan Fields for his contributions to this definition). One example of a proteomic effort is the Protein Structure Initiative of the National Institute of General Medical Sciences (NIGMS), which has as a goal the generation of a complete complement of protein structures in nature through the combination of direct structure determination and homology modeling. Although it requires high-throughput technology and genomic data for target determination, the goal of "completeness" is what distinguishes the effort as proteomics, according to Dr. Cassman.
The second part of his definition is exemplified by the use of microarrays to identify characteristic markers for cancer progression in specific tissue samples. These studies involve image and pattern recognition tools, which yield large-scale visualization of specific cell-dependent, context-dependent proteomic outputs.
The third part of the definition involves examining proteomic outputs in time and space. This requires not only the application of bioinformatics tools but also computational biology, that is, the use of modeling and simulation. Complex systems analysis could be considered an important element in the larger picture of defining a proteome, and such analysis will require theoretical modeling of systems. Several examples of NIGMS initiatives that focus on mathematical modeling of complex biological systems were provided. One example of this is the protein structure initiative or structural genomics as some may call it, which is discussed later in this report.
While we may be far from defining a complete human proteome, approaching proteomics on an organellar basis provides goals that are perhaps achievable in our lifetimes. Remember that the first DNA genomes sequenced were those of bacteriophages, in the 1970s, followed in 1981 by the DNA sequencing of a human mitochondrial genome.
Consider also that the mitochondrion, which is estimated to be composed of about 2,000 proteins, presents a considerably more manageable problem and a microcosm of whole cell proteomics. With this in mind Nobel laureate Sir John Walker, head of the Dunn Medical Research Council Unit in Cambridge, UK, discussed his proteomic studies of mitochondria directed to resolving specific biological issues. Dr. Walker's work includes the definition of the protein complement assembled in the respiratory enzyme known as complex I, the identification of the biochemical functions of a family of transport proteins found only in mitochondria, and the discovery of phosphorylation-dephosphorylation pathways in mitochondria. These studies rely not only on mass spectrometric and bioinformatics tools but also on biochemistry and genetics. Such an integrated approach is proving to be quite rewarding in Dr. Walker's view, in terms of both understanding the biology of mitochondria and the technical development of new methods versus attempts to analyze the global complement of proteins in the organelle. It is also possible to focus on subcompartments of mitochondria, such as the inner mitochondrial membrane of so much interest to bioenergeticists.
In this report we have tried to avoid being constrained by a narrow definition of proteomics (e.g., merely quantitating protein levels) and have used the broad definition given earlier to allow a wide-ranging discussion of goals, techniques, opportunities, and challenges.
| LESSONS LEARNED FROM THE HUMAN GENOME PROJECT |
|---|
Dr. Collins said that the most important area for investment in proteomics right now is technology development so that we can move these methods in the direction of being able to tackle a mammalian proteome without facing enormous costs and problems with quality of the data.
A number of resources for genomics research continue to be generated that may help inform a proteomics effort, including multiple coverage of certain genomes and more specifically:
Dr. Collins referred to one publication: "Global Analysis of Protein Activities Using Proteome Chips" (8). He finished his presentation with a particular recommendation, not from a scientist but from a famous athlete (hockey star Wayne Gretzky). When asked why he was so good at playing hockey, and why he always seemed to score the key goals, Gretzky said, "It is very simple. You have got to skate where the puck is going to be." In the field of proteomics Dr. Collins said he was not sure where exactly the puck was going to be, but there were a lot of "Wayne Gretzkys" at the meeting, and Dr. Collins was glad to get a chance to listen to them.
| SOURCES OF PROTEINS |
|---|
To this end Dr. LaBaer described the FLEX Gene repository, which is currently being assembled by a consortium of about 20 different public and private research laboratories. "FLEX" stands for Full Length Expression ready. This repository will enable scientists to move several genes simultaneously from the master vector to any expression vector, which will allow researchers to screen for function by high-throughput experimentation. It is the intention of this consortium to make this collection of all human genes broadly available without restrictions on their use. The four self-defined objectives of the consortium are (1) identification of the genes, (2) assembly of clones, (3) sequence validation, and (4) distribution to the scientific community. One example of the success of this effort is the identification of two new genes that are likely involved in the migration of breast cancer cells through a membrane. The collaboration of public and private research groups raises certain legal issues, which include consideration of antitrust law.
Recombination-based cloning was presented as a high-throughput technology to enable the ready transfer of cDNAs from the supplied vector to one's own preferred expression vector. Dr. LaBaer described a protein purification scheme that was developed by a graduate student in his laboratory, Pascal Braun. "In the case of human proteins," Dr. LaBaer explained, "where it is not easy to produce these proteins in human cells, [the availability of large numbers of purified proteins] will require the use of heterologous [expression] systems such as bacteria." "To develop these methods," continued Dr. LaBaer, "Braun transferred a collection of 30 cancer genes into four different expression vectors, each one adding a different epitope tag. [Braun] then developed a two-hour automated protocol for purifying 96 proteins in parallel [and] has now purified over 330 different proteins using this approach." Braun and Yanhui Hu of the lab created a database that correlates the success of purification with various features of the proteins such as pI, GO annotation, subcellular localization, and domain structure. Dr. LaBaer said they found that the presence of certain domains such as SH2 domains or SH3 domains can predict success in purification.
Dr. LaBaer concluded with a description of a database derived from a computer program that searches the primary literature for abstracts that mention both a gene and a disease. The assumption is that a significant number of such occurrences may identify groups of genes associated with a given disease. This effort was presented as a task in progress, and interested scientists were invited to experiment with the database (9).
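The co-mention idea described above can be illustrated with a minimal sketch. This is not Dr. LaBaer's program: the function name `co_mentions`, the case-insensitive substring matching, and the placeholder gene and disease names are all assumptions made for illustration, and a real system would need proper handling of gene synonyms and ambiguous names.

```python
from collections import Counter
from itertools import product


def co_mentions(abstracts, genes, diseases):
    """Count abstracts that mention both a gene and a disease.

    Matching is a crude case-insensitive substring test (an assumption
    for this sketch, not the method used by the database in the text).
    """
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        found_genes = [g for g in genes if g.lower() in lowered]
        found_diseases = [d for d in diseases if d.lower() in lowered]
        # Each (gene, disease) pair is counted at most once per abstract.
        counts.update(product(found_genes, found_diseases))
    return counts


# Invented placeholder abstracts; these are not real literature data.
toy_abstracts = [
    "GENE_A is overexpressed in disease X.",
    "GENE_A and GENE_B interact in vitro.",
    "Disease X progression correlates with GENE_A levels.",
]

for pair, n in co_mentions(toy_abstracts, ["GENE_A", "GENE_B"], ["disease X"]).items():
    print(pair, n)
```

Pairs with high counts would then be candidates for closer inspection; the count alone says nothing about the nature of the association.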
Brian T. Chait from Rockefeller University described a proteomics approach to understanding cellular function. His group is interested in the mechanisms by which materials enter and exit the nucleus, in the isolation of multiprotein complexes, and in the determination of their cellular localization. The basic concept is to introduce a particular affinity tag to one of the proteins at its natural location in the chromosome, which is done by replacing the endogenous gene with a gene that codes for a protein with a tag on it or, as he termed it, "a piece of molecular Velcro." So long as the multiprotein complex is stable, the tag allows isolation of the associated interacting proteins. An application to the nuclear pore complex, a group of proteins involved in nuclear trafficking, was described extensively. The complex as isolated has a molecular mass of 50 million daltons. Interestingly, in the initial purification experiments it contained about 180 interacting proteins, but upon further fractionation only around 50 were found to comprise the complex. The individual proteins are identified by mass spectrometry, which has the power to provide additional information about phosphorylation sites.
Preliminary experiments describing the use of this approach to follow proteins at different points in the cell cycle and in the regulation of chromatin were mentioned briefly. The genomic tagging and mapping approach can be used to gain analogous information about a number of other systems. Most importantly this approach can show where the protein is localized within the cell, how much is present, when the protein is present and for how long, with what it is interacting, and even something about the topology of the protein complexes.
| PROTEIN SEPARATION |
|---|
Denis Hochstrasser from the University of Geneva, a founder of GeneProt Inc., GeneBio SA, and the Swiss Institute of Bioinformatics and one of the pioneers in the identification of proteins in 2D gels, took the lead in dealing with the topic of protein separation. He stated at the outset that he wanted to play the role of "devil's advocate": to describe some of the excitement in proteomics but also to describe some of the difficulties. He outlined the scale of concentrations at which one can look for proteins: the millimolar (10⁻³), micromolar (10⁻⁶), nanomolar (10⁻⁹), picomolar (10⁻¹²), femtomolar (10⁻¹⁵), attomolar (10⁻¹⁸), zeptomolar (10⁻²¹), and yoctomolar (10⁻²⁴, which is less than one molecule per liter) ranges. When one considers human blood, for example, Hochstrasser noted, "typically you only see albumin, immunoglobulin, and transferrin," whereas cardiac markers such as troponin are present at nanomolar concentrations, and insulin-like growth factor or insulin are in the picomolar range. Parathyroid hormone is in the low picomolar range and Tumor Necrosis Factor is found in the femtomolar range (see Fig. 1).
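The remark that a yoctomolar solution holds less than one molecule per liter follows from multiplying the molar concentration by the Avogadro constant. The short sketch below works through that arithmetic for each range named above; the function and dictionary names are illustrative choices, not part of the original presentation.

```python
# Avogadro constant (exact SI value), in molecules per mole.
AVOGADRO = 6.02214076e23

# Concentration ranges named in the text, as powers of ten in mol/L.
PREFIX_EXPONENTS = {
    "millimolar": -3,
    "micromolar": -6,
    "nanomolar": -9,
    "picomolar": -12,
    "femtomolar": -15,
    "attomolar": -18,
    "zeptomolar": -21,
    "yoctomolar": -24,
}


def molecules_per_liter(molar):
    """Convert a concentration in mol/L to a number of molecules per liter."""
    return molar * AVOGADRO


for name, exponent in PREFIX_EXPONENTS.items():
    count = molecules_per_liter(10.0 ** exponent)
    print(f"{name:>11}: {count:.3g} molecules per liter")
```

At 10⁻²⁴ mol/L the result is roughly 0.6 molecules per liter, while a zeptomolar solution still contains several hundred molecules per liter.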
For experimental studies a considerable amount of starting material, such as blood, is needed to obtain levels of the various proteins high enough to be detected by today's methods. Since a 2D gel has a dynamic range of only 10⁴, Hochstrasser stated, "if anyone used [a] 2D gel from crude plasma, you never go below the micromolar range." Hochstrasser noted, for example, that starting with 1 mL of sample leads to roughly a nanomolar limit of detection. He further explained that starting with a much larger volume (e.g., 5 to 10 liters of plasma) is necessary to achieve detectability in the lower picomolar range. Clearly, prefractionation of proteins, individually or as subgroups, is essential to reach the dynamic range of detectability required for cell and tissue lysates as well as plasma.
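The relation between starting volume and detection limit can be sketched under one simplifying assumption that is not stated in the source: that the method needs a fixed minimum amount of protein, so the lowest detectable concentration scales inversely with the volume processed. The anchor point (1 mL giving roughly a nanomolar limit) is taken from the text; everything else, including the function name, is illustrative.

```python
# Anchor point quoted in the text: about 1 mL of sample gives
# roughly a nanomolar limit of detection.
ANCHOR_VOLUME_L = 1e-3   # 1 mL, in liters
ANCHOR_LIMIT_M = 1e-9    # about 1 nM, in mol/L

# Assumption for this sketch: the smallest detectable amount of
# protein is constant, so limit = amount / volume.
MIN_DETECTABLE_MOL = ANCHOR_VOLUME_L * ANCHOR_LIMIT_M


def detection_limit_molar(volume_liters):
    """Lowest detectable concentration (mol/L) for a given starting volume."""
    if volume_liters <= 0:
        raise ValueError("volume must be positive")
    return MIN_DETECTABLE_MOL / volume_liters


for volume in (1e-3, 1.0, 5.0, 10.0):
    limit = detection_limit_molar(volume)
    print(f"{volume:g} L of sample: limit of about {limit:.1e} mol/L")
```

Under this assumption, moving from 1 mL to 5 to 10 liters lowers the limit by close to four orders of magnitude, which is the same order-of-magnitude shift from the nanomolar toward the picomolar range that Hochstrasser described. Real losses during handling of large volumes would make the practical gain smaller.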
In subsequent discussion it became clear that even the best large-format 2D gels are inadequate for studies of the global range of expression, perhaps still inadequate by a factor of 10; therefore at least a 10-fold fractionation prior to large-format 2D gel separation would be required. Unfortunately, many membrane proteins do not enter 2D gels effectively. This presents a formidable challenge for the field.
In his presentation, Julio Celis from the Institute of Cancer Biology and the Danish Center for Human Genome Research in Aarhus, Denmark, also spoke about methods and challenges in the area of protein separation. He stated that "for the study of tissue biopsies the use of high-resolution 2D electrophoresis is the method of choice [for separations] as non-gel high-throughput technologies based on chromatography-mass spectrometry are not yet ready for the study of tissue samples." He stated that 2D gel technology in combination with mass spectrometry can be used to establish comprehensive databases of protein information that can be useful in the clinical setting. He also made the important point that data in a given cell type can be valuable to the study of other cell types since 80 to 90 percent of the proteins are believed to be shared by all cell types. While many structural and metabolic gene products may be the same between all cells, as one reviewer pointed out, cell-specific proteins will be important for understanding function and disease.
An afternoon breakout session, devoted to the topic of "protein separation and identification," was led by Julio Celis; Alain Van Dorsselaer, Louis Pasteur University, CNRS; and A. L. Burlingame of the University of California, San Francisco. Most of the 16 discussants were experts in mass spectrometry. The discussants concluded that the issue of sample preparation and purification has been sadly neglected at most meetings dealing with proteomics. There was the impression among some of the discussants that protein biochemists were developing and using methods to purify proteins that were not being adequately defined compositionally by mass spectrometrists interested in proteins. They envisioned setting up "core centers of excellence" in proteomics where innovation, mobility of people and ideas, and training can all occur. These core centers might also lead to spin-offs for the development of new instrumentation. Resources required to support a broad proteomic effort could be in the form of sample collections, standardization of data across platforms, and ligands that allow assaying of individual proteins, to name just a few. These centers would complement the work of scientists in individual, relatively small laboratories where more open-ended, curiosity-driven research can occur. Even when better strategies for protein mixture fractionation are in hand, new developments in mass spectrometry will be needed to extend the dynamic range of detectability of protein samples, especially for proteins that are post-translationally modified.
| PROTEIN IDENTIFICATION |
|---|
Denis Hochstrasser brought into sharp focus the disparity between the sensitivity of current protein-detection methods and the proteomics community's expectations regarding the sensitivity required to identify the complete proteome of a target cell or tissue. Cell and receptor based assay systems can detect peptides in the femtomolar range. These methods define the lower limit of our detection ability but, unfortunately, are applicable only to small sets of proteins. 2D gel electrophoresis, still an experimental mainstay in the proteomics community, can detect protein concentrations as low as micromolar, a sensitivity sufficient to identify 100 plasma proteins, not including modified forms. Under ideal conditions, including the sieving out of abundant proteins, mass spectrometry can extend sensitivity three orders of magnitude to the nanomolar level. Mass-spectral proteome screening is being carried out on an industrial scale by GeneProt Inc., one of the world's first large-scale proteomic R&D centers. The facility houses 40 Tandem Mass (MSMS) spectrometers, each serving two High Performance Liquid Chromatography (HPLC) machines. With each spectrometer running two samples per hour, the facility is capable of performing 1,920 MSMS-characterized HPLC profiles per day, remarkably with very little human intervention. The sensitivity can be extended to the picomolar range by preparing single mass-spectrometry samples from 10 to 15 liters of the target.
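The daily throughput figure can be reproduced from the numbers quoted above, assuming the instruments run around the clock (24 hours of operation is an inference from the arithmetic, not a statement in the source). The function name is an illustrative choice.

```python
def profiles_per_day(spectrometers, samples_per_hour, hours_per_day=24):
    """Number of MSMS-characterized HPLC profiles produced per day.

    hours_per_day defaults to 24, which assumes continuous operation.
    """
    return spectrometers * samples_per_hour * hours_per_day


# 40 spectrometers, each running two samples per hour.
print(profiles_per_day(40, 2))  # 1920
```

The match with the quoted 1,920 profiles suggests the figure describes uninterrupted operation, with no allowance for calibration or maintenance downtime.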
Using a strategy that circumvents the need to detect a protein or to know its molecular function, Dr. Hochstrasser and colleagues are synthesizing proteins as large as 25 kDa in sufficient quantity and purity to immediately search for the effects of overexpression after injection in living systems, allowing them to move quickly from interesting protein candidates identified using informatics screens, to cellular or organism physiological response.
Dr. Hochstrasser underscored the well-known fact that mass-spectral identification is in general far more successful for peptides than proteins. He outlined a technological innovation that integrates the protein-separating resolution of 2D gels with the sensitivity of peptide-mass spectrometry. The method sandwiches a protease-impregnated membrane between a delivery membrane (that carries protein previously transferred from a 2D gel) and a capture membrane. The proteins are cleaved into peptides during electrophoretic transfer to the capture membrane, and are then desorbed from small registered sections of the membrane, using a laser, and finally delivered directly into the mass spectrometer. Dr. Hochstrasser called this new technology "The Molecular Scanner" (see Fig. 2).
The consensus of the group was that many areas are in need of development if the goal of defining the composition and behavior of proteomes is to become a reality. The perceived needs ranged from technologies that can determine the organization of the cellular matrix and the functions of the proteins in it, to single-cell proteomics, and the comprehensive analysis of post-translational modifications. Important practical issues were raised, like the need to standardize data across different technology platforms and how to organize the enormous volume of information being created daily. "The bottom line is, there is just a lot of work to be done," said Dr. Van Bogelen. "We need money invested into developing technologies. And we really need to have students in this area who are moving this field into the next generation."
| DATA COLLECTION |
|---|
The technologies to perform various types of proteomic measurement are not mature and thus are limited in capacity (see Fig. 3). Dr. Aebersold's group has developed a general approach to quantitative proteomics based on automated tandem mass spectrometry, stable isotope dilution theory, and a suite of bioinformatics tools for data analysis (10). Dr. Aebersold described the approach as follows: "Stable isotope signatures are introduced into proteins at specific sites by means of chemical reactions. Later these signatures are deconvoluted by a mass spectrometer and serve as the basis for accurate quantification of each labeled protein. The objective of the initial implementation of this technology has been quantitative protein profiling. The method is based on a class of reagents called isotope coded affinity tags (ICAT reagents) (see Fig. 4) and the method is schematically illustrated (see Fig. 5). By changing the specificity of the reagent, the approach becomes generic for different quantitative proteomic measurements. Work is underway to extend this approach to determine profiles of enzyme activities (an area pioneered by Ben Cravatt from the Scripps Research Institute), and to protein linkage analysis, and protein phosphorylation profiles."
In collaboration with Sciex (a manufacturer of mass spectrometers) Dr. Aebersold's group has developed a mass spectrometry system that he refers to as "smart data acquisition." The system is based on a matrix-assisted laser desorption ionization (MALDI) quadrupole time-of-flight mass spectrometer (QSTAR, Sciex) and is illustrated (see Fig. 6). This system allows one to quantify all the detected peptides first and then to selectively sequence only those that show an interesting quantitative change (11). "By focusing the sequencing efforts on those peptides that show a change in quantity, the analysis is focused on those peptides that are relevant to the question asked, and the number of required sequencing operations is reduced by approximately an order of magnitude," stated Dr. Aebersold.
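The quantify-first, sequence-later logic can be sketched as a simple filter. This is only an illustration of the selection step, not the instrument's actual software: the function name, the input format (a mapping from peptide identifier to an abundance ratio between two samples), and the two-fold threshold are all assumptions.

```python
import math


def select_for_sequencing(peptide_ratios, min_fold_change=2.0):
    """Return identifiers of peptides whose abundance ratio changed enough.

    peptide_ratios maps a peptide identifier to the ratio of its abundance
    in sample A over sample B. A peptide is selected when the ratio differs
    from 1 by at least min_fold_change in either direction. The default
    threshold is a hypothetical choice for this sketch.
    """
    threshold = math.log2(min_fold_change)
    return [
        peptide_id
        for peptide_id, ratio in peptide_ratios.items()
        if ratio > 0 and abs(math.log2(ratio)) >= threshold
    ]


# Invented ratios for illustration only.
toy_ratios = {"p1": 1.0, "p2": 2.5, "p3": 0.4, "p4": 1.2, "p5": 0.9}
print(select_for_sequencing(toy_ratios))
```

Working on the log scale treats increases and decreases symmetrically, so a peptide at 0.4 is selected just as one at 2.5 is. Only the selected peptides would then be passed on for sequencing, which is where the saving in sequencing operations arises.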
| PROTEOMICS AND THE PROBLEM OF FUNCTION |
|---|
The function of a protein can be defined in many different ways depending on the experiments being done and the questions being asked. It may be useful to preface the word "function" with an adjective that specifies the nature of the effect that the protein produces. "Chemical function" refers to the general type of reaction catalyzed in the case of an enzyme; for non-enzymes this term is not applicable. "Biochemical function" refers to the specific substrates used, the products produced and the mechanism of the transformation between them in the case of an enzyme; the specific molecules bound, and the response produced in the case of a receptor, scaffold, regulatory protein or channel, and so forth. "Cellular function" refers to the pathway(s) in which the protein operates. These pathways are created by the combined biochemical functions of the proteins involved. Note that there exists a hierarchy of functions in which a complete understanding of function at each level depends on the information from the previous levels. The hierarchy continues with functions defined at the level of a phenotype of, say, a knockout: this may be manifest in the effect on a single cell or on an organelle or entire organism. Finally, it is possible to define function at the level of the effect of the loss or mutation of that protein on the development of a higher organism from embryo through to adult.
Function is not a fixed property for many if not most proteins. There are many ways that gene products can be altered to elicit modified or completely new functions. For example, there exist
These modifications can modulate biochemical function either directly or indirectly by altering the pathway in which a gene product operates. Cellular function can be changed similarly. Cellular function, and in some cases even biochemical function can also be changed simply by changing the location where the protein is found in the cell or by binding it to another protein or small molecule. A proteomics study that aims at understanding function is incomplete without taking these aspects into account.
One breakout session addressed "Metabolic Pathways and Post-Translational Modifications," which defined "function" to the group participants, as reported by Edward Dennis, University of California, San Diego, and Eugene Bruce, National Science Foundation. It was noted that most speakers had emphasized inventorying, categorizing, high-throughput screening, methods, and qualities such as completeness in defining proteomics. While these were the goals of the genomics revolution, they should not be the goals of proteomics, stated some of the participants. In contrast with genomics, which is finite in scope, proteomics, especially when function is included, is essentially without limit. "Whatever is done, completeness will be very difficult, if not impossible, to achieve from the viewpoint of function in proteomics," remarked session co-chair Edward Dennis. "Proteomics is many orders of magnitude more complex than genomics. It has been suggested that there are about 300,000 human proteins, thinking only about splice variants and post-translational modifications."
The list goes on and on when trying to get one's hands around the number of discrete proteins that exist. Thus, instead of trying to count proteins some researchers suggested focusing on the life cycle of a protein. During its lifetime a protein undergoes phases of translation, maturation, regulation, and termination. Each of these phases involves numerous discrete proteins that interact with each other, and each phase involves protein modifications; so a given protein exists in an enormous number of discrete compositions and complexes, as stated previously in the report. There also exist many states imposed by protein:protein interactions, which change the nature of a protein, by ligands including a variety of metal ions, by activators, inhibitors, inducers, and this list goes on and on. There are even non-enzymatic modifications like oxidation that occur; so a given protein may exist in the cell in different redox states. Thus, this is really a combinatorial problem, explained Dr. Dennis, with both transient, and one might call them somewhat permanent, changes that occur to the protein as it undergoes its life cycle. Structural changes in the protein conformation can also lead to the development of a disease state. Prion diseases are clear examples where modifications in the structural conformation of a benign protein can lead to changes in normal function. An understanding of all the structural and functional states for even a single gene product is a huge, complex task, but one that must be considered when annotating proteins.
A number of conclusions arise from these considerations. The first is that a complete description of the function of any gene product must include aspects of both spatial and temporal changes in the protein, including changes of state. Gerald Carlson from the University of Missouri, Kansas City, suggested that we are most interested in the steady-state proteins that exist at some point in metabolism, but we are also interested in looking at all other states on the way to and after steady state. To begin to handle proteomics conceptually one must integrate the experimental results with the enormous amount of data on the computational side, and it is a huge undertaking to even begin to figure out how to relate all those states that are so important.
An interesting example of a chemical function genomics program was given by Thomas Leyh, co-chair of the "Structure Function" breakout session, who outlined an initiative intended to provide a functional genomics counterpart to the structural initiative already under way. The core of this multifaceted program, the subject of a recent National Institutes of Health workshop, is to perform large-scale mutagenesis and protein functional studies to create a database that assigns catalytic, ligand-binding, or other functions to the highly conserved, non-structural core residues for every protein family. A compendium of molecular function annotation will be propagated across relevant databases to establish links and assign molecular function to specific biological phenomena. While the design of the program tightly couples it to the structural genomics initiative (whose mission is to provide a representative structure for each protein family) it also includes interfaces with programs dedicated to the identification of protein function and the development of bioinformatic recognition algorithms, among others. Such a compendium would be extremely valuable to the biochemical and other scientific communities, and the program would establish the classical structure-function equation on a genomic scale.
Proteomics is far more complex than a simple profiling of the protein content of a cell, even with potential modifications of the proteins and protein:protein interactions included. Profiling of gene expression or protein expression is a useful tool but in most instances gives little direct information about biochemical function, although sometimes cellular functions do emerge. Among other problems with these approaches the correlation between mRNA levels and protein levels is poor for all but the most highly expressed genes. The view of function presented here makes this complexity apparent. A final point was that the field needs more emphasis on what a protein does, not just which proteins exist under what conditions.
Bioinformatics
There are two ways that function is being determined at this time on a genome-wide scale. One is essentially bioinformatics driven and the other uses structural information. Bioinformatics involves, among other things, sequence comparisons and structure comparisons. These can be carried out on a genome-wide scale, as are comparisons of profiles of gene expression. Proteomics, as it is currently implemented in most instances, is geared towards comparisons of datasets of profiles of protein expression, usually determined by mass spectrometry.
Sequence comparison can be powerful, especially if families of related sequences are identified. However, it is becoming apparent that function can diverge markedly not only when two sequences differ by 50 percent or more; in some instances sequences that are more than 90 percent identical code for proteins that operate on completely different substrates and have no cross-reactivity. Assignment of biochemical function from sequence data alone should always be regarded as tentative without confirmatory experimental evidence. Most functional annotation errors in genomics databases probably arise this way.
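As an illustration of what a figure such as "90 percent identical" measures, the sketch below computes percent identity over two sequences that have already been aligned. This is one common definition (matches divided by ungapped aligned positions); other definitions use a different denominator, and the function name and example peptides are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over positions where neither sequence has a gap.

    Both inputs must come from the same pairwise alignment and therefore
    have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


# Two made-up ten-residue peptides differing at the last position.
print(percent_identity("MKTAYIAKQR", "MKTAYIAKQK"))  # 90.0
```

A high score from a calculation like this says nothing about which residues differ, so a single substitution in a substrate-binding site can change function while leaving the identity score almost unchanged.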
Structural Proteomics
Among the possible experimental ways of approaching the problem of function determination on a large scale, the one that has received the most emphasis thus far is the use of structural information. Predicated on the assumption that the three-dimensional structure of a protein will often provide information about its biochemical and cellular functions, the structural approach is being applied on a genome-wide scale in a number of independent initiatives. Although in many instances at least the chemical function of an enzyme can be guessed from its overall fold, even that deduction is often problematic, and assignment of higher levels of function is practically impossible without additional information. This problem is exacerbated when membrane-associated proteins are considered. Between 25 and 40 percent of the proteins in the cell are estimated to be membrane associated (depending on the organism). The database of membrane protein structures is very small and the methods for determining those structures are very difficult and uncertain.
Cheryl Arrowsmith, a structural biologist from the Ontario Center for Structural Proteomics at the University of Toronto, discussed her group's research on structural proteomics. She distinguished structural proteomics from structural genomics on the grounds that her group works on proteins, not genes. The focus of her proteomics research is to use X-ray crystallography and NMR spectroscopy to determine the three-dimensional structures of proteins on a genome-wide scale. She is particularly interested in examining the extent to which protein structure can reveal protein function. The model system used is Methanobacterium thermoautotrophicum, whose sequence was completed at the time the project was initiated in 1998. Since that time, her laboratory has evaluated thousands of proteins by subcloning into bacterial expression systems and performing either NMR studies or X-ray diffraction on soluble and relatively clean purified protein. They have also evaluated hundreds of proteins from a number of different bacterial, viral, and yeast genomes. However, the number of proteins that yielded samples suitable for structure determination was low. "There is a huge attrition rate in going from cloned genes to those that can be readily expressed in bacteria, are soluble in bacteria, can be purified, give good crystals or promising NMR spectra, and these would be very good in terms of getting a structure." The overall attrition rate is about 85 to 95 percent of the genes that are tried; in other words, approximately 5 to 15 percent of bacterial or archaebacterial genes can be processed straight through to three-dimensional structures using a single protocol (e.g., single expression conditions, single purification procedure), according to Dr. Arrowsmith (12). The numbers are worse for eukaryotic systems. "Clearly one needs to try multiple procedures for protein expression, purification, and crystallization in order to improve the success rate for structures," said Dr. Arrowsmith.
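The arithmetic behind such an attrition funnel can be sketched in a few lines of Python. Only the overall 5 to 15 percent success range comes from the text above; the per-stage fractions and the pool of 1000 genes are illustrative assumptions, not figures reported at the symposium, and the function names are hypothetical.

```python
def overall_success(stage_fractions):
    """Multiply per-stage pass fractions to get the end-to-end fraction."""
    result = 1.0
    for fraction in stage_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each stage fraction must lie between 0 and 1")
        result *= fraction
    return result


def expected_structures(n_genes, success_percent):
    """Expected number of structures from n_genes at a given percent success."""
    return n_genes * success_percent / 100


# Overall range quoted in the text, applied to a hypothetical pool of 1000 genes.
print(expected_structures(1000, 5))   # 50.0
print(expected_structures(1000, 15))  # 150.0

# Assumed (not reported) pass fractions for four stages: expression,
# solubility, purification, and crystals or usable NMR spectra.
print(overall_success([0.5, 0.5, 0.5, 0.5]))  # 0.0625
```

The second calculation shows how quickly losses compound: even if half of the candidates survive each of four stages, only about 6 percent reach a structure, which falls within the quoted range.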
She has confirmed these difficulties in a number of other species and systems, and she reported that many of the other National Institutes of Health centers participating in the project are seeing these sorts of statistics as well. Only in a few cases have they had the opportunity and actually gone on to do functional studies of these proteins. Even with proteins of known function, such as spermidine synthase, the determination of structure can be useful in proposing an atomic model and thus a better understanding of the mechanism of enzymatic function. Dr. Arrowsmith's group was among the first to solve the structure of this protein. Thousands of clones and proteins have been prepared in the Ontario Center for Structural Proteomics and in many of the other centers, and these clones are available for further functional analysis. "I think this is a huge resource that is being generated, and it should be exploited through projects that emphasize [biochemical] functional analysis of proteins," said Dr. Arrowsmith.
Cellular Function
Protein location can be determined by such genome-wide techniques as green fluorescent protein (GFP) tagging, and protein:protein interactions can be determined by affinity chromatography, immunoprecipitation, and yeast two-hybrid experiments. Databases resulting from these methods are beginning to emerge, but they are of uncertain accuracy. Recent comparisons of independently obtained databases for yeast proteins suggest that location determination is fairly robust but protein:protein interactions are at best determined with less than 50 percent overall accuracy. Clearly more reliable methods are needed, and efforts to create protein chips for profiling of interactions with proteins and small molecules appear promising.
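One simple way to compare two independently obtained interaction datasets is to measure how many interacting pairs they share. The sketch below treats each interaction as an unordered pair and reports the shared fraction (a Jaccard index). The protein identifiers are placeholders, the function names are hypothetical, and agreement between two datasets is only a proxy for accuracy, not a measurement of it.

```python
def normalize(pairs):
    """Turn (a, b) tuples into unordered pairs so (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in pairs}


def shared_fraction(pairs_a, pairs_b):
    """Jaccard index: interactions in both datasets / interactions in either."""
    a = normalize(pairs_a)
    b = normalize(pairs_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Placeholder datasets (P1..P4 are not real proteins).
dataset_a = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
dataset_b = [("P2", "P1"), ("P3", "P4"), ("P1", "P4")]
print(shared_fraction(dataset_a, dataset_b))  # 0.5
```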
One useful addition to the available arsenal of function-finding tools would be a database of three-dimensional motifs of biochemical function. Such a database would contain those structural elements that participate in ligand binding and catalysis for proteins of known function. This database could be searched in a manner similar to sequence database searches whenever a new protein structure is determined. Another useful tool would be, for each protein family, a database of mutations with functional characterization. Essentially this database would provide a link between a mutation at a particular site, a genetic lesion, a metabolic lesion and even a phenotype such as a disease.
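A record in the proposed mutation database might be organized along the lines sketched below. The class, field names, and placeholder entries are assumptions made for illustration; the report does not specify a schema, and no real mutation data are shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MutationRecord:
    """One functionally characterized mutation within a protein family."""
    protein_family: str
    position: int                 # residue position in the family alignment
    wild_type: str                # one-letter code of the original residue
    mutant: str                   # one-letter code of the substituted residue
    functional_effect: str        # e.g. measured change in catalysis or binding
    metabolic_lesion: Optional[str] = None
    phenotype: Optional[str] = None


def records_at_site(records: List[MutationRecord], family: str, position: int):
    """Return every record for a given family and residue position."""
    return [
        r for r in records
        if r.protein_family == family and r.position == position
    ]


# Placeholder entries only; FAMILY_X and the effects are not real data.
database = [
    MutationRecord("FAMILY_X", 42, "D", "A", "placeholder: activity lost"),
    MutationRecord("FAMILY_X", 42, "D", "E", "placeholder: activity reduced"),
    MutationRecord("FAMILY_X", 77, "K", "R", "placeholder: no change"),
]
print(len(records_at_site(database, "FAMILY_X", 42)))  # 2
```

Keeping the metabolic lesion and phenotype as optional fields reflects the idea in the text that the chain from mutation to disease will be complete for only some entries.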
Once again it was stressed that proteomics should be considered as a much broader field than would be apparent from early efforts, which have focused on cataloging levels of protein expression. Ideally it should encompass efforts to obtain complete functional descriptions for the gene products in a cell or organism. Because of the complexity of functional description, clearly more than one technique is required and no one existing technique should be emphasized in preference to any others. This goal may be beyond the reach of existing technologies, even for small numbers of proteins, but it is the direction in which the field must go.