A Gene-centric Human Proteome Project


HUPO-THE HUMAN PROTEOME ORGANIZATION
The success of the Human Genome Project (1, 2) has provided a blueprint for the gene-encoded proteins potentially active in all of the hundreds of cell types that make up the human body. Yet we still have limited knowledge regarding a majority of the approximately 20,000 protein-coding human genes discovered through the genome project (3, 4). At present, about 8,000 (38%) of these genes lack experimental evidence at the protein level (UniProt), and for many others there is very little information related to protein abundance, distribution, subcellular localization, and function.
The diagnostic, prognostic, and therapeutic value of understanding human biology at the protein level argues for a systematic effort to map the human proteome. The proteomic space generated from these gene products is enormous, including up to a million different protein molecules derived by combinatorial recombination of DNA (immunoglobulins and T-cell receptors), alternative splicing of RNAs, and numerous protein modifications of various types that vary with time and with physiological, pathological, and pharmacological perturbations. Hochstrasser (5) therefore recently argued for a protein-centric human proteome project, driven by mass spectrometry technology focusing on the protein perturbations caused by human diseases. Our goal is to define clear endpoints of a Human Proteome Project, combining the strengths of complementary technology platforms.
We therefore propose a gene-centric approach to generate a human proteome map with an "information backbone" about the proteins expressed from each gene locus and to make this information publicly available with no restrictions, as was done with the genome sequence data, thereby facilitating in-depth studies to understand human biology and diseases. In further analogy to the genome project, the gene-centric human proteome map can be complemented with in-depth studies of protein variability relevant to life stages and various diseases.
Reasonable endpoints of such a Human Proteome Project would be feasible within a limited time period and achievable without major paradigm shifts in technology. Taking into account recent major advances in mass spectrometry (6) and immunobased methods (7), we propose a systematic three-part approach to ensure that, for each predicted protein-coding gene, at least one of its major representative proteins will be characterized in the context of its major anatomical sites of expression, its abundance, and its interacting protein partners:
1. Protein parts list: the identification and characterization of at least one representative protein from every human gene with its abundance and major modifications. This would define the backbone of a human proteome encyclopedia.
2. Protein distribution atlas: determination of protein profiles of at least one representative protein from every human gene in all major normal tissues and organs with single-cell resolution, as has been initiated (7). This effort already presents immunohistochemical characterization for 6,800 proteins corresponding to approximately one-third of the protein-coding genes. The atlas should also include a subcellular localization map in which the relative distribution of the protein in various organelles and other subcellular structures is determined in selected human cell lines.
3. Protein pathway and network map: initial characterization of the transient and stable interactions and complexes of human proteins with other proteins (8) that contribute to cellular protein pathways. Networks of interacting proteins and pathways will be essential to describe biological functions at the molecular level, to understand disease processes, and to generate therapeutic targets; over time the interactions could be expanded to nucleic acids, lipids, and other molecules.
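As an illustration only, the three-part, gene-centric "information backbone" described above could be pictured as one record per gene locus, with a simple measure of how many genes have data for each part. The following Python sketch is hypothetical: the class name `GeneEntry`, its field names, and the `coverage` function are assumptions made for this example and do not correspond to any existing HUPO database schema or data exchange format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Names of the three proposed parts (labels are hypothetical, chosen for this sketch).
PARTS = ("parts_list", "distribution_atlas", "network_map")


@dataclass
class GeneEntry:
    """Hypothetical record holding the proposed information for one gene locus."""
    gene_id: str  # illustrative identifier, not a real accession

    # 1. Protein parts list: representative protein, abundance, major modifications
    representative_protein: Optional[str] = None
    abundance: Optional[float] = None  # units deliberately left unspecified
    modifications: List[str] = field(default_factory=list)

    # 2. Protein distribution atlas: tissue profile and subcellular localization
    tissue_expression: Dict[str, float] = field(default_factory=dict)
    subcellular_locations: List[str] = field(default_factory=list)

    # 3. Protein pathway and network map: interacting protein partners
    interaction_partners: List[str] = field(default_factory=list)

    def completed_parts(self) -> Dict[str, bool]:
        """Report which of the three parts has at least some data for this gene."""
        return {
            "parts_list": self.representative_protein is not None,
            "distribution_atlas": bool(self.tissue_expression),
            "network_map": bool(self.interaction_partners),
        }


def coverage(entries: List[GeneEntry]) -> Dict[str, float]:
    """Fraction of genes that have data for each of the three parts."""
    if not entries:
        return {part: 0.0 for part in PARTS}
    counts = {part: 0 for part in PARTS}
    for entry in entries:
        for part, done in entry.completed_parts().items():
            if done:
                counts[part] += 1
    return {part: counts[part] / len(entries) for part in PARTS}
```

Organizing the record by gene rather than by protein species is what would make completeness measurable against a fixed list of roughly 20,000 protein-coding genes, which is the sense in which the endpoints of such a project can be clearly defined.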
The protein profiling should ideally be performed using complementary mass spectrometry and protein capture technology platforms with proper standardization (9, 10) to allow comparative studies and with emphasis on quantification to enable systems biology analyses. In addition, knowledge from model organisms can be used to complement studies of humans as has been so productive for the genome. Complementary technology platforms, such as mRNA profiling, gene knock-outs, short interfering RNA silencing, green fluorescent protein fusions, and gene tagging, can be used with bioinformatics to validate and integrate the results. At least in the first phase, the various efforts can be pursued in a federated, decentralized manner involving many leading research groups in different regions of the world with coordination to avoid redundancy and to ensure standardization and completeness. In this way, each region can direct its funding to projects of relevance for its interests and needs. An important task will be to integrate the enormous data flow from these analyses; implementation of standard data exchange formats is essential (11).
In summary, we propose a coordinated, international effort to map the protein complement of the human genome. The effort will deliver a publicly available resource of protein profiling data, protein-specific reagents, and cDNA clones covering all the protein-coding genes of the human genome. With the aid of current and emerging advances in a well-defined set of complementary technology platforms, we envision a genome-wide human protein resource comprising the proteome parts list, the protein distribution atlas, and maps of protein molecular pathways, interactions, and networks. This approach provides a foundation for disease-oriented protein-centric studies by integrating information from gene and protein function with clearly defined endpoints. The output will be a major step to define the proteomic landscape in the human body, support the discovery of new diagnostic and therapeutic tools, and stimulate biological and medical research.
Call for input: HUPO has organized a small international working group to develop a detailed plan for the vision, goals, deliverables, timetable, launch, and funding of this Human Proteome Project. Open discussions are planned for regional HUPO meetings around the world, leading up to a roll-out at the HUPO World Congress in Sydney, 19-23 September 2010. Comments and questions from the proteomics community are welcome through the "Contact us" link at the HUPO website, www.hupo.org.