A Human Proteome Detection and Quantitation Project*

The lack of sensitive, specific, multiplexable assays for most human proteins is the major technical barrier impeding development of candidate biomarkers into clinically useful tests. Recent progress in mass spectrometry-based assays for proteotypic peptides, particularly those with specific affinity peptide enrichment, offers a systematic and economical path to comprehensive quantitative coverage of the human proteome. A complete suite of assays, e.g. two peptides from the protein product of each of the ∼20,500 human genes (here termed the human Proteome Detection and Quantitation project), would enable rapid and systematic verification of candidate biomarkers and lay a quantitative foundation for subsequent efforts to define the larger universe of splice variants, post-translational modifications, protein-protein interactions, and tissue localization.

There is growing interest in the idea of a comprehensive Human Proteome Project (1) to exploit and extend the successful effort to sequence the human genome. Major challenges in defining a comprehensive Human Proteome Project (and distinguishing it from the genome effort) are 1) the potentially very large number of proteins with modified forms; 2) the diversity of technology platforms involved in their study; 3) the variety of overlapping biological "units" into which the proteome might be divided for organized conquest; and 4) sensitivity limitations in detecting proteins present in trace amounts. The process of analyzing and discussing these issues may (and ought to) be lengthy, as it addresses core scientific unknowns as well as decisions about the organization and scale of biomedical research in the future. The benefits of taking time to involve the entire biological research community, and especially the medical research segment, in these discussions are substantial.
Progress in systematically measuring proteins, however, need not wait for the conclusion of such discussions. We propose a near-term tactical approach, called the human Proteome Detection and Quantitation (hPDQ) project, that will enable measurement of the human proteome in a way that would yield immediately useful results while the strategy for a comprehensive Human Proteome Project is worked out. The hPDQ project is aimed at overcoming present difficulties in answering basic biological questions about the relationship between protein abundance (or concentration) and gene expression, phenotype, disease, and treatment response, i.e., the growing field of protein biomarkers. It is thus focused on the study of biological variation affecting protein expression rather than the study of structure and mechanism and in this initial form does not directly address splice variants or most post-translational modifications. It is aimed at providing immediately useful capabilities to the human biology research community in a way that does not adversely impact funding for individual investigators and does not generate administrative constraints on their ability to set and change course in the conduct of research. Specifically, the goal of the hPDQ is to enable individual biological researchers to measure defined collections of human proteins in biological samples with 1 ng/ml sensitivity and absolute specificity, at throughput and cost levels that permit the study of meaningfully large biological populations (∼500–5,000 samples).
We clearly do not have this capability today. If an investigator defines a set of 20 proteins hypothesized to change in relation to some biological process or event, assays for only a minority (often none!) will typically be available. Further, these assays will lack absolute specificity and will not easily be multiplexed. Current proteomics research platforms are focused mainly on discovery, providing increasingly broad protein sampling surveys, generally at low throughput and high cost. Such approaches generally do not yield an economical or accurate measurement of a defined set of proteins in every sample. There is thus a fundamental barrier to hypothesis testing in quantitative proteomics, where relationships between protein abundance and biology are sought. A particularly important instance of this limitation occurs in the effort to establish useful biomarkers of disease: for diagnosis, for measuring efficacy of treatment, and for monitoring of disease recurrence. This limitation is largely responsible for the research community's failure in recent years to bring forward significant numbers of new proteins as Food and Drug Administration-approved diagnostic tests (2). However, if a robust, economical, and widely diffused capability to measure all human proteins existed, the research community would have the collective means to assess the utility of all human proteins as biomarkers in hundreds of diseases and other processes in the most efficient way.
The need for new or improved biomarkers in many areas of healthcare has become critical. Early detection of cancer, coupled with surgical intervention, has the potential to radically improve survival (3), provided early markers exist and can be found. Without good biomarkers, degenerative diseases such as Alzheimer disease and chronic obstructive pulmonary disease (COPD) are difficult to detect early enough to benefit from potential therapies. Clinical development of new drugs increasingly depends on identification of biomarkers for pharmacodynamic assessment of drug action to help guide dose and schedule, and of predictive biomarkers for selection of patients who will benefit from therapy (4). Companion diagnostics are the currency of personalized medicine and represent those predictive or response biomarkers that are linked to specific therapeutics, substantially increasing their clinical value. Surrogate biomarkers (those that substitute for a clinical outcome or response) are the most difficult to discover and verify because of the long timeframe required but can radically shorten appropriate clinical trials. The impact of a vigorous increase in clinical biomarkers could thus be enormous, both in terms of patient well-being and the financial viability of healthcare systems worldwide.
Protein measurements are also likely to play an important role in assessing the quality of material stored in large clinical sample collections (biobanks). Much discussion has occurred recently regarding the value of banked samples because of unknown degrees of protein degradation occurring during acquisition, processing, and storage. This matter is of acute concern in the case of serum, where coagulation initiates a plethora of proteolytic cleavage events. The hPDQ may provide the opportunity to determine the value of each sample through the development of proteotypic peptides tracking the stability of labile proteins.
An attractive technology for achieving the objective of hPDQ is quantitative mass spectrometry, the sensitivity and specificity of which are well established in the measurement of small molecules (5, 6) and peptides (7, 8). To achieve comprehensive quantitation of proteins, given the immense variability in their physical properties, these larger molecules are digested to component peptides using an enzyme such as trypsin, and protein amount is measured using proteotypic peptides (9, 10) as specific stoichiometric surrogates. Multiple peptides from a target protein provide independent confirmation of this stoichiometry (equivalent to having multiple enzyme-linked immunosorbent assays with different antibody pairs), serving to control for the possibility of incomplete digestion or subsequent losses. Accurate calibration is achieved by spiking digested samples with known quantities of synthetic stable isotope-labeled peptides as internal standards (11, 12). The sensitivity of this approach for multiplexed analysis of proteins in plasma has been extended from microgram/ml (13) to nanogram/ml levels by depletion of abundant proteins and limited peptide fractionation prior to analysis (14) or by capture of the subset of glycopeptides (15). Sensitivity and throughput of peptide MS measurements can be further increased to the levels required in hPDQ by specific enrichment of the target peptides using anti-peptide antibodies. This method, called SISCAPA (for "stable isotope standards and capture by anti-peptide antibodies") (16) or iMALDI (for immuno-MALDI) (17), combines the enhanced sensitivity of immunoassays with the specificity of mass spectrometry while maintaining multiplexing capability. For these reasons we emphasize SISCAPA and iMALDI in this hPDQ proposal, although proteins in the 100 ng/ml or higher concentration range are readily accessible by targeted MS in plasma without antibody enrichment.
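The internal-standard arithmetic behind this calibration is simple: the ratio of the endogenous ("light") peptide signal to the spiked stable isotope-labeled ("heavy") standard, multiplied by the known spike amount, yields the endogenous peptide amount, which in turn serves as a stoichiometric surrogate for protein concentration. A minimal sketch, in which all peak areas, spike amounts, the sample volume, and the assumed 1:1 light/heavy response are purely illustrative:

```python
# Sketch of stable-isotope-dilution quantitation (illustrative values only).

def quantify_protein(light_area, heavy_area, spike_fmol,
                     sample_vol_ml, protein_mw_da):
    """Endogenous protein concentration (ng/ml) from a light/heavy peak-area ratio."""
    # Equal MS response for light and heavy forms is assumed here.
    endogenous_fmol = (light_area / heavy_area) * spike_fmol
    grams = endogenous_fmol * 1e-15 * protein_mw_da   # fmol -> mol -> g
    return grams * 1e9 / sample_vol_ml                # g -> ng, per ml of sample

# Example: light/heavy ratio of 0.5 against a 100 fmol spike,
# 0.01 ml of plasma, hypothetical 50 kDa protein.
conc = quantify_protein(0.5e6, 1.0e6, 100, 0.01, 50_000)
print(round(conc, 1))  # -> 250.0 (ng/ml)
```

Multiple peptides per protein would each be quantified this way, and agreement between them provides the digestion-completeness check described above.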
Combining these elements results in a measurement system with the potential to measure 10–100 selected proteins at ng/ml levels in small (∼10 µl) samples of human plasma in a single short analytical run. Sensitivity can be further increased through the use of larger samples and/or advances in MS sensitivity. In comparison to the conventional ELISA approach, MS-based SISCAPA assays are less expensive to develop (one antibody instead of a carefully matched pair), easier to multiplex (off-target interactions being less likely with peptides than proteins), and provide absolute structural specificity (by reading the masses of multiple specific peptide fragments). This improved specificity solves a major problem plaguing clinical immunoassays for proteins such as thyroglobulin (18) and has led to the development of the first clinical SISCAPA assay (19). In addition, because the mass spectrometer functions as a "second antibody" that identifies the captured peptides, the anti-peptide antibody used for peptide enrichment need not have perfect specificity. This greatly reduces the cost of affinity reagents, currently a limiting factor in developing ELISAs for large numbers of protein analytes.
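To make these sensitivity figures concrete, a back-of-envelope calculation (assuming, purely for illustration, a 50 kDa target protein) shows that the 1 ng/ml goal in a ∼10 µl plasma sample corresponds to only a few hundred attomoles of analyte, which is why sub-femtomole MS performance is required:

```python
# Back-of-envelope: analyte amount implied by the 1 ng/ml sensitivity goal.
# The 50 kDa molecular weight is an illustrative assumption.

NG_PER_ML = 1.0      # hPDQ sensitivity goal
SAMPLE_ML = 0.010    # ~10 microliter plasma sample
MW_DA = 50_000       # assumed protein molecular weight (g/mol)

grams = NG_PER_ML * 1e-9 * SAMPLE_ML   # total analyte mass in the sample
fmol = grams / MW_DA * 1e15            # mass -> moles -> femtomoles
print(f"{fmol:.2f} fmol")              # -> 0.20 fmol (i.e. 200 amol)
```

This is consistent with the amol-to-fmol instrument requirement stated for resource 4 below; larger sample volumes relax the demand proportionally.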
Achieving the hPDQ goal by this approach would require that four resources be generally available. 1) A comprehensive database of proteotypic (protein-unique) peptides for each of the ∼20,500 human proteins (20), coupled with experimental or computational data identifying the best peptides for MS measurement and associated optimized MS instrument parameters. 2) At least two synthetic proteotypic peptides, labeled with stable isotope(s) and available in accurately quantitated aliquots, for use as internal measurement standards for quantitation of each protein. Such peptides are readily available today through custom order, at rapidly declining prices.
3) Anti-peptide antibodies specific for the same two proteotypic peptides per target protein, capable of binding the peptides with dissociation constants < 10⁻⁹ M (the level required in theory and practice to enrich low-abundance peptides from complex sample digests). Such antibodies are now being made for a variety of targets, and a robust production pipeline is being developed. Monoclonal antibodies would be preferred, despite their higher development cost, to establish a stable reagent supply, especially for those targets that prove useful as biomarkers. 4) Robust and affordable instrument platforms for quantitative analysis of small (amol to fmol) amounts of tryptic peptides and for sample preparation. Existing triple-quadrupole mass spectrometers (with a current worldwide installed base of more than 6,000 instruments) coupled with nanoflow (∼300–600 nl/min) liquid chromatography systems can meet this requirement and are undergoing rapid improvement with declining cost. MALDI platforms may provide similar capabilities at even higher throughput.
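The affinity requirement in resource 3 can be motivated with a simple equilibrium model: when antibody binding sites are in excess over the target peptide, the fraction of peptide captured at equilibrium is approximately [Ab]/([Ab] + Kd). The 10 nM antibody concentration below is an assumption chosen only to illustrate how capture efficiency falls off as Kd rises:

```python
# Why sub-nanomolar affinity matters for peptide enrichment.
# Simple binding-equilibrium model; concentrations are illustrative.

def fraction_captured(ab_conc_M, kd_M):
    """Equilibrium fraction of peptide bound when antibody is in excess."""
    return ab_conc_M / (ab_conc_M + kd_M)

AB = 10e-9  # assume 10 nM free antibody binding sites
for kd in (1e-10, 1e-9, 1e-7):
    print(f"Kd = {kd:.0e} M -> {fraction_captured(AB, kd):.0%} captured")
# Kd = 1e-10 M -> 99% captured
# Kd = 1e-09 M -> 91% captured
# Kd = 1e-07 M -> 9% captured
```

At the stated < 10⁻⁹ M threshold, most of a trace peptide is recovered; a typical 10⁻⁷ M antibody would lose over 90% of it, defeating the enrichment step.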
We estimate that an initial pilot phase targeting 2,000 proteins selected for biomarker potential could be completed in two years at a cost of less than $50 million through funding of existing academic and commercial resources in a distributed network. In the following five years, the remaining 18,500 proteins could be targeted for $250 million, making use of anticipated technical improvements, particularly in the strategies for generating suitable high affinity monoclonal antibodies (21) in large numbers at low cost (22).
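The budget figures above imply per-protein development costs that are easy to check with simple arithmetic, and the difference between the two phases quantifies the cost reduction anticipated from improved antibody-generation strategies:

```python
# Implied per-protein costs from the proposed hPDQ budget.
pilot_cost, pilot_proteins = 50e6, 2_000    # pilot phase: <= $50M, 2,000 proteins
full_cost, full_proteins = 250e6, 18_500    # follow-on: $250M, 18,500 proteins

pilot_per_protein = pilot_cost / pilot_proteins  # $25,000 per protein
full_per_protein = full_cost / full_proteins     # ~$13,514 per protein
print(f"pilot: ${pilot_per_protein:,.0f}; full: ${full_per_protein:,.0f}")
```

With two peptide assays per protein as proposed, these figures correspond to roughly $12,500 and $6,750 per individual assay, respectively.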
Although the natural mechanism for providing the hPDQ database (resource 1 above) is through an academic collaboration, perhaps modeled on the successful Global Proteome Machine (23) and PeptideAtlas (24) databases, the other resources would benefit from commercial distribution by experienced providers of instruments and reagents. The required instrument platforms (resource 4 above) serve existing markets, and their further development is unlikely to require additional funding for hPDQ applications. However, business economics does not presently justify the expense of developing well characterized antibodies and peptides for quantitation of proteins that are not already recognized as pivotal in biological research (i.e., precisely those in need of the attention of the research community). Hence a substantial portion of the required funding for such antibody and peptide reagents will be needed from government and philanthropic sources. A significant advantage of such diversified support would be the leverage it would provide in retaining in the public domain the identities of the selected peptides, their assay parameters, and basic measurement protocols.
The value of a general protein measurement capability for research is very substantial, but the proposed effort would not solve several larger issues that must await definition of a broader human proteome program. For example, the hPDQ project does not address the basic process of de novo proteome-wide discovery; the comprehensive exploration of splice forms, post-translational modifications, active fragments of preproteins, or genetic variants (although once known, most of these can be targeted by the methods used here); interactions among proteins or with other molecules; or the spatial arrangement of proteins in organs and tissues. Each of these areas would benefit from the resources proposed in hPDQ but will likely require separate, coordinated large-scale efforts that are likely to identify additional sets of biomarkers. Thus although a complete suite of targeted assays is only a first step toward the complete human proteome, we feel that its fundamental importance for progress in biomarker research and its value as a foundation for protein quantitation justify consideration as an initial step.
In the early days of protein diagnostics, investigators at the Behring Institute discovered many of the well known plasma proteins and made associated specific antibodies and antibody-based quantitative tests available to the research community worldwide, spurring the initial round of plasma biomarker research. The application of monoclonal antibodies sparked additional discoveries through close coupling of protein "discovery" with simple quantitative monoclonal antibody-based assays. This "shortcut" to clinical measurement allowed investigators to publish more than 1,000 papers referring to the ovarian cancer marker CA125 (measured by ELISA) before the sequence of the protein was finally identified in 2001 (25). The broader proteomics technologies (beginning with the two-dimensional electrophoresis technology that formed the basis of the Human Protein Index Project (26), formulated by two of us almost 30 years ago, and extending to modern shotgun-style MS-based approaches) have radically expanded the universe of observable proteins. However, quantitative specific assay capabilities have not kept pace with this expansion, leading to the current gap between biomarker proteomics and clinical biomarker output. It is now time to address this gap and realize the benefits of a clinically accessible human proteome. Effective translation of basic research into tangible medical benefit requires it.