The Human Plasma Proteome

We have merged four different views of the human plasma proteome, based on different methodologies, into a single nonredundant list of 1,175 distinct gene products. The methodologies used were 1) literature search for proteins reported to occur in plasma or serum; 2) multidimensional chromatography of proteins followed by two-dimensional electrophoresis and mass spectrometry (MS) identification of resolved proteins; 3) tryptic digestion and multidimensional chromatography of peptides followed by MS identification; and 4) tryptic digestion and multidimensional chromatography of peptides from low-molecular-mass plasma components followed by MS identification. Of 1,175 nonredundant gene products, 195 were included in more than one of the four input datasets. Only 46 appeared in all four. Predictions of signal sequence and transmembrane domain occurrence, as well as Gene Ontology annotation assignments, allowed characterization of the nonredundant list and comparison of the data sources. The “nonproteomic” literature (468 input proteins) is strongly biased toward signal sequence-containing extracellular proteins, while the three proteomics methods showed a much higher representation of cellular proteins, including nuclear, cytoplasmic, and kinesin complex proteins. Cytokines and protein hormones were almost completely absent from the proteomics data (presumably due to low abundance), while categories like DNA-binding proteins were almost entirely absent from the literature data (perhaps unexpected and therefore not sought). Most major categories of proteins in the human proteome are represented in plasma, with the distribution at successively deeper layers shifting from mostly extracellular to a distribution more like the whole (primarily cellular) proteome. The resulting nonredundant list confirms the presence of a number of interesting candidate marker proteins in plasma and serum.

The human plasma proteome is likely to contain most, if not all, human proteins, as well as proteins derived from some viruses, bacteria, and fungi. Many of the human proteins, introduced by low-level tissue leakage, ought to be present at very low concentrations (<<pg/ml), while others, such as albumin, are present in very large amounts (>>mg/ml). Numerous post-translationally modified forms of each protein are likely to be present, along with literally millions of distinct clonal immunoglobulin (Ig) sequences. This complexity and enormous dynamic range make plasma the most difficult specimen to be dealt with by proteomics (1).
At the same time, plasma is the most generally informative proteome from a medical viewpoint. Almost all cells in the body communicate with plasma directly or through extracellular or cerebrospinal fluids, and many release at least part of their contents into plasma upon damage or death. Some medical conditions, such as myocardial infarction, are officially defined based on the increase of a specific protein in the plasma (e.g. cardiac troponin-T), and it is difficult to argue convincingly that there is any disease state that does not produce some specific pattern of protein change in the body's working fluid. This immense diagnostic potential has spurred a rapid acceleration in the search for protein disease markers by a wide variety of proteomics strategies.
Current methods of proteomics are only beginning to catalog the contents of plasma. Two-dimensional electrophoresis was able to resolve 40 distinct plasma proteins in 1976 (2), but, because of the dynamic range problem, this number had only grown to 60 in 1992 (3) and is substantially unchanged today, a quarter century later. It is now clear that more than two dimensions of conventional resolution are required to progress beyond this point. Recently, several truly multidimensional survey efforts have been mounted, with the result that the number of distinct proteins detected has increased dramatically. Additional dimensions of separation can be introduced at any of three levels: a) separation of intact proteins, either by specific binding (e.g. subtraction of defined high-abundance proteins) or continuous resolution (e.g. electrophoresis or chromatography); b) separation of peptides derived from plasma proteins, either by specific binding (e.g. capture by anti-peptide antibodies) or continuous resolution (e.g. chromatography); and c) separation of peptides, and particularly their fragments, by mass spectrometry (MS). Many possible combinations of these dimensions can be implemented, the only limitations being the effort, cost, and time of analyzing many fractions or runs instead of one.
In this article, we have compared and combined data from three different multi-dimensional strategies with data from a fourth, classical source (the protein biochemistry and clinical chemistry literature) to provide a meta-level overview of both the contents and the rate of discovery of new components in plasma. The three experimental datasets are derived from 1) whole protein separation by a three-dimensional process (immunosubtraction/ion exchange/size exclusion) followed by two-dimensional electrophoresis (2DE) followed by MS identification of resolved spots (4); 2) Ig subtraction followed by trypsin digestion followed by two-dimensional liquid chromatography (LC) (ion exchange/reversed phase) followed by tandem MS (MS/MS) (5); and 3) molecular mass fractionation, followed by trypsin digestion followed by two-dimensional LC (cation exchange/reversed phase) followed by MS/MS (6). These three experimental approaches have two features in common (the removal of most Igs, by specific subtraction or size, and the use of MS for molecular identification) but otherwise they span the gamut of proteomics discovery approaches: separation at the protein level, separation at the tryptic peptide level, and a hybrid.
Combining experimental data with literature search results on proteins detected in plasma (representing a large body of accumulated "nonproteomics" data) should provide a broad perspective on plasma contents. Because the same proteins detected by various methods can be referred to by different names or accession numbers, we have used a sequence-based approach to eliminate redundancy and cluster all occurrences of the same protein. The resulting list makes it possible to examine the overlap between the various approaches and to see whether they are biased toward particular classes of proteins. In addition, a pooled nonredundant list should provide a relatively unbiased survey of the kinds of proteins present in plasma, which could have important diagnostic implications. Finally, a large list of proteins actually observed in plasma paves the way for top-down, targeted proteomics approaches to the discovery of disease markers: the development of accurate high-throughput specific assays for selected candidates from this list, as a supplement to the use of single methods for marker discovery in small sample sets. In the longer term, proteins with strong, mechanistic disease relationships may be viable therapeutic candidates as well.

Lit: Literature Search
Manual Medline searches were performed searching for titles or abstracts containing human plasma or serum proteins, excluding articles on membranes, stimulation, drug, and dose. A total of 468 entries were collected, of which 458 had a human sequence accession number in one or more of the major databases.

2DEMS: Separation of Serum Proteins (LC³/2-DE) + MS/MS Identification
Intact proteins were fractionated by chromatography and 2DE and identified by MS, generating the dataset described by Pieper et al. (7). Briefly, human blood sera were obtained in equal volumes from two healthy male donors (ages 40 and 80). Albumin, haptoglobin, transferrin, transthyretin, α-1-antitrypsin, α-1-acid glycoprotein, hemopexin, and α-2-macroglobulin were removed by immunoaffinity chromatography. The immunoaffinity-subtracted serum concentrate was fractionated further by sequential anion exchange and size exclusion chromatography. The resulting 66 samples were individually subjected to 2DE. All visible Coomassie Blue R250 spots were cut out, destained, reduced, alkylated, and digested with trypsin. All extracted peptides were analyzed by matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) on a Bruker Biflex or Autoflex mass spectrometer (Bruker, Billerica, MA) and searched against Swiss-Prot. Those samples that did not give positive identification by MALDI-TOF were subjected to LC-MS/MS analysis by ion trap (IT) MS (Thermo Finnigan LCQ, Woburn, MA) and searched against the National Center for Biotechnology Information (NCBI) database using SEQUEST.

LCMS1: Separation of Peptide Digests of Serum Minus Ig (LC²) + MS/MS Identification
A published dataset prepared by Adkins et al. (5) was used. Briefly, human blood serum was obtained from a healthy anonymous female donor. Igs were depleted by affinity adsorption chromatography using protein A/G. The resulting Ig-depleted plasma was digested with trypsin and separated by strong cation exchange on a polysulfoethyl A column followed by reverse-phase separation on a capillary C18 column. The capillary column was interfaced to an IT-MS (Thermo Finnigan LCQ Deca XP) using electrospray ionization. The IT-MS was configured to perform MS/MS scans on the three most intense precursor masses from a single MS scan. All samples were measured over a mass/charge (m/z) range of 400-2,000, with fractions containing high complexity being measured with segmented m/z ranges. Tandem mass spectra were analyzed by SEQUEST as described using the NCBI May 2002 database.

LCMS2: Separation of Peptide Digests of Low-Molecular-Mass Serum Proteins (LC²) + MS/MS Identification
The fourth dataset is that described by Tirumalai et al. (6), focused on the lower-molecular-mass plasma proteome. Briefly, standard human serum was purchased from the National Institute of Standards and Technology. High-molecular-mass proteins were removed in the presence of acetonitrile using Centriplus centrifugal filters with a molecular mass cutoff of 30 kDa. The low-molecular-mass filtrate was reduced, alkylated, and digested with trypsin. The digested sample was fractionated by strong cation exchange chromatography on a polysulfoethyl A column. Reversed-phase LC was subsequently performed on a 300-Å Jupiter C-18 column coupled on line to an IT-MS (Thermo Finnigan LCQ Deca XP). Each full MS scan was followed by three MS/MS scans in which the three most abundant peptide molecular ions were selected. MS/MS spectra were searched against a human protein database using SEQUEST.

Bioinformatics
Sequence Clustering-The Blastp protein comparison algorithm (8,9) was used to query the sequence of each protein identified against a database containing the aggregate sequences of all proteins identified by any method. Sequences sharing greater than 95% identity over an aligned region were grouped into "unique sequence clusters." Sequences were unmasked, and the minimum alignment length considered was 15 aa. This similarity-based approach was sufficient to group identical sequences, sequence fragments, and splice variants. Annotation in the nonredundant table was reported for the "best annotated" protein in the cluster set.
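The grouping step can be illustrated with a minimal Python sketch. It assumes that pairwise Blastp hits have already been reduced to tuples of (query, subject, percent identity, alignment length), that hits are linked transitively (single linkage), and that an alignment of exactly 15 aa qualifies; none of these details is stated above, and the function name and accession labels are hypothetical.

```python
from collections import defaultdict

MIN_IDENTITY = 95.0  # percent; a hit must exceed this value
MIN_ALIGN_LEN = 15   # residues; assumed to be an inclusive minimum


def cluster_sequences(ids, hits):
    """Group sequence ids into clusters by single linkage (union-find).

    hits: iterable of (query_id, subject_id, percent_identity, align_length);
    every id appearing in a hit must also appear in ids.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, length in hits:
        if identity > MIN_IDENTITY and length >= MIN_ALIGN_LEN:
            parent[find(query)] = find(subject)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical accessions: A-B and B-C qualify; C-D fails on identity,
# A-D fails on alignment length.
example_ids = ["A", "B", "C", "D"]
example_hits = [
    ("A", "B", 99.0, 120),
    ("B", "C", 96.5, 15),
    ("C", "D", 94.0, 200),
    ("A", "D", 100.0, 10),
]
example_clusters = cluster_sequences(example_ids, example_hits)
```

Under single linkage, A and C fall into the same cluster even without a direct qualifying hit between them, which is one way a fragment and a splice variant of the same gene product could end up grouped together.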
Signal Peptide Prediction-Signal peptides were predicted using the commercially available SignalP version 2.0 neural net and hidden Markov model (HMM) algorithms (10) and sigmask (11) signal masking program developed as part of Inpharmatica's Biopendium (12) protein annotation database. Each sequence received a score of +1 for a statistically significant positive signal peptide prediction from any of the three algorithms. The scores 0, 1, 2, and 3 for a particular sequence were then converted to qualitative terms "no," "possible signal," "signal," or "signal confident," respectively.
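The conversion from the three predictor outcomes to a qualitative call amounts to a vote count. A minimal sketch follows; the function and argument names are illustrative only, and each argument is assumed to be a boolean already thresholded for statistical significance.

```python
LABELS = ("no", "possible signal", "signal", "signal confident")


def signal_call(neural_net, hmm, sigmask):
    """Return (score, label) from three boolean signal peptide predictions."""
    # One point per predictor that reports a significant signal peptide.
    score = sum(bool(p) for p in (neural_net, hmm, sigmask))
    return score, LABELS[score]
```

For example, a sequence called positive by the neural net and sigmask but not by the HMM would score 2 and be labeled "signal".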
Transmembrane Prediction-Transmembrane (TM) regions were predicted using the commercial version of TMHMM version 2.0 algorithm (13). The total number of TM helices predicted per sequence was reported for each protein sequence. When a predicted TM region overlapped a predicted signal sequence (as it did in 40 cases in H_Plasma_NR_v2), this was interpreted as a signal sequence only.
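One way to express the overlap rule is sketched below. It assumes that predicted regions are given as 1-based inclusive (start, end) residue ranges and that a TM helix overlapping the predicted signal sequence is simply dropped from the helix count; the exact bookkeeping used for H_Plasma_NR_v2 is not described here, so this is an interpretation rather than the original procedure.

```python
def count_tm_helices(tm_regions, signal_region=None):
    """Count predicted TM helices, ignoring any that overlap the signal peptide.

    tm_regions: list of (start, end) residue ranges, 1-based inclusive.
    signal_region: (start, end) of the predicted signal sequence, or None.
    """
    def overlaps(a, b):
        # Two inclusive ranges overlap if each starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    if signal_region is None:
        return len(tm_regions)
    return sum(1 for tm in tm_regions if not overlaps(tm, signal_region))
```

With a predicted signal sequence at residues 1-22 and predicted helices at 5-27 and 150-172, the first helix would be treated as the signal sequence and one TM helix would be reported.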
Structural and Sequence-based Domain Annotation-Sequences were scanned against a library of BioPendium and iPSI-BLAST (9, 11)-like protein profiles constructed from SCOP (14), PFAM (15), PRINTS (16), and PROSITE (17) domain families. Hits to these profiles were reported at a statistical e-value cut-off of 1e-5. This cut-off was chosen to maximize profile coverage and minimize the occurrence of false positives. Sequences were not masked for low complexity or coiled coils prior to profile scanning.
Gene Ontology (GO) Term Annotation-NCBI GI number accessions for the sequences were matched to their SPTR (18) equivalents based on sequences sharing >95% sequence identity over 90% of the query sequence length. GO (19) component, process, and function terms were then extracted from text-based annotation files available for download from the GO database ftp site: ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_human. For graphical reporting, a series of GO terms in each category were extracted by text searching of relevant keywords (indicated by the category names on plots) through all the assigned GO definitions. A GO component summary for the whole human proteome was prepared by applying the same approach to the complete GO human database referred to above.
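The keyword-based grouping of GO terms into plot categories might look like the following sketch. The keyword lists, protein labels, and term names are hypothetical placeholders, not the lists actually used, and each protein is assumed to count at most once per category.

```python
from collections import Counter

# Hypothetical keyword lists; the actual keywords used are not given here.
CATEGORY_KEYWORDS = {
    "extracellular": ("extracellular",),
    "membrane": ("membrane",),
    "nuclear": ("nucleus", "nuclear"),
    "kinesin complex": ("kinesin",),
}


def categorize(term_names):
    """Return the categories whose keywords occur in any assigned GO term name."""
    lowered = [t.lower() for t in term_names]
    return {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(kw in term for term in lowered for kw in keywords)
    }


def summarize(protein_terms):
    """Count proteins per category; a protein may fall into several categories."""
    counts = Counter()
    for terms in protein_terms.values():
        counts.update(categorize(terms))
    return counts


# Toy input: protein label -> GO component term names assigned to it.
example_terms = {
    "P1": ["extracellular space"],
    "P2": ["integral to plasma membrane", "nucleus"],
    "P3": ["kinesin complex"],
    "P4": ["extracellular region", "extracellular space"],
}
example_summary = summarize(example_terms)
```

Because a protein can match several categories, the category counts need not sum to the number of proteins.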
Database Assembly-The nonredundant (NR) plasma database was assembled as a series of tables in a PostgreSQL relational database and queried to derive summary statistics for tables and figures shown here.

Number of Distinct Proteins Detected in Plasma, and the Nature of Nonredundancy
Four sets of accession numbers for proteins occurring in plasma (468 from Lit, 319 from 2DEMS, 607 [reported as 490 nonredundant accessions] from LCMS1, and 341 from LCMS2) were combined to yield 1,735 total initial accessions (Table I). A total of 55 of the input accessions referred to nonhuman sequences, and these were not considered further in the present analysis. A very conservative method of selecting distinct proteins was used in order to avoid counting sequence variants, splice variants, or cleavage products of one gene product as different: any sequences that shared a region larger than 15 aa with greater than 95% sequence identity were assigned to the same cluster and reported as a single entry in the nonredundant set. Fig. 1 shows one result of applying these criteria, in this case resulting in the assignment of 10 initial accessions to a single cluster for haptoglobin, a major plasma protein found in all four initial datasets and whose three separate subunit types are derived from a single translation product. This case also highlights the general observation that not all datasets used the same primary accession database (NCBI GI, Swiss-Prot, or RefSeq as examples). The largest cluster (109 "redundant" entries) is accounted for by Igs, where all the Ig heavy and light chains of all types were clustered together as one entry arbitrarily chosen as S40354 (an Ig chain sequence). Thus 6.2% of the input accessions were Igs, despite the fact that each of the experimental methods included steps to remove these molecules.
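The accession bookkeeping can be checked with a few lines of Python. The Ig counts used for the 6.2% figure are those reported in the Discussion (14, 76, and 6 in the experimental sets plus 11 in Lit); treating them as the basis for that percentage is an assumption of this sketch.

```python
# Input accessions per data source, as reported in the text.
inputs = {"Lit": 468, "2DEMS": 319, "LCMS1": 607, "LCMS2": 341}

total_accessions = sum(inputs.values())               # all initial accessions
nonhuman = 55                                         # excluded from the analysis
human_accessions = total_accessions - nonhuman

# Ig accessions per source (taken from the Discussion).
ig_accessions = {"2DEMS": 14, "LCMS1": 76, "LCMS2": 6, "Lit": 11}
ig_share_pct = 100 * sum(ig_accessions.values()) / total_accessions
```

Here `total_accessions` evaluates to 1,735 and `human_accessions` to 1,680, matching the totals quoted for the combined input.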
This approach is more conservative (fewer distinct proteins reported) than the methods used in some of the input data sources, which accounts for the decrease in each set when intra-set redundancy is removed (1,509 human accessions remain). When inter-set redundancies are removed (making the full list nonredundant by the criteria described above), a total of 1,175 distinct proteins remain. The entire nonredundant set, here abbreviated H_Plasma_NR_v2 (H_Plas-

Protein Coverage by Data Source
Of the 1,175 nonredundant human proteins in H_Plasma_NR_v2, 195 entries, or 17%, were present in more than one dataset (set H_Plasma_195: Fig. 2 and Table II). Only 46 (4%) were found in all four sets of accessions (Total_sources = 4, shown in bold type in Table II). Of these only one (inter-α trypsin inhibitor heavy chain H1) is predicted to have even a single transmembrane domain, and only one (the hemoglobin β chain presumably released from red cell lysis) is predicted not to have a signal sequence. These characteristics (presence of signal sequence and absence of transmembrane domains) are those expected for major plasma proteins secreted by organs such as the liver.
An additional 47 proteins (4%) were found in three of the four datasets (Table II). Of these 47 proteins, only three were found in all three experimental datasets but not the literature dataset. These three proteins were pigment epithelium-derived factor; a nonmuscle myosin heavy chain; and secretory, extracellular matrix protein 1. Pigment epithelium-derived factor and secretory, extracellular matrix protein 1 are both secreted proteins (Swiss-Prot annotations), but upon further searching only pigment epithelium-derived factor has been reported as plasma associated (20). Nonmuscle myosin is an intracellular protein (21) possibly derived from platelets. The remaining 43 proteins seen in three datasets were all documented plasma proteins.
A further 102 proteins (9%) were found in two datasets (Table II). Of these, 43 proteins were found in two experimental datasets but not in the literature dataset. These include a number of proteins that would not typically be thought of as likely plasma components, including a chloride channel and a copper-transporting ATPase (with 10 and 7 predicted transmembrane domains, respectively), an oxygen-regulated protein, three hypothetical proteins, and a group of likely nuclear proteins including mismatch repair protein, mitotic kinesin-like protein 1, and centromere protein F.
The remaining 980 proteins (83% of NR) were found in only one of the four input datasets. Of these, 696 proteins (71%) were found only in the experimental sets, with LCMS1, LCMS2, and Lit having similar large percentages of source-unique proteins (70, 69, and 66%, respectively, versus 50% for 2DEMS).
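The coverage figures in this section are mutually consistent, as a short tally shows. The counts are those given in the text; the variable names are illustrative.

```python
# Number of nonredundant proteins found in exactly k of the four sources.
by_source_count = {4: 46, 3: 47, 2: 102, 1: 980}

nr_total = sum(by_source_count.values())
multi_source = sum(n for k, n in by_source_count.items() if k >= 2)

# Percentages of the nonredundant set, rounded to whole numbers as in the text.
pct_of_nr = {k: round(100 * n / nr_total) for k, n in by_source_count.items()}
pct_multi = round(100 * multi_source / nr_total)

# Share of single-source proteins seen only in the experimental sets.
pct_experimental_only = round(100 * 696 / by_source_count[1])
```

The tally reproduces the 1,175 total, the 195 multi-source entries (17%), and the 4, 4, 9, and 83% shares for four, three, two, and one source, respectively.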

Characterization of the Plasma Proteome Via Annotation Statistics
Predicted Signal Sequences-The signal sequence prediction algorithm used yielded four levels of likelihood: "no" (strong probability of no signal sequence), "possible signal," "signal," and "signal confident" (strong probability of a signal sequence) in order of increasing likelihood of a signal sequence (i.e. the number of algorithms out of the three used that predict a signal sequence). The procedure does not distinguish between the signal sequences of secreted and membrane-bound proteins (including e.g. plasma, Golgi, and mitochondrial membranes), and thus does not directly predict final protein location. Most of the 1,175 H_Plasma_NR_v2 nonredundant sequences (83%) yielded a strong positive or negative prediction (i.e. good agreement between the three prediction algorithms used), with these two results occurring in about a 2:3 ratio overall. Approximately 49% of H_Plasma_NR_v2 had no evidence of a signal sequence, while only about 25% of the H_Plasma_195 lacked such evidence; conversely only 34% of H_Plasma_NR_v2 gave a "signal confident" while 54% of H_Plasma_195 gave this signal. Comparing the four data sources over H_Plasma_195 (Fig. 3A), entries from all four sources were likely to have signal sequences: the Lit set had the highest bias toward "signal confident" proteins (indicated by a ratio of "signal confident" to "no" signal of 5.05), while 2DEMS showed less bias (ratio of 3.44) and LCMS1 and LCMS2 showed a higher representation of "no" signal predictions (ratios of 2.03 and 1.96, respectively). Comparing the four sources across all the 1,175 H_Plasma_NR_v2 proteins (Fig. 3B), these ratios were reduced, reflecting a greater preponderance of "no" signal proteins, but the relative differences between the sources remained, yielding ratios for Lit, 2DEMS, LCMS1, and LCMS2 of 3.85, 0.91, 0.49, and 0.44, respectively.
Predicted Transmembrane Segments-The number of predicted TM segments (0-21) is shown for each of the four datasets (Fig. 4) in comparison with the distribution summarizing The Human Membrane Protein Library (HMPL, cbs.umn.edu/human/info/hs.dat; Fig. 4B, line). The HMPL con-
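The bias measure quoted for each source is the number of "signal confident" calls divided by the number of "no" calls. A minimal sketch of that calculation is given below, using a made-up list of calls rather than the actual per-source data.

```python
from collections import Counter


def confident_to_no_ratio(calls):
    """Ratio of 'signal confident' to 'no' calls in a list of qualitative calls."""
    counts = Counter(calls)
    if counts["no"] == 0:
        raise ValueError("ratio undefined: no sequences were called 'no'")
    return counts["signal confident"] / counts["no"]


# Made-up calls for illustration: 3 confident, 2 no, 2 intermediate.
toy_calls = ["signal confident"] * 3 + ["no"] * 2 + ["signal", "possible signal"]
toy_ratio = confident_to_no_ratio(toy_calls)

# Overall H_Plasma_NR_v2 percentages from the text: 34% confident, 49% no.
overall_ratio = 34 / 49
```

The overall value of 34/49 is close to the 2:3 ratio quoted for the nonredundant set.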

Plasma proteins detected in at least two datasets
The table presents a nonredundant list of 195 proteins found in at least two of the four input data sources, alphabetized by protein description (containing name and synonyms).
Entries found in all four data sources are shown in bold. Lit, 2DEMS, LCMS1, and LCMS2 columns give the number of accessions in each original data set that were assigned to each NR entry. These are summed across the data sources to yield Total_accessions. Total_sources summarizes the number of sources (here ranging from 2 to 4) in which the entry was found.

contained TM segments (Fig. 4A), whereas 18% of 1,175 H_Plasma_NR_v2 proteins had TM segments (Fig. 4B). In both cases, most proteins predicted to have a TM segment had only one. The distribution of multiple TM segments generally reflected the shape obtained for HMPL, including peaks at 7 and 12 TM segments. The Lit and 2DEMS datasets contained few transmembrane proteins, whereas the LCMS methods found more, particularly at higher TM segment numbers. LCMS1 detected more proteins with TM segments than LCMS2, which concentrated on smaller proteins. The relationship between signal sequence and TM segment predictions across H_Plasma_NR_v2 is shown in Table III. Proteins with TM segments make up 11% of "no" signal sequence proteins, and 20% of "signal confident" proteins. The former group (+TM -Sig) shows a much higher representation of multiple TM segments than the latter (+TM +Sig), including a majority of the 7- and 12-TM entries (not all extracellular proteins or domains have a "classical" signal sequence). Sig+ proteins were much more likely (15%) than Sig- proteins (4.5%) to have a single TM segment, typical of many receptors.

GO Component Assignments-Fig. 5 presents a comparison of four different subsets of the human proteome beginning with the whole human proteome and continuing through the H_Plasma_NR_v2 and H_Plasma_195 sets to those 46 seen in all four of our datasets. In this progression, there is a steady increase in the proportion of extracellular proteins and a steady decrease in the proportions of membrane, mitochondrial, and nuclear proteins. Several categories appear at a higher proportion in H_Plasma_NR_v2 than in either of the smaller versions of the plasma proteome or the whole human proteome: the kinesin complex and lysosomal and cytoskeletal proteins. Fig. 6 uses the GO component assignments for H_Plasma_NR_v2 to further compare the four input datasets. The Lit proteins are primarily derived from extracellular (50% of the total), membrane, and cytoplasmic categories. 2DEMS accessions showed fewer extracellular and similar membrane and cytoplasmic entries, with substantial increases in a series of other categories including kinesin complex and nuclear. The two LCMS methods showed similar GO component distributions, displaying a smaller proportion of extracellular and cytoplasmic entries and more nuclear and mitochondrial entries than 2DEMS. None of the methods has a distribution close to that for the whole human proteome (Fig. 5A), as they would if the MS identifications were random.

FIG. 3. Signal sequence predictions for different data sources.
Signal sequences as predicted by three algorithms (SignalP version 2.0 neural net, HMM algorithms, and sigmask signal masking program) in the NR plasma proteome. The four outcomes (no, possible signal, signal, signal confident) correspond to 0, 1, 2, or 3 methods predicting a signal sequence. The Lit dataset is represented in black, LCMS2 is dark gray, LCMS1 is light gray, and 2DEMS is white. A, signal sequence predictions for proteins repeated in multiple datasets (H_Plasma_195). B, signal sequence predictions for nonredundant proteins (H_Plasma_NR_v2). Bar segments should be compared for relative size between sources and outcomes, but total stacked bar height is not relevant because of redundancy between the component sources.

GO Function Assignments-A set of eight summary function categories were used to analyze aspects of function over H_Plasma_NR_v2 and to compare the four input datasets (Fig. 7). Proteins with cytokine and hormone activities, which are generally expected to be present at low abundance, were predominantly found in Lit only, while Lit contained almost none of the proteins with DNA-binding activity. Among the other categories reported, the experimental methods performed similarly except for underrepresentation of receptor and DNA-binding proteins in 2DEMS.
GO Process Assignments-Eight selected summary process categories were also used to analyze dataset differences over H_Plasma_NR_v2. As shown in Fig. 8, a number of major GO process categories were more evenly represented across the four datasets, though the representation of Lit proteins seems increased relative to the other sources, possibly due to more extensive process annotation of these molecules.

DISCUSSION

We have assembled an enlarged list of proteins observed in human plasma by combining three published experimental datasets, generated by different proteomics approaches, with a large set of proteins drawn from individual published reports on serum or plasma. Of the combined total of 1,680 human protein accession numbers, 1,175 were judged to be distinct proteins (defining set H_Plasma_NR_v2) using a very conservative set of criteria (95% sequence identity over 15 or more amino acid subsequence). As shown in Fig. 2, the overlap between the four sets is surprisingly small, with only 46 proteins occurring in all four sets, and only 195 proteins (Table II) occurring in more than one set (i.e. confirmed by identification using at least two approaches). This result suggests the involvement of one or more of several factors in limiting the overlap between different views of the plasma proteome: 1) the methods used may be different enough to expose quite different subsets of proteins (particularly because only a fraction of peptides observed are typically subjected to MS/MS and identified); 2) the samples used, though all human serum or plasma, may be different in the rank order of medium- and low-abundance protein components; or 3) identifications generated by some proteomics approaches could suffer from more or less random errors associated with mistaken MS/MS hits. We believe the first two factors are likely to account for the relatively small overlap, while the third should be a minimal influence for several reasons.
First, the stringency of MS identification criteria used (described in detail in the original publications on each dataset) was reasonably high in each case, and false-positive identifications should represent less than 5-10% of the entries. Second, the distribution of annotation characteristics observed over the 1,175 plasma NR set is quite different from the human proteome as a whole (see dataset results of Fig. 6 versus the whole proteome in Fig. 5A), which suggests that random human hits are not dominating the accessions. Finally, the sets all show fairly similar levels of difference between every pair (Fig. 2), indicating that none appears to be a major outlier. The observed differences between approaches represented here thus suggest that adding additional methods, or even repetition of these methods, would substantially expand the plasma proteome list from the total we obtained.
The set of 195 accessions that occur in at least two of the four datasets represents a confirmed list of targets that should be accessible for routine measurement by multiple proteomics technologies in human plasma or serum: these proteins have been detected by at least two methods in different laboratories. This set actually comprises more than 195 observable protein subunits because of the collapse of multiple forms into a single entry: for example haptoglobin's α and β chains collapse onto one primary gene product (cleaved after synthesis to yield the individual chains), and all Ig chains (κ and λ light chains, and γ, α, μ, ε, and δ heavy chains) are lumped onto one accession because of their sequence similarity. The fact that a total of 96 Ig accessions occurred in the experimental input data (14 in 2DEMS, 76 in LCMS1, and 6 in LCMS2, in addition to the 11 Ig entries in Lit) indicates that a substantial number of heterogeneous Ig sequences remain in plasma/serum samples even after the treatments used in these studies to remove antibodies prior to digestion/fractionation.
These include adiponectin (involved in the control of fat metabolism and insulin sensitivity), atrial natriuretic factor (a potent vasoactive substance synthesized in mammalian atria and thought to play a key role in cardiovascular homeostasis), various cathepsins (D, L, S), centromere protein F (involved in chromosome segregation during mitosis), creatine kinase M chain (an abundant muscle enzyme), glial fibrillary acidic protein (distinguishes astrocytes from other glial cells), psoriasin (S-100 family, highly up-regulated in psoriatic epidermis), interferon-induced viral-resistance protein MxA (confers resistance to influenza virus and vesicular stomatitis virus), melanoma-associated antigen p97 (a proposed cancer marker also expressed in multiple normal tissues), mismatch repair protein MSH2 (involved in postreplication mismatch repair, and whose defective forms are the cause of hereditary nonpolyposis colorectal cancer type 1), oxygen-regulated protein (which plays a pivotal role in cytoprotective cellular mechanisms triggered by oxygen deprivation), peroxisome proliferator-activated receptor binding protein (which plays a role in transcriptional coactivation), prostate-specific antigen (a protease involved in the liquefaction of the seminal coagulum, and one of the few successful cancer diagnostics), selenoprotein P (contains selenocysteines encoded by the opal codon, UGA), signal recognition particle receptor α subunit (an integral membrane protein ensuring, in conjunction with SRP, the correct targeting of the nascent secretory proteins to the endoplasmic reticulum membrane system), squamous cell carcinoma antigen 1 (which may act as a protease inhibitor to modulate the host immune response against tumor cells), and V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog (the receptor for stem cell factor). A number of these proteins have obvious relevance to important disease mechanisms, and thus are of potential diagnostic value.
Cathepsin S, centromere protein F, psoriasin, mismatch repair protein MSH2, oxygen-regulated protein, and signal recognition particle receptor α subunit did not occur in the Lit accession list, but were rather found via detection in two of the experimental datasets.

Two types of protein features (signal sequences and TM domains) were predicted from the H_Plasma_NR_v2 sequences. These two parameters are somewhat related, because transmembrane, as well as secreted, proteins are likely to contain signal sequences. In the 1,175 proteins of H_Plasma_NR_v2, approximately one-third were confidently predicted to contain signal sequences (34% overall, with 32% having 0 or 1 TM segments) as compared with 19% containing signal sequences over the whole human proteome (and only 10% containing signal sequences with 0 or 1 TM segments; R.F., unpublished observation). H_Plasma_NR_v2 is thus substantially (~3:1) enriched in a set of proteins having signal sequences and 0 or 1 TM segments (compared with the genome), which is consistent with the presence of a large number of classical secreted proteins. However, because more than half of H_Plasma_NR_v2 proteins do not contain a signal sequence, the total representation of nonsecreted molecules (presumably cellular constituents) is high.
In the full NR set of 1,175 proteins (H_Plasma_NR_v2), several major groups of proteins occur in patterns that suggest interesting biases between our four data sources. At least 10 transcription factors were observed in the experimental sets (each by only a single method), and none of these were found in the Lit accession set. Similarly, proteins GO-annotated with a DNA-binding function were essentially absent from the Lit set. In contrast, only 4 of 39 cytokines and growth factors included were found in any of the experimental datasets (IL-6, IL-12A, ciliary neurotrophic factor, and FGF-12), while 37 occurred in the Lit set. These results suggest that while the experimental proteomics methods were not sensitive enough to detect most cytokines and hormones, they did detect important classes of proteins not detected in literature reports using targeted assay methods. On a more global level, the distribution of GO_component assignments (Fig. 6) shows substantial differences overall between the set of proteins found in a literature search versus the three experimental proteomics technologies. Predicted features of protein sequence also show major source-related differences. Most striking is the fact that the Lit set was strongly biased toward proteins that were confidently predicted to have signal sequences (ratio of "signal confident" to "no" of 3.85), while 2DEMS showed little preference (0.91) and the LCMS methods showed a moderate (2 to 1) bias toward proteins without signal sequences (ratios of 0.49 and 0.44). The strong bias of the Lit set toward signal sequences is likely due to the greater ease with which these more soluble proteins can be isolated and studied by biochemical techniques. Similarly, the difference between 2DEMS and LCMS may be due to the failure of many less-soluble proteins to focus in the first (isoelectric focusing) dimension of the 2DE procedure (e.g. intact membrane or very large proteins), or else to the presence of numerous isoforms that divide the protein among members of a charge train, and thus decrease the limit of detection (e.g. heavily glycosylated extracellular domains cleaved from membrane proteins). By placing fewer requirements on the behavior of sample proteins prior to digestion, and by providing identifications based on a few soluble peptides, the MS-based techniques provided a significantly less biased, though not necessarily more complete, view of the plasma proteome.
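To make the "signal confident" to "no" ratios easier to compare, the Python sketch below converts each quoted ratio into a fold bias and an implied fraction. It is an illustration only: the two LCMS labels are hypothetical placeholders for the two LCMS datasets, and the fraction assumes that only the "signal confident" and "no" classes are counted, ignoring any intermediate prediction category.

```python
# Ratios of "signal confident" to "no" predictions, as quoted in the text.
# The two LCMS keys are hypothetical labels for the two LCMS datasets.
SIGNAL_TO_NO_RATIO = {
    "Lit": 3.85,
    "2DEMS": 0.91,
    "LCMS (first)": 0.49,
    "LCMS (second)": 0.44,
}


def fold_bias(ratio: float) -> tuple[str, float]:
    """Express a ratio as a fold bias toward or away from signal sequences."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio >= 1:
        return ("toward signal", ratio)
    return ("toward no signal", 1.0 / ratio)


def fraction_with_signal(ratio: float) -> float:
    """Fraction with a confident signal, counting only the two classes.

    Assumption: proteins are either "signal confident" or "no"; any
    intermediate category is left out of the denominator.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio / (1.0 + ratio)


for source, ratio in SIGNAL_TO_NO_RATIO.items():
    direction, fold = fold_bias(ratio)
    share = fraction_with_signal(ratio)
    print(f"{source:14s} {fold:.2f}-fold {direction}; "
          f"{share:.0%} signal among the two classes")
```

Inverting the LCMS ratios gives about 2.0-fold and 2.3-fold, matching the "2 to 1" bias toward proteins without signal sequences described in the text.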
Because the four input protein sets were of similar size, it is clear that the literature, viewed as an historical summary of research on proteins in plasma, shows a bias toward secreted proteins and against investigation of cellular proteins in solution in blood. This effect cannot be due to detection sensitivity alone, because the low-abundance cytokines, generally not detected by the experimental proteomics methods, are accessible via immunoassay and widely reported in the literature. The bias may instead be due to a general skepticism that detectable amounts of many cellular proteins are being released into plasma (absent some major cause of tissue damage), or to a view that cellular protein release, if it occurred, would not be especially informative. The present results, built on experimental studies of multiple groups, demonstrate that many (perhaps all) cellular proteins are present in plasma. The demonstrated utility of cardiac muscle protein markers as serum diagnostics for myocardial infarction provides a persuasive argument that many may have diagnostic use.
The next major challenge thus becomes the systematic exploration of protein abundance and structural modification in relation to disease, normal physiological processes, and treatment effects. In a sense this shift can be seen as analogous to the current evolution in pharmaceutical target selection: the genome has provided a wealth (in practical terms an overabundance) of previously unknown therapeutic targets, creating a major challenge in selecting those that are "druggable" and specifically linked to disease mechanisms. It is a shift, in other words, from protein discovery to target validation. In the context of diagnostics based on proteins in blood, we now have, in H_Plasma_NR_v2 and a growing body of other experimental data, a substantial set of candidate disease markers that can be detected in plasma. While these will continue to be supplemented by discovery techniques (22), the stage is set for systematic efforts to validate disease markers for near-term application in clinical trials, medium-term use in disease detection, staging, and therapy selection, and long-term use in population screening. While some of these proteins have been examined individually as potential markers and found to have low sensitivity or specificity as individual tests, growing evidence indicates that these limitations may be overcome using fingerprints of change across panels of proteins that together better represent patient status. Thus even the present H_Plasma_NR_v2 offers an abundance of candidates deserving of measurement in selected clinical sample sets. Making such measurements accurately and on a large scale presents a series of technical challenges likely to require substantial efforts for much of the next decade. The results of such a "targeted" proteomics effort can transform diagnostics, improve therapy, and lead to substantial and needed improvements in the economics of healthcare.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.