Population Proteomics

This review outlines the concept of population proteomics and its implication in the discovery and validation of cancer-specific protein modulations. Population proteomics is an applied subdiscipline of proteomics engaging in the investigation of human proteins across and within populations to define and better understand protein diversity. Population proteomics focuses on interrogation of specific proteins from a large number of individuals, utilizing top-down, targeted affinity mass spectrometry approaches to probe protein modifications. Deglycosylation, sequence truncations, side-chain residue modifications, and other modifications have been reported for a myriad of proteins, yet little is known about their incidence rate in the general population. Such information can be gathered via population proteomics and would greatly aid the biomarker discovery efforts. Discovery of novel protein modifications is also expected from such large scale population proteomics, expanding the protein knowledge database. In regard to cancer protein biomarkers, their validation via population proteomics-based approaches is advantageous as mass spectrometry detection is used both in the discovery and validation process, which is essential for the detection of those structurally modified protein biomarkers.

In the past 5 years proteomics and its subdiscipline clinical proteomics (1-3) have taken a leading role in the discovery of new and improved cancer biomarkers. In its inception, clinical proteomics was an application of high end MS tools for evaluation of differences between the proteomes of healthy and cancer-bearing individuals. The tools were mainly based on the SELDI technology and utilization of wide specificity surfaces (e.g. hydrophobic or hydrophilic mass spectrometer targets) to bind groups of proteins from biological samples and generate distinct mass spectra containing hundreds of peaks (4,5). These so-called proteomic patterns entered mainstream proteomics with the publication of a Lancet study in 2002 (6), which was quickly followed by a significant number of research studies (7-11) and reviews (12-15) describing proteomic patterns capable of discriminating between disease (e.g. ovarian cancer, prostate cancer, etc.) and healthy cohorts with relatively high sensitivity and specificity. However, critical assessment of those results quickly followed, outlining significant shortcomings and uncertainties in regard to the reproducibility of the findings, identity of the proteins behind the pattern peaks, and validation of the results (16-23). Interlaboratory SELDI experiments performed recently alleviated some of the reproducibility concerns (24,25). Furthermore, current research efforts have led to identification of several proteins behind the discriminating pattern peaks, including serum amyloid A (26), vitamin D-binding protein (27), and apolipoprotein A-II (28), which were identified as potential biomarkers for prostate cancer; haptoglobin (29), apolipoprotein A-I, transthyretin, and inter-α-trypsin inhibitor (30) as biomarkers for ovarian cancer; and complement component C3a as a biomarker for breast cancer (31).
Interestingly, common to all those putative cancer biomarkers is the fact that they are all relatively high abundance plasma proteins. On the other hand, established cancer biomarkers (e.g. prostate-specific antigen) have relatively low plasma concentrations and are never seen in such SELDI analyses, raising some questions about the validity of the findings and the general profiling approach in long term applications. Nevertheless the differential screening approach seems to yield potential biomarkers, which will subsequently require vigorous quantitation.
Validation of the newly discovered biomarkers is the most challenging aspect of clinical proteomics. In a scenario that slowly begins to emerge, such validation efforts would center around targeted analysis of the biomarkers (assayed either individually or in groups for better predictive values) via traditional immunoassay methods such as ELISA and/or protein microarrays (32,33). The immunoassays can readily be utilized in high throughput, large scale, case/control studies to validate the clinical proteomics-discovered protein biomarker(s). However, these approaches cannot detect structurally modified proteins that have been indicated as putative cancer biomarkers (28,30,31). Standard immunoassay approaches utilize detection labels that cannot discriminate between structural protein modifications because the resulting quantitative signal is the sum of signals from all isoforms of a given protein captured by the primary affinity ligand. The same lack of discrimination affects even some of the newer label-free methods of detection such as surface plasmon resonance (34,35). Hence it can be argued that validation of the clinical proteomics-derived data on protein structural modifications can most readily be done utilizing the same method of detection used for their discovery: mass spectrometry.

MASS SPECTROMETRY IN BIOMARKER VALIDATION
Mass spectrometry is the only detection method today that can universally provide information about specific protein structural modifications without a priori knowledge of the modification. MS interrogates the protein mass, which is an intrinsic property of each fully expressed and functional protein. The mass contains information about the gene that encodes the protein and the postexpression processing that the protein undergoes. Any changes in the gene sequence and/or postexpression protein processing will be reflected in the mass of the whole protein.
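As a minimal illustration of how a sequence change is reflected in the intact mass, the sketch below sums standard average residue masses for a short polypeptide and compares it with a variant carrying a single lysine-to-arginine substitution. The six-residue sequences and the function name are hypothetical, chosen only for illustration; they are not taken from the studies cited in this review.

```python
# Average residue masses (Da) of the 20 standard amino acids.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water per chain accounts for the free N and C termini


def average_mass(sequence: str) -> float:
    """Average mass (Da) of an unmodified, linear polypeptide."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in sequence) + WATER


wild_type = "GAKSLE"  # hypothetical sequence, for illustration only
variant = "GARSLE"    # same sequence with a K -> R substitution
shift = average_mass(variant) - average_mass(wild_type)
print(f"mass shift caused by K -> R: {shift:+.2f} Da")
```

Under these assumptions the substitution moves the whole-protein mass by roughly 28 Da, regardless of where in the sequence it occurs, which is why the intact mass can reveal a change without prior knowledge of its location.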
There are two main approaches to characterizing proteins via mass spectrometry: bottom up and top down. In the bottom-up approach, proteins present in biological samples are enzymatically digested, their constituent peptide fragments are detected via MS (after a certain fractionation step), and information about the protein identity and modifications is assembled from the peptide data. Although it is easier to analyze peptides via MS (due to ionization efficiency, sensitivity, and mass accuracy advantages), these bottom-up approaches can significantly increase the complexity of the starting sample by generating thousands of peptides in a protein mixture. Furthermore, approaches such as LC-MS/MS tend to be quite complex and difficult for application across different laboratories with a high level of reproducibility. Throughput also remains a significant bottleneck for their application in clinical proteomics wherein hundreds of samples need to be analyzed. And finally by relying on only a few peptides for protein identification in these bottom-up approaches, a large portion of the protein sequence remains "unassayed," eliciting the question as to what possible modifications in the protein sequence go undetected in these analyses.
The top-down MS approaches to protein analysis alleviate some of these concerns by assaying native (intact) proteins first and doing peptide fingerprinting (if needed) next. Most of the new cancer biomarker discoveries made so far have utilized the SELDI technology (5), which is a modified top-down approach. In its most common application, groups of proteins that share similar characteristics (e.g. hydrophobicity, chelating motifs, etc.) are affinity-retrieved simultaneously, and their masses are recorded in the mass spectra. However, protein identification is difficult due to the simultaneous analyses of tens, if not hundreds, of proteins. Recent trends in proteomics have focused on removing the six most abundant serum proteins constituting ~85% of the protein content by mass (36,37). Nevertheless the remaining 15% of the proteome still represents thousands of proteins with a large dynamic range of concentrations, which are hard to analyze in a single experiment.
Interestingly, most protein biomarkers discovered so far via SELDI have been detected as peptide fragments in the low m/z region (where the MS offers better resolution). Although quantitative modulations of these peptides can be indicative of diseases, important information about the rest of the protein is left out in the analyses. The only way to get complete protein sequence coverage is by deciphering the mass of the intact protein. An affinity-based MS method that first specifically isolates a protein of interest and then interrogates its native molecular mass would be an ideal top-down proteomics-based approach. Such an approach would be similar to ELISA (targeted, selective, high in throughput, reproducible, and sensitive) but would utilize mass spectrometry as a method of detection.

AFFINITY-BASED MS METHODS
Selective affinity capture of proteins in preparation for mass spectrometry was initially demonstrated in the early nineties (38-41). Its utility for delineating protein isoforms has been well documented (42-49). In its simplest form, surface-immobilized ligands are utilized to affinity retrieve a protein of interest from a biological sample after which the protein (with or without the affinity ligand) is introduced in a mass spectrometer. One of the first affinity MS methods was developed in this laboratory and termed mass spectrometric immunoassay (MSIA) (41,50). The approach combines targeted protein affinity extraction with rigorous characterization using MALDI-TOF mass spectrometry (Fig. 1). Protein(s) are extracted from a biological sample with the help of affinity pipettes derivatized with polyclonal antibodies. The proteins are then eluted from the affinity pipettes with a MALDI matrix and are analyzed by MS. Enzymatic digestion, if needed, is performed on the MALDI target itself. Specificity and sensitivity, as in traditional immunoassays, are dictated by the affinity capture reagents: the antibodies. However, a second measure of specificity is incorporated in the resulting mass spectra wherein each protein registers at a specific m/z value. During data analysis, the major signal in the mass spectrum that corresponds to the targeted protein is initially evaluated; it should be within a reasonable range (e.g. error of measurement of <0.05%) from the value of the empirically calculated mass obtained from the sequence of the protein deposited in the Swiss-Prot data bank. Once this mass value is confirmed (or observed to be shifted), the presence of protein modifications is noted by the appearance of other signals in the mass spectra (usually in the vicinity of the native protein peaks) or by mass shifts of the major protein signal.
Modifications can be tentatively assigned by accurate measurement of the observed mass shifts (from the wild-type protein signals and/or in silico calculated mass) and knowledge of the protein sequence and possible modifications. The identity of the modifications is then verified using proteolytic digestion and mass mapping approaches in combination with high performance mass spectrometry.
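The two data analysis steps described above (confirming the major signal against the calculated mass, then tentatively assigning additional signals from their mass shifts) can be sketched roughly as follows. This is an illustrative sketch, not the software used in the cited studies: the function names, the short table of approximate average mass shifts, and the example masses are all assumptions, and the 0.05% tolerance simply reuses the example figure quoted in the text.

```python
# Approximate average mass shifts (Da) for a few modifications named in this
# review. Illustrative only; a real table would be far more complete.
CANDIDATE_SHIFTS = {
    "oxidation": 15.999,
    "phosphorylation": 79.980,
    "sulfonation": 80.06,
    "cysteinylation": 119.14,
}


def relative_error_percent(observed: float, expected: float) -> float:
    """Mass measurement error as a percentage of the expected mass."""
    return abs(observed - expected) / expected * 100.0


def within_tolerance(observed: float, expected: float,
                     tolerance_percent: float = 0.05) -> bool:
    return relative_error_percent(observed, expected) < tolerance_percent


def assign_shift(peak_mass: float, native_mass: float,
                 tolerance_percent: float = 0.05) -> list:
    """Names of candidate modifications whose expected mass matches the peak."""
    return [name for name, delta in CANDIDATE_SHIFTS.items()
            if within_tolerance(peak_mass, native_mass + delta, tolerance_percent)]


native = 13761.0  # hypothetical calculated mass of the targeted protein (Da)
print(within_tolerance(13762.5, native))  # step 1: is the major signal on target?
print(assign_shift(13880.3, native))      # step 2: a single candidate matches
print(assign_shift(13841.0, native))      # step 2: two candidates match
```

In the last call both phosphorylation and sulfonation fall within the tolerance because their mass shifts differ by less than 0.1 Da, which illustrates why such assignments remain tentative until they are verified by proteolytic digestion and mass mapping.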
The simplicity and high throughput capability of MSIA make it an ideal candidate for validation of putative cancer biomarkers discovered through clinical proteomics. However, the approach is equally applicable for biomarker discovery. Proteins of interest can be assayed from healthy and cancer-diagnosed cohorts of samples, searching for posttranslational modifications and their association with disease states (51-53). Similarly, proteins from healthy human subjects can be interrogated to catalogue the extent of certain modifications across the general human populace. In one such study, 25 plasma proteins from a cohort of 96 healthy individuals were investigated via affinity-based mass spectrometric assays (54). Fig. 2 summarizes the results of the study, listing the modifications observed and showing the frequency of each modification found in the 96-sample cohort. A total of 53 protein variants were observed for 18 proteins (seven did not exhibit any readily observable modification). Posttranslational modifications were detected in 18 proteins; the largest number of variants was found to be C- or N-terminal truncated versions of the targeted protein. Deglycosylation, oxidation, and cysteinylation were also observed among several of the proteins. Point mutations were detected for four proteins. Prominently, a high incidence of point mutations was noted for apolipoprotein E and for transthyretin, consistent with genomic studies that have found these proteins to be highly polymorphic (55,56). Overall the occurrence of these variations was wide ranged with some modifications being observed in only one sample and others detected in all 96 samples, suggesting that they must be regarded as wild-type protein forms. Interprotein variations in specific individuals were also observed: all seven individuals observed to have deglycosylated transferrin had deglycosylated antithrombin II as well. In all, the results from this study offer a small glimpse at the extent of protein diversity in the human population and set the stage for a new applied subdiscipline of proteomics we have termed population proteomics.
Fig. 2 (caption fragment). Modifications were not detected for albumin, ceruloplasmin, C-reactive protein, insulin-like growth factor I, lysozyme, plasminogen, and urine protein 1. Footnote 1, the cystatin C point mutation could not be assigned because of the <50% sequence coverage resulting from the peptide mapping. Footnote 2, the SAA1α point mutation was tentatively assigned as a K90R substitution. Reprinted with permission from Nedelkov et al. (54).
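The kind of frequency summary reported for the 96-sample cohort can be produced with a simple tally. The sketch below assumes that each assay result has already been reduced to a (sample, variant) pair; the sample identifiers, variant names, and function name are hypothetical and do not reproduce data from the cited study.

```python
from collections import Counter


def variant_frequencies(observations, cohort_size):
    """Map each variant to the fraction of the cohort in which it was seen.

    observations: iterable of (sample_id, variant_name) pairs.
    """
    # A set ensures a variant is counted at most once per individual,
    # even if that individual was assayed more than once.
    unique_pairs = set(observations)
    counts = Counter(variant for _, variant in unique_pairs)
    return {variant: n / cohort_size for variant, n in counts.items()}


# Hypothetical observations from a four-person cohort.
observations = [
    ("S01", "variant A"),
    ("S02", "variant A"),
    ("S02", "variant A"),  # repeat measurement of the same individual
    ("S02", "variant B"),
]
freqs = variant_frequencies(observations, cohort_size=4)
for variant, fraction in sorted(freqs.items()):
    print(f"{variant}: {fraction:.0%} of cohort")
```

A variant whose fraction approaches 100% would be a candidate wild-type form in the sense used above, whereas one seen in a single individual would call for a larger cohort before its population frequency can be estimated.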

POPULATION PROTEOMICS
Population proteomics is the study of protein diversity in human populations, which can be defined as the investigation of human proteins across and within populations to define and understand protein differences and to facilitate the discovery and validation of disease-specific protein modulations. In broader terms, population proteomics can be compared with population genomics where individuals are interrogated with the aim of cataloguing common genetic variants and determining how they are distributed among people within populations and among populations in different parts of the world. Population genomics today is best exemplified through the International HapMap Project, which is a collaboration among scientists and funding agencies from several countries (57). Although population proteomics cannot (yet) claim such outreach and goals (after all, it is still a concept that has just recently been outlined (58-60)), it certainly has the potential to become an important applied subdiscipline of proteomics. As the enabling tools and approaches become more embraced and practiced, population proteomics stands to provide answers about the extent of protein modifications across and within populations and their association with disease.
There are several attributes that define and distinguish population proteomics. The first one is the use of targeted proteomics-based approaches. Population proteomics does not entail analysis of entire proteomes because it is very likely that, for a specific cell or tissue proteome, there is no definitive set and number of proteins that is common to all within a group or a larger population. The expression, regulation, and dynamics of the proteome are influenced by cell processes, environmental factors, and cell and body cycles, all of which interplay and affect this "fluid" state of the proteome. Hence population proteomics focuses on interrogation of a selected number of proteins but from a large number of individuals. Approaches utilizing mass spectrometric detection offer clear advantages in such studies. The MS methods must be capable of analyzing hundreds, if not thousands, of samples per day with high reproducibility and sensitivity. Hence top-down MS approaches utilizing affinity ligands are the most likely methods of choice in such population proteomics and can encompass MSIA, bead-based MS methods (61-63), and ligand-specific SELDI approaches (64,65).
Both plasma and urine provide good media for such wide range proteome examination as they contain almost every protein (at some point of time and in some shape and form) that the human body and cells produce (including cancer cells). Although the wide dynamic ranges of protein concentrations and their molecular masses in plasma remain big challenges for the wide specificity proteomics-based approaches, these obstacles are minimized with the targeted proteomics-based approaches that will be used in population proteomics. Certainly some protein-specific pretreatment of those samples (e.g. addition of buffers, stabilizers, etc.) might be implemented, but major fractionation steps should be avoided so that the overall process and method remain simple and high in throughput.
In the case of healthy populations, the first candidates for population proteomics are well studied (and generally higher concentration) plasma proteins because of availability of well characterized affinity reagents (antibodies). Then proteins at lower concentrations in human plasma can be progressively addressed. It is important to emphasize that as of today there are virtually no data on the distribution of specific posttranslational modifications across the general populace, even for the most abundant proteins. Deglycosylation, sequence truncations, and side-chain residue modifications (phosphorylation, sulfonation, oxidation, etc.) have been reported for a myriad of proteins, yet to date a concerted effort has not been undertaken to assess the incidence of these structural modifications in individual proteins in the general population. Hence the first aim of population proteomics is to catalogue protein modifications and establish their frequency in the general population. Those modifications that are found to occur at high frequency can be declared wild type, and unless they undergo a quantitative modulation in response to a specific disease, they stand to bear little significance in future biomarker discovery efforts. It is low frequency protein modifications that are most likely to be of greater significance as potential biomarkers of disease. To detect them, thousands of individuals will need to be analyzed.
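A rough calculation shows why cohorts of this size are needed. Assuming individuals are sampled independently and that the assay detects a modification whenever it is present (both simplifying assumptions, as are the function names and example frequencies below), the probability that a cohort of n individuals contains at least one carrier of a modification with population frequency f is 1 - (1 - f)^n. The sketch inverts this relation to estimate the minimum cohort size.

```python
import math


def prob_at_least_one(frequency: float, n: int) -> float:
    """Probability that at least one of n individuals carries the modification."""
    return 1.0 - (1.0 - frequency) ** n


def min_cohort_size(frequency: float, confidence: float = 0.95) -> int:
    """Smallest n for which prob_at_least_one(frequency, n) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - frequency))


# Hypothetical population frequencies, for illustration only.
for f in (0.01, 0.001):
    print(f"frequency {f}: at least {min_cohort_size(f)} individuals")

# Chance that a 96-sample cohort contains a modification carried by 1% of people.
print(round(prob_at_least_one(0.01, 96), 2))
```

Under these assumptions a modification carried by 1 in 1,000 people requires on the order of 3,000 individuals to be seen at least once with 95% probability, and considerably more if its frequency is to be estimated rather than merely detected.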
It is expected that such large scale population proteomics will result in the discovery of a plethora of novel protein structural modifications. As an example, recent applications of the MSIA approach to a relatively small number of human plasma samples resulted in the discovery of new posttranslational modifications for several plasma proteins, including apolipoprotein A-I (66), apolipoprotein A-II (66), C-reactive protein (51), insulin-like growth factor II (49), retinol-binding protein (52,67), serum amyloid A (53), and serum amyloid P (47). Consequently the Swiss-Prot entries for these proteins have been annotated with the new modifications. Expanding the protein knowledge database via discovery of novel modifications is a key element of population proteomics.
Investigations in population proteomics will also be quantitative in nature. Relative and absolute protein quantification via mass spectrometry is possible, although it requires careful experimental design. Furthermore, there is a vast amount of immunoassay-derived data on quantitative modulations of proteins in specific biological fluids such as human plasma. Concentration ranges, in both healthy and diseased states, have been determined for a large number of proteins (68,69); these data represent an invaluable resource for population proteomics.
In regard to cancer and other diseases, there is already a vast amount of data that correlates certain proteins and their quantitative modulations with specific cancers. These proteins would be the first to benefit from qualitative reassessment via population proteomics; next in line are proteins that are within the interaction network and pathways of the biomarker proteins. Structural variants have recently been discovered for some well established protein biomarkers, including prostate-specific antigen (70,71) and cardiac troponin I (72,73). These variants have potential for serving as more specific and sensitive markers of disease and would need to be analyzed via population proteomics. Common serum proteins also need to be studied via population proteomics. Despite their relatively high concentration in plasma, almost all of the proteins listed in Fig. 2 have been associated with certain types of cancer. In support of the clinical data, it can be argued that during early cancer development a number of events and pathways in the human body are triggered that can possibly result in quantitative and qualitative modulations of these high level plasma proteins. Although those processes and pathways may remain unknown for the time being, the end results (the protein modifications) can be detected via targeted proteomics-based approaches and when validated could be used as biomarkers for early detection of cancer. Hence the detection of the structural changes in these proteins can be viewed as highly significant as they might offer better means of cancer diagnosis than current methods of cancer detection, which are plagued by low specificity and/or sensitivity.
In summary, we have entered a transition stage wherein proteomics has evolved from a technology-driven field into an application-driven discipline whose subject is the study of proteins in human beings. Population proteomics, as described here, is an applied subdiscipline of proteomics engaging in the long term study of the human protein diversity with clear implications in the cancer biomarker discovery field. Assessing variations in human proteins among and within populations is a paramount undertaking that can facilitate clinical proteomics in discovery and validation of protein features that can be used as markers for early diagnosis of cancer, monitoring of disease progression, and assessment of therapy.