
Interpretation of Shotgun Proteomic Data

      The shotgun proteomic strategy based on digesting proteins into peptides and sequencing them using tandem mass spectrometry and automated database searching has become the method of choice for identifying proteins in most large scale studies. However, the peptide-centric nature of shotgun proteomics complicates the analysis and biological interpretation of the data especially in the case of higher eukaryote organisms. The same peptide sequence can be present in multiple different proteins or protein isoforms. Such shared peptides therefore can lead to ambiguities in determining the identities of sample proteins. In this article we illustrate the difficulties of interpreting shotgun proteomic data and discuss the need for common nomenclature and transparent informatic approaches. We also discuss related issues such as the state of protein sequence databases and their role in shotgun proteomic analysis, interpretation of relative peptide quantification data in the presence of multiple protein isoforms, the integration of proteomic and transcriptional data, and the development of a computational infrastructure for the integration of multiple diverse datasets.
      An explicit goal of proteomics is the identification and quantification of all the proteins expressed in a cell or tissue (
      • Aebersold R.
      • Mann M.
      Mass spectrometry-based proteomics.
      ). Although not yet at the levels of data throughput and automation achieved in other genomic analyses such as DNA sequencing or microarray gene expression analysis, global protein profiling methods are rapidly evolving. This has been possible because of recent improvements in MS instrumentation, protein and peptide separation techniques, computational data analysis tools, and the availability of complete sequence databases for many species. As a result, analysis of complex protein mixtures using shotgun proteomics, a strategy based on the combination of protein digestion and MS/MS-based peptide sequencing (
      • Link A.J.
      • Eng J.
      • Schieltz D.M.
      • Carmack E.
      • Mize G.J.
      • Morris D.R.
      • Garvik B.M.
      • Yates J.R.
      Direct analysis of protein complexes using mass spectrometry.
      ,
      • Gygi S.P.
      • Rist B.
      • Gerber S.A.
      • Turecek F.
      • Gelb M.H.
      • Aebersold R.
      Quantitative analysis of complex protein mixtures using isotope-coded affinity tags.
      ,
      • Washburn M.P.
      • Wolters D.
      • Yates J.R.
      Large-scale analysis of the yeast proteome by multidimensional protein identification technology.
      ), has become widely adopted. The method allows protein identifications and, when combined with stable isotope labeling, quantification of the changes in the protein expression levels for hundreds of proteins in a single experiment (
      • Aebersold R.
      • Mann M.
      Mass spectrometry-based proteomics.
      ).
      Compared with other MS-based proteomic technologies such as intact protein sequencing (
      • Reid G.E.
      • McLuckey S.A.
      “Top down” protein characterization via tandem mass spectrometry.
      ,
      • Meng F.
      • Forbes A.J.
      • Miller L.M.
      • Kelleher N.L.
      Detection and localization of protein modifications by high resolution tandem mass spectrometry.
      ) or 2D
      The abbreviations used are: 2D, two-dimensional; EST, expressed sequence tag; SILAC, stable isotope labeling by amino acids in cell culture; CNBP, cellular nucleic acid-binding protein.
      gel-based protein analysis (
      • Gorg A.
      • Weiss W.
      • Dunn M.J.
      Current two-dimensional electrophoresis technology for proteomics.
      ), shotgun proteomic analysis has achieved a relatively high throughput. This is the result of a combination of several factors. Proteolytic digestion of proteins into shorter peptides simplifies MS/MS sequencing (peptides are easier to fragment in the mass spectrometer than intact proteins), whereas elimination of the 2D gel-based separation at the protein level simplifies sample handling and increases the overall data throughput. At the same time, computational analysis and interpretation of the data become more challenging (
      • Patterson S.D.
      Data analysis—the Achilles heel of proteomics.
      ,
      • Boguski M.S.
      • McIntosh M.W.
      Biomedical informatics for proteomics.
      ,
      • Nesvizhskii A.I.
      • Aebersold R.
      Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS.
      ,
      • Johnson R.S.
      • Davis M.T.
      • Taylor J.A.
      • Patterson S.D.
      Informatics for protein identification by mass spectrometry.
      ,
      • Russell S.A.
      • Old W.
      • Resing K.A.
      • Hunter L.
      Proteomic informatics.
      ,
      • Baldwin M.A.
      Protein identification by mass spectrometry: issues to be considered.
      ). The first and foremost computational challenge is the need to process large volumes of acquired MS/MS data with the purpose of identifying peptides that gave rise to observed spectra. This challenge is now well understood, and a number of computational methods and software tools, including programs for assigning peptides to MS/MS spectra (
      • Eng J.K.
      • McCormack A.L.
      • Yates J.R.
      An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
      ,
      • Mann M.
      • Wilm M.
      Error-tolerant identification of peptides in sequence databases by peptide sequence tags.
      ,
      • Perkins D.N.
      • Pappin D.J.
      • Creasy D.M.
      • Cottrell J.C.
      Probability-based protein identification by searching sequence databases using mass spectrometry data.
      ,
      • Clauser K.R.
      • Baker P.
      • Burlingame A.L.
      Role of accurate mass measurement (±10 ppm) in protein identification strategies employing MS or MS/MS and database searching.
      ,
      • Field H.I.
      • Fenyo D.
      • Beavis R.C.
      RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimizes protein identification, and archives data in a relational database.
      ,
      • Craig R.
      • Beavis R.C.
      TANDEM: matching proteins with tandem mass spectra.
      ,
      • Geer L.Y.
      • Markey S.P.
      • Kowalak J.A.
      • Wagner L.
      • Xu M.
      • Maynard D.M.
      • Yang X.
      • Shi W.
      • Bryant S.H.
      Open mass spectrometry search algorithm.
      ) and for statistical validation of those assignments (
      • Keller A.
      • Nesvizhskii A.I.
      • Kolker E.
      • Aebersold R.
      Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.
      ,
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ,
      • Fenyo D.
      • Beavis R.C.
      A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes.
      ,
      • Peng J.
      • Elias J.E.
      • Thoreen C.C.
      • Licklider L.J.
      • Gygi S.P.
      Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large scale protein analysis: the yeast proteome.
      ,
      • Sadygov R.G.
      • Yates J.R.
      A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases.
      ), have been developed. However, identification of peptides resulting from proteolytic digestion of sample proteins represents only an intermediate step because the ultimate goal of most experiments is to identify (and quantify when appropriate) the proteins that are present in the original sample. Increasingly it has been realized that the protein inference problem, i.e. the task of assembling the sequences of identified peptides to infer the protein content of the sample, is far from being trivial and requires special attention (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ,
      • Rappsilber J.
      • Mann M.
      What does it mean to identify a protein in proteomics?.
      ,
      • Von Haller P.D.
      • Yi E.
      • Donohoe S.
      • Vaughn K.
      • Keller A.
      • Nesvizhskii A.I.
      • Eng J.
      • Li X.J.
      • Goodlett D.R.
      • Aebersold R.
      • Watts J.D.
      The application of new software tools to quantitative protein profiling via ICAT and tandem mass spectrometry: II. Evaluation of tandem mass spectrometry methodologies for large-scale protein analysis and the application of statistical tools for data analysis and interpretation.
      ,
      • Resing K.A.
      • Meyer-Arendt K.
      • Mendoza A.M.
      • Aveline-Wolf L.D.
      • Jonscher K.R.
      • Pierce K.G.
      • Old W.M.
      • Cheung H.T.
      • Russell S.
      • Wattawa J.L.
      • Goehle G.R.
      • Knight R.D.
      • Ahn N.G.
      Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics.
      ,
      • Carr S.
      • Aebersold R.
      • Baldwin M.
      • Burlingame A.
      • Clauser K.
      • Nesvizhskii A.
      The need for guidelines in publication of peptide and protein identification data.
      ,
      • Yang X.
      • Dondeti V.
      • Dezube R.
      • Maynard D.M.
      • Geer L.Y.
      • Epstein J.
      • Chen X.
      • Markey S.P.
      • Kowalak J.A.
      DBParser: web-based software for shotgun proteomic data analyses.
      ).
      The difficulty of assembling peptide identifications back to the protein level results from the same factors that made the shotgun proteomic approach so successful in the first place, i.e. protein digestion at an early stage of the process and elimination of extensive separation at the protein level. Protein digestion makes peptides, and not the proteins, the currency of the method, and the connectivity between peptides and proteins is lost at the digestion stage. This loss of connectivity complicates computational analysis and biological interpretation of the data especially in the case of higher eukaryote organisms. The same peptide sequence can be present in multiple different proteins. Therefore, the identification of such shared
      Also referred to as degenerate peptides (see e.g. Ref.
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ).
      peptides can lead to ambiguities in the determination of the identities of the sample proteins (see Fig. 1). In general, protein identification is a less complex issue
      This is true, however, only under the assumption that only a single protein is present in a spot on the gel, which is not always the case.
      if proteins are first separated using a multidimensional protein separation technique (e.g. 2D gels) where additional information, such as the protein molecular weight and isoelectric point, can assist in determination of the protein identities (see e.g. Refs.
      • Gorg A.
      • Weiss W.
      • Dunn M.J.
      Current two-dimensional electrophoresis technology for proteomics.
      and
      • Pedersen S.K.
      • Harry J.L.
      • Sebastian L.
      • Baker J.
      • Traini M.D.
      • McCarthy J.T.
      • Manoharan A.
      • Wilkins M.R.
      • Gooley A.A.
      • Righetti P.G.
      • Packer N.H.
      • Williams K.L.
      • Herbert B.R.
      Unseen proteome: mining below the tip of the iceberg to find low abundance and membrane proteins.
      ,
      • Fung K.Y.
      • Glode L.M.
      • Green S.
      • Duncan M.W.
      A comprehensive characterization of the peptide and protein constituents of human seminal fluid.
      ,
      • Godovac-Zimmermann J.
      • Kleiner O.
      • Brown L.L.
      • Drukier A.L.
      Perspectives in splicing up proteomics with splicing.
      ).
      Fig. 1. Protein identification using MS/MS. On the left is a depiction of the process in a typical 2D gel-based approach: the proteins are separated, visualized, and excised from the gel. On the right is the process involved for shotgun proteomics. In both approaches, sample proteins are proteolytically digested into peptides, and the resulting peptide mixtures are separated using liquid chromatography. Peptides are ionized, and selected peptide ions are subjected to MS/MS sequencing. Peptide sequences are determined from MS/MS spectra using a database search approach. Any given peptide may be a part of the sequences of several different proteins. The protein inference problem involves figuring out which proteins are present in the sample given the sequences of identified peptides. In this example, the sample contains two proteins, A and B, which share extensive sequence homology. All three identified peptides, AEMK, GAGGLR, and HYFEDR, are present in the sequence of protein B, and the last two peptides are also in the sequence of protein A. In the shotgun proteomic approach, the connectivity between peptides and proteins is lost; no information on the number of proteins in the sample or their properties (e.g. molecular weight) is available. It is not possible to conclude that protein A is present in the sample because protein B can account for all observed peptides. This is less of a problem in the case of the 2D gel-based approach where proteins are separated prior to digestion and MS/MS analysis.
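      The situation described in Fig. 1 can be reproduced with a short sketch. The two protein sequences below are hypothetical toy sequences built around the three peptides named in the caption (the actual sequences of proteins A and B are not given), and the digestion rule is a simplification that cleaves after every Lys or Arg and ignores missed cleavages and the proline exception.

```python
def tryptic_digest(sequence):
    """Cleave after every K or R (simplified trypsin rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


# Hypothetical toy sequences; only the three peptides come from Fig. 1.
proteins = {
    "A": "TTVKGAGGLRHYFEDR",
    "B": "AEMKGAGGLRHYFEDR",
}
identified = {"AEMK", "GAGGLR", "HYFEDR"}

# Map every identified peptide back to all proteins that contain it.
peptide_to_proteins = {}
for name, sequence in proteins.items():
    for peptide in tryptic_digest(sequence):
        if peptide in identified:
            peptide_to_proteins.setdefault(peptide, set()).add(name)

# Proteins that can account for every identified peptide on their own.
explains_all = [
    name for name in sorted(proteins)
    if identified <= set(tryptic_digest(proteins[name]))
]

for peptide in sorted(peptide_to_proteins):
    print(peptide, sorted(peptide_to_proteins[peptide]))
print("explains all identified peptides:", explains_all)
```

      In this toy case only AEMK maps to a single protein, and protein B alone accounts for all identified peptides, so nothing in the peptide list requires protein A to be present.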
      In this article we illustrate the protein inference problem of shotgun proteomics using a set of examples and discuss the need for common nomenclature and transparent informatic approaches for assembling peptides into proteins and presenting the results of shotgun proteomic experiments to the user. We also discuss related issues such as the state of protein sequence databases and their role in shotgun proteomic analysis, interpretation of quantitative proteomic data in the presence of multiple protein isoforms, correlation of proteomic and transcriptional data, and comparison and integration of shotgun proteomic data generated in different experiments.

      THE PROTEIN INFERENCE PROBLEM: CASE STUDIES

      Although the shotgun proteomic approach is peptide-centric, in most cases researchers are ultimately interested in knowing what proteins are present in the analyzed sample. Inferring protein identities given a set of identified peptides becomes difficult in the case of higher eukaryote organisms. This is due to sequence redundancy, i.e. the presence of distinct proteins having a high degree of sequence homology, as is the case in protein families, alternative splice forms of the same gene, differentially processed proteins, and more (
      • Rappsilber J.
      • Mann M.
      What does it mean to identify a protein in proteomics?.
      ,
      • Black D.L.
      Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology.
      ).
      Although the identification of a single peptide is often sufficient to conclude that a product of a certain gene is present in the sample, it is often not possible to discriminate between different proteins that share extensive homology or are isoforms arising from alternatively spliced genes as illustrated in Fig. 2. All examples used in this work, except where noted, were found in the dataset from an experiment on lipid raft plasma membrane domains from human Jurkat T cells (
      • Von Haller P.D.
      • Yi E.
      • Donohoe S.
      • Vaughn K.
      • Keller A.
      • Nesvizhskii A.I.
      • Eng J.
      • Li X.J.
      • Goodlett D.R.
      • Aebersold R.
      • Watts J.D.
      The application of new software tools to quantitative protein profiling via ICAT and tandem mass spectrometry: II. Evaluation of tandem mass spectrometry methodologies for large-scale protein analysis and the application of statistical tools for data analysis and interpretation.
      ).
      Peptide and protein identification part of the analysis presented in Ref.
      • Von Haller P.D.
      • Yi E.
      • Donohoe S.
      • Vaughn K.
      • Keller A.
      • Nesvizhskii A.I.
      • Eng J.
      • Li X.J.
      • Goodlett D.R.
      • Aebersold R.
      • Watts J.D.
      The application of new software tools to quantitative protein profiling via ICAT and tandem mass spectrometry: II. Evaluation of tandem mass spectrometry methodologies for large-scale protein analysis and the application of statistical tools for data analysis and interpretation.
      was repeated in this work using the Human IPI version 2.35 protein sequence database. Quantitative information (not discussed in the original publication) was extracted from the data using an automated tool, ASAPRatio (
      • Li X.J.
      • Zhang H.
      • Ranish J.A.
      • Aebersold R.
      Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry.
      ), and then confirmed by manual inspection.
      In the first example, Fig. 2A, several peptides were identified that are present in two different splice forms of the F-actin capping protein β subunit, CAPB_HUMAN, P47756-1 and P47756-2. Because there are no peptides identified in the experiment that would correspond to one of the isoforms only, both isoforms are equally likely; any one of them, or both, could be present in the sample. In this particular case, the sequences of the two isoforms differ significantly at the C terminus. Thus, discrimination between these two isoforms using sequence information alone would be possible only with the identification of peptides spanning the areas where the sequences diverge.
      Fig. 2. Sequences of identified peptides often do not allow discrimination between different protein isoforms. A, multiple peptides are identified that are present in two different splice forms, P47756-1 and P47756-2, of the F-actin capping protein β subunit. The alignment of the sequences of the two isoforms is shown, and the sequences of the identified peptides are shown in bold. The isoforms are indistinguishable given the available data. The discrimination would be possible if peptides spanning the areas where the sequences diverge, e.g. SIDAIPDNQK (unique to P47756-1) and SVQTFADK (unique to P47756-2), were identified. B, protein isoforms of epithelial protein lost in neoplasm. Three isoforms result from splicing of several consecutive exons located at the 5′-end in the gene sequence. Given the sequences of the identified peptides (shown in bold), it is not possible to determine precisely which isoform is present. The sequences of the two shorter isoforms (Q9UHB6-2 and Q9UHB6-3) are included in the sequence of the longer isoform (Q9UHB6-1). Identification of a peptide from the region present in the isoform β only (sequence shown in a box) would allow conclusive identification of this isoform (no such peptides were actually observed in the experiment). Conclusive identification of the shorter isoforms would be difficult because they do not contain any unique sequence.
      Conclusive identification of alternative splice forms arising from skipping of one or more consecutive exons at the 5′- or 3′-end of the gene sequence (without introduction of any divergent sequence) is more challenging. One such example is shown in Fig. 2B. The sequences of the two shorter isoforms of the epithelial protein lost in neoplasm, EPLI_HUMAN (Q9UHB6-2 and Q9UHB6-3, isoforms α and 3, respectively) are included in the sequence of the longer isoform (Q9UHB6-1, isoform β). No conclusive evidence for the presence of the shorter isoforms in the sample can be obtained without any additional information, e.g. molecular weight of the sample proteins.
      It should be noted that the absence of evidence should not be interpreted as the evidence of absence of the protein in the sample.
      Furthermore in some cases it is not possible to discriminate between proteins that are products of different genes from the same gene family (gene paralogues) (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ,
      • Rappsilber J.
      • Mann M.
      What does it mean to identify a protein in proteomics?.
      ,
      • Delalande F.
      • Carapito C.
      • Brizard J.P.
      • Brigidou C.
      • Dorsselaer A.V.
      Multigenic families and proteomics: Extended protein characterization as a tool for paralog gene identification.
      ,
      • Sam-Yellowe T.Y.
      • Florens L.
      • Johnson J.R.
      • Wang T.
      • Drazba J.A.
      • Le Roch K.G.
      • Zhou Y.
      • Batalov S.
      • Carucci D.J.
      • Winzeler E.A.
      • Yates J.R.
      A Plasmodium gene family encoding Maurer’s cleft membrane proteins: structural properties and expression profiling.
      ). This is illustrated in Fig. 3. A total of 11 peptides were identified in the dataset that are shared between more than a dozen different members of the α-tubulin family. None of the identified peptides is unique to any of the proteins. Thus, although those peptides clearly indicate the presence of one or more α-tubulin proteins, it is not possible to determine which particular member(s) of that family is present in the sample.
      Fig. 3. An example of a protein family. Eleven tryptic peptides are identified that are shared between the members of the α-tubulin family. None of the proteins is identified by a peptide that is unique to it, thus making it impossible to determine which particular member(s) of the family is present in the sample.
      Interpretation of the data is further complicated due to artificial redundancies, e.g. truncated sequences, sequence alternatives arising from sequencing errors, and existence of essentially the same sequences under different gene names, features that exist in many protein sequence databases. This is illustrated in Fig. 4 where a group of peptides was identified that are shared between four different entries in the Entrez Protein sequence database maintained by the National Center for Biotechnology Information (NCBI).
      For the purpose of this discussion, the dataset of MS/MS spectra from Ref.
      • Von Haller P.D.
      • Yi E.
      • Donohoe S.
      • Vaughn K.
      • Keller A.
      • Nesvizhskii A.I.
      • Eng J.
      • Li X.J.
      • Goodlett D.R.
      • Aebersold R.
      • Watts J.D.
      The application of new software tools to quantitative protein profiling via ICAT and tandem mass spectrometry: II. Evaluation of tandem mass spectrometry methodologies for large-scale protein analysis and the application of statistical tools for data analysis and interpretation.
      was also searched against the Entrez Protein database (downloaded in November 2004).
      Manual examination revealed that all four database entries represent the same protein, heat shock 70-kDa protein 9B (HSPA9B). Three of those entries are derived from mRNA sequences containing small sequence variations. In this particular case, the sequence variations are likely due to sequencing errors. However, these variations could also be polymorphisms, i.e. real sequence variants of the same protein from a different individual. In many cases, such database redundancies can only be resolved on a case by case basis by researchers analyzing the data. This presents an additional challenge for the development of automated informatic tools for dealing with large scale proteomic datasets.
      Fig. 4. Sequence database redundancies complicate the analysis of shotgun proteomic data. Four separate entries in the Entrez Protein database represent the same protein, heat shock 70-kDa protein 9B. Three of them (entries 2–4) are derived from mRNA sequences containing small sequence variations.
      Another problem encountered in shotgun proteomics is the difficulty of assigning the correct peptide sequence to an MS/MS spectrum. Two of the amino acids (Ile/Leu) have identical masses, and the difference between several other amino acid combinations (e.g. Asp/Asn and Glu/Gln/Lys) cannot be resolved using low mass accuracy instruments such as the commonly used ion traps. If the database contains several peptides with a similar molecular weight and having a high degree of sequence homology, determination of the correct peptide sequence among the alternatives becomes difficult or even impossible (in the case of Ile/Leu substitutions). This can result in the assignment of incorrect (although homologous) peptide sequences to MS/MS spectra, which in turn can result in incorrect protein identifications. Some but not all ambiguities can be resolved when using high mass accuracy instruments such as LTQ-FT or even Q-TOF.
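      The mass argument can be made concrete with a small calculation. The sketch below uses standard monoisotopic residue masses for a handful of amino acids; the peptide variants are hypothetical and chosen only to isolate a single substitution. It illustrates the Ile/Leu and Gln/Lys cases and does not model any particular instrument.

```python
# Monoisotopic residue masses in Da (subset sufficient for the examples).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "L": 113.08406, "I": 113.08406,
    "Q": 128.05858, "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def ppm_difference(peptide_a, peptide_b):
    """Absolute mass difference relative to peptide_a, in parts per million."""
    mass_a, mass_b = peptide_mass(peptide_a), peptide_mass(peptide_b)
    return abs(mass_a - mass_b) / mass_a * 1e6


def collapse_ile_leu(peptide):
    """Treat Ile and Leu as one symbol, since mass alone cannot separate them."""
    return peptide.replace("I", "L")


# Ile/Leu: identical mass, so no mass accuracy can resolve the two variants.
print(ppm_difference("GAGGLR", "GAGGIR"))

# Gln/Lys (hypothetical variants): a difference of about 0.036 Da, visible
# only when the mass measurement is accurate to a few tens of ppm or better.
print(peptide_mass("GAKGLR") - peptide_mass("GAQGLR"))
print(ppm_difference("GAQGLR", "GAKGLR"))
```

      Collapsing Ile and Leu before comparing peptide sequences, as in `collapse_ile_leu`, is one simple way to avoid counting such indistinguishable alternatives as separate identifications.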
      Distinguishing between different proteins having a high degree of sequence similarity becomes increasingly difficult with decreasing protein sequence coverage, i.e. the fraction of the protein sequence covered by the identified peptides. The protein coverage observed in shotgun proteomic experiments is typically low. The number of “identifiable” peptides is obviously limited by the size of the protein but also by such factors as the enzymatic digestion constraint and the detection mass range of the mass spectrometer. In some cases, the pool of potential peptide identifications is further reduced as a result of selective enrichment for a particular class of peptides, e.g. cysteine-containing peptides in quantitative proteomic experiments based on ICAT reagents (
      • Gygi S.P.
      • Rist B.
      • Gerber S.A.
      • Turecek F.
      • Gelb M.H.
      • Aebersold R.
      Quantitative analysis of complex protein mixtures using isotope-coded affinity tags.
      ). The identification of some peptides can be prevented by unexpected post-translational modifications. Furthermore because multiple different peptide ions are injected in the mass spectrometer (operated in data-dependent ion selection mode) at any given time, low intensity ions, produced by low abundance or poorly ionizing peptides, are less likely to be selected for MS/MS sequencing (
      • Aebersold R.
      • Mann M.
      Mass spectrometry-based proteomics.
      ). Finally some peptides, due to their physical-chemical properties, cannot be efficiently ionized or fragment in an atypical way producing MS/MS spectra unidentifiable by the current database search tools. As a result, more than 30% of all proteins that are detected in a typical shotgun proteomic experiment, including many low molecular weight or low abundance proteins, are identified by a single peptide.
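      Sequence coverage as defined above is straightforward to compute once peptides are mapped back onto a protein sequence. The following is a minimal sketch; the protein sequence is a hypothetical toy example, and peptides are located by exact substring matching.

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:
            # Mark every residue spanned by this occurrence of the peptide.
            for position in range(start, start + len(peptide)):
                covered[position] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


toy_protein = "AEMKGAGGLRHYFEDR"  # hypothetical 16-residue sequence

# A protein identified by a single peptide has low coverage.
print(sequence_coverage(toy_protein, {"GAGGLR"}))

# Coverage grows as more of its peptides are identified.
print(sequence_coverage(toy_protein, {"AEMK", "GAGGLR", "HYFEDR"}))
```

      Overlapping peptides are counted once per residue, so the value never exceeds 1.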

      ASSEMBLING PEPTIDES INTO PROTEINS

      Results of large scale proteomic experiments are often presented as lists of protein identifications. At present, significant inconsistencies exist in the way different research groups assign peptides to proteins and deal with biological and database redundancies. The criteria for calling a protein “identified” are not always described, and there is no generally accepted way to do it. Shared peptides (peptides present in more than one sequence database entry) are sometimes assigned to a particular protein among several possibilities in a random fashion. Different sequence database entries could be counted as separate protein identifications when in fact all of them share the same set of peptides and, therefore, are indistinguishable. In most cases not only do these redundancies inflate the total number of proteins reported as identified, but they can also lead to incorrect biological interpretation of the data. The problem is further complicated when no statistical analysis is performed to determine the validity of peptide and protein identifications (
      • Nesvizhskii A.I.
      • Aebersold R.
      Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS.
      ,
      • Carr S.
      • Aebersold R.
      • Baldwin M.
      • Burlingame A.
      • Clauser K.
      • Nesvizhskii A.
      The need for guidelines in publication of peptide and protein identification data.
      ). Thus, there is a need to develop a common nomenclature and a set of guidelines for assigning peptides to proteins and for interpreting resulting protein identification datasets.
      The nomenclature described below provides a consistent way for presenting the results of large scale proteomic experiments. In creating a protein summary list that accurately represents the data, various peptide grouping scenarios have to be considered that are schematically illustrated in Fig. 5 (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ,
      • Yang X.
      • Dondeti V.
      • Dezube R.
      • Maynard D.M.
      • Geer L.Y.
      • Epstein J.
      • Chen X.
      • Markey S.P.
      • Kowalak J.A.
      DBParser: web-based software for shotgun proteomic data analyses.
      ). The diagram in Fig. 5a describes a case of two distinct proteins, A and B, each identified by distinct
      Also referred to as discrete peptides (see Ref.
      • Yang X.
      • Dondeti V.
      • Dezube R.
      • Maynard D.M.
      • Geer L.Y.
      • Epstein J.
      • Chen X.
      • Markey S.P.
      • Kowalak J.A.
      DBParser: web-based software for shotgun proteomic data analyses.
      ).
      peptides only, i.e. peptides corresponding to that one protein and no other proteins (peptides 1 and 2 are unique to protein A, and peptides 3 and 4 are unique to protein B). Fig. 5b shows a case of two differentiable proteins, which are identified by at least one distinct peptide (peptide 1 is unique to A, and peptide 4 is unique to protein B) but also by one or more shared peptides (peptides 2 and 3 are shared between the two proteins). A different scenario is shown in Fig. 5c where all peptides are shared between proteins A and B. These two proteins are indistinguishable given the sequences of the identified peptides, and either protein A, protein B, or both can be present in the sample. Fig. 5, d and e, each show a situation where all identified peptides corresponding to protein B are shared and can be accounted for by another protein (protein A in Fig. 5d) or a combination of several other proteins (proteins A and C in Fig. 5e) certain to be in the sample because they are identified by at least one distinct peptide. In general, no conclusion can be made regarding the presence of a subset (protein B in Fig. 5d) or a subsumable (protein B in Fig. 5e) protein in the sample. A special case is shown in Fig. 5f where all identified peptides are shared by a group of proteins. The presence of protein A in the sample is sufficient to explain all observed peptides (B and C are subset protein identifications). Although protein A is the most likely candidate, its presence in the sample is not required to explain the data; it is identified by shared peptides only. In the absence of protein A, a combination of proteins B and C would account for all four peptides. Such situations are often observed in the case of extended protein families, such as the tubulin example shown in Fig. 3. The examples discussed above are basic, i.e. it should be possible to explain more complicated cases observed in real datasets by reducing them to a combination of several basic grouping scenarios.
      Fig. 5. Basic peptide grouping scenarios. a, distinct protein identifications. b, differentiable protein identifications. c, indistinguishable protein identifications. d, subset protein identification. e, subsumable protein identification. f, an example of a protein group where one protein can explain all observed peptides, but its identification is not conclusive.
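      The grouping scenarios of Fig. 5 can be expressed as set comparisons between the peptides identified for each protein. The Python sketch below is illustrative only: the function name `classify_proteins` and the input format (a dictionary mapping each protein name to a non-empty set of identified peptides) are assumptions made for this example and are not taken from any of the tools discussed in this article.

```python
from collections import defaultdict


def classify_proteins(protein_peptides):
    """Label each protein by how its identified peptides are shared.

    protein_peptides: dict mapping a protein name to a non-empty set of
    identified peptide sequences (hypothetical input format).
    """
    owners = defaultdict(set)  # peptide -> proteins that contain it
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            owners[peptide].add(protein)

    labels = {}
    for protein, peptides in protein_peptides.items():
        unique = {p for p in peptides if len(owners[p]) == 1}
        others = [s for name, s in protein_peptides.items() if name != protein]
        if unique == peptides:
            labels[protein] = "distinct"           # no shared peptides
        elif unique:
            labels[protein] = "differentiable"     # unique and shared peptides
        elif any(peptides == s for s in others):
            labels[protein] = "indistinguishable"  # same peptide set as another entry
        elif any(peptides < s for s in others):
            labels[protein] = "subset"             # contained in one other entry
        else:
            labels[protein] = "subsumable"         # covered only by several entries
    return labels


# Toy version of the scenario in Fig. 5e.
fig5e = {"A": {"p1", "p2"}, "B": {"p2", "p3"}, "C": {"p3", "p4"}}
print(classify_proteins(fig5e))
```

      In this simplified sketch a protein such as protein A in Fig. 5f, which spans a whole group but has no unique peptide, falls into the last category because its peptides are covered only by a combination of other entries; handling it as a group, as recommended in the text, would require an additional grouping step.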
      The nomenclature described here, coupled with the Occam’s razor constraint (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ), would provide a minimal list of proteins sufficient to explain all observed peptides. Such a minimal list would contain all distinct and differentiable proteins, e.g. proteins A and B in Fig. 5, a and b, and proteins A and C in Fig. 5e but no subsumable or subset proteins, e.g. only protein A would be included in the list in the cases shown in Fig. 5, d and f. In the case of indistinguishable protein identifications, Fig. 5c, it would be most accurate to collapse all such identifications into a single entry in the protein summary report as there is often no basis to eliminate any of them.
      Presenting results of large scale shotgun experiments in terms of such minimal lists of protein identifications has several advantages. It significantly simplifies the interpretation of the data by allowing the user to focus on proteins that are conclusively determined to be present in the sample. It also allows calculation of a consistent measure for the number of proteins identified in the experiment as the smallest number of proteins that can explain all observed peptides (i.e. the number of entries in the minimal protein list).
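      One way to derive such a minimal list is to collapse entries with identical peptide sets and then repeatedly pick the entry that explains the most still-unexplained peptides. The sketch below uses this greedy heuristic as an assumption; it is not the algorithm of any specific tool, and a greedy choice does not guarantee the smallest possible list in every case. The function name `minimal_protein_list` and the input format are hypothetical.

```python
def minimal_protein_list(protein_peptides):
    """Return a short list of protein entries that explains all peptides.

    protein_peptides: dict mapping a protein name to a set of identified
    peptides (hypothetical input format). Each returned entry is a tuple
    of names; indistinguishable proteins share one entry.
    """
    # Collapse proteins with identical peptide sets into a single entry.
    groups = {}
    for protein in sorted(protein_peptides):
        key = frozenset(protein_peptides[protein])
        groups.setdefault(key, []).append(protein)

    # Sort candidates by name so that ties are resolved reproducibly.
    candidates = sorted(groups.items(), key=lambda item: item[1])
    uncovered = set().union(*groups)  # all observed peptides
    chosen = []
    while uncovered:
        # Greedy step: take the entry explaining the most uncovered peptides.
        peptides, names = max(candidates, key=lambda item: len(item[0] & uncovered))
        chosen.append(tuple(names))
        uncovered -= peptides
    return chosen


# Toy version of Fig. 5e: protein B is subsumable and is left out.
print(minimal_protein_list({"A": {"p1", "p2"}, "B": {"p2", "p3"}, "C": {"p3", "p4"}}))
```

      The length of the returned list corresponds to the protein count discussed above, with each collapsed group of indistinguishable proteins counted once.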
      At the same time, presenting only the minimal list of proteins has limitations. For example, a researcher interested in a particular gene might want to observe all related protein isoforms annotated in the protein sequence database that are implicated by at least one peptide identified in the experiment. Moreover the strict implementation of the Occam’s razor approach can be misleading when applied to complex protein families. In the α-tubulin protein family example shown in Fig. 3, none of the identified peptides are unique to the tubulin α-1 protein. Thus, although this protein can explain all observed peptides, its identification is not conclusive. In fact, in the absence of the α-1 tubulin, all peptides can be accounted for by a combination of several other tubulins, e.g. α-3 and α-6. Because it is not possible to determine which particular member(s) of that family is present in the sample, in creating a minimal list it is more accurate and informative to present all members together as a group (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ). Therefore, the most advantageous presentation would include the following: (a) a minimal list with indistinguishable proteins collapsed into a single entry (but showing all protein names) and with all members of protein groups listed and (b) means to observe the proteins implicated by at least one peptide that cannot be called conclusively identified. A simplified illustration of such a format of presentation is shown in Fig. 6.
      Fig. 6. A simplified example of a protein summary list. Peptides are apportioned among all their corresponding proteins, and the minimal list of proteins is derived that can explain all observed peptides. Proteins that are impossible to differentiate on the basis of identified peptides are collapsed into a single entry (F and G) or presented as a group (H, I, and J). Shared peptides are marked with an asterisk. Proteins that cannot be conclusively identified are shown at the end of the list but do not contribute toward the protein count.

      COMPUTATIONAL TOOLS

      A number of computational tools for assembling peptides into proteins in large scale shotgun proteomic experiments have been described (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ,
      • Resing K.A.
      • Meyer-Arendt K.
      • Mendoza A.M.
      • Aveline-Wolf L.D.
      • Jonscher K.R.
      • Pierce K.G.
      • Old W.M.
      • Cheung H.T.
      • Russell S.
      • Wattawa J.L.
      • Goehle G.R.
      • Knight R.D.
      • Ahn N.G.
      Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics.
      ,
      • Yang X.
      • Dondeti V.
      • Dezube R.
      • Maynard D.M.
      • Geer L.Y.
      • Epstein J.
      • Chen X.
      • Markey S.P.
      • Kowalak J.A.
      DBParser: web-based software for shotgun proteomic data analyses.
      ,
      • Kislinger T.
      • Rahman K.
      • Radulovic D.
      • Cox B.
      • Rossant J.
      • Emili A.
      PRISM, a generic large scale proteomic investigation strategy for mammals.
      ,
      • Kristensen D.B.
      • Brond J.C.
      • Nielsen P.A.
      • Andersen J.R.
      • Sorensen O.T.
      • Jorgensen V.
      • Budin K.
      • Matthiesen J.
      • Veno P.
      • Jespersen H.M.
      • Ahrens C.H.
      • Schandorff S.
      • Ruhoff P.T.
      • Wisniewski J.R.
      • Bennett K.L.
      • Podtelejnikov A.V.
      Experimental Peptide Identification Repository (EPIR): an integrated peptide-centric platform for validation and mining of tandem mass spectrometry data.
      ). In general, the process of peptide assembly consists of the following steps. First, peptide assignments obtained by searching acquired MS/MS spectra against a protein sequence database using algorithms such as SEQUEST (
      • Eng J.K.
      • McCormack A.L.
      • Yates J.R.
      An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
      ) or Mascot (
      • Perkins D.N.
      • Pappin D.J.
      • Creasy D.M.
      • Cottrell J.S.
      Probability-based protein identification by searching sequence databases using mass spectrometry data.
      ) are filtered using a user-specified set of criteria to remove false identifications. Second, accession numbers and annotations of protein sequence database entries corresponding to each peptide are retrieved from the sequence database. Third, peptides are grouped by their corresponding sequence database entries. Fourth, shared peptides are apportioned among all corresponding proteins, and a summary protein list is created. Ideally the apportionment of peptides to proteins should be done using a probability-based approach, i.e. taking into account the probabilities of peptide assignments (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ). This has an advantage in that it allows calculation of statistical confidence measures for protein identifications and estimation of false identification error rates resulting from filtering the data (
      • Nesvizhskii A.I.
      • Aebersold R.
      Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS.
      ,
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ).
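      The four steps above can be sketched in a few lines of Python. This is a simplified illustration and not the implementation of any cited tool: the function name `assemble_proteins`, the `min_prob` threshold, and the input format are hypothetical, and shared peptides are apportioned with equal weights (one divided by the number of candidate proteins), whereas a full probability-based model would estimate those weights from the data. Protein probabilities are combined under the assumption that peptide identifications are independent.

```python
from collections import defaultdict
from math import prod


def assemble_proteins(peptide_hits, min_prob=0.05):
    """Assemble filtered peptide assignments into protein probabilities.

    peptide_hits: list of (peptide, probability, accessions) tuples with
    one entry per distinct peptide (hypothetical input format).
    """
    # Step 1: filter out low-probability peptide assignments.
    kept = [hit for hit in peptide_hits if hit[1] >= min_prob]

    # Steps 2 and 3: group peptides by their database entries.
    by_protein = defaultdict(list)
    for peptide, prob, accessions in kept:
        # Step 4a: apportion a shared peptide equally (assumed weighting).
        weight = 1.0 / len(accessions)
        for accession in accessions:
            by_protein[accession].append((peptide, prob, weight))

    # Step 4b: probability that at least one weighted peptide is correct.
    return {
        accession: 1.0 - prod(1.0 - w * p for _, p, w in entries)
        for accession, entries in by_protein.items()
    }


hits = [
    ("PEPTIDEA", 0.90, ["P1"]),        # unique to P1
    ("PEPTIDEB", 0.80, ["P1", "P2"]),  # shared between P1 and P2
    ("PEPTIDEC", 0.01, ["P3"]),        # removed by the filter
]
print(assemble_proteins(hits))
```

      With equal weights, a protein supported only by shared peptides receives a reduced probability, which is the qualitative behavior the probability-based apportionment is meant to achieve.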
      The format in which the results of shotgun proteomic experiments are presented to the user varies between the tools. In ProteinProphet (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ), each separate entry in the protein summary file is assigned a probability that the corresponding protein is present in the sample. Indistinguishable proteins are collapsed into a single entry, and all members of protein groups, such as the α-tubulin family shown in Fig. 3, are presented together. All subset and subsumable protein entries are assigned zero probability, which is to be interpreted as the absence of conclusive evidence for the presence of those proteins in the sample. The subset and subsumable protein entries can be located and viewed using interactive web-based options. In the Experimental Peptide Identification Repository (EPIR) (
      • Kristensen D.B.
      • Brond J.C.
      • Nielsen P.A.
      • Andersen J.R.
      • Sorensen O.T.
      • Jorgensen V.
      • Budin K.
      • Matthiesen J.
      • Veno P.
      • Jespersen H.M.
      • Ahrens C.H.
      • Schandorff S.
      • Ruhoff P.T.
      • Wisniewski J.R.
      • Bennett K.L.
      • Podtelejnikov A.V.
      Experimental Peptide Identification Repository (EPIR): an integrated peptide-centric platform for validation and mining of tandem mass spectrometry data.
      ), the notion of protein groups introduced in Ref.
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      is extended, and all entries with shared peptides are organized into a single group. The protein that contains most of the peptides is selected as an anchor, and all group members that are identified by at least one distinct peptide are marked as conclusively identified. Additional visualization tools, e.g. a tool for aligning the sequences of all proteins within a protein group, are provided to assist in the interpretation of the data. Other software tools such as Isoform Resolver (
      • Resing K.A.
      • Meyer-Arendt K.
      • Mendoza A.M.
      • Aveline-Wolf L.D.
      • Jonscher K.R.
      • Pierce K.G.
      • Old W.M.
      • Cheung H.T.
      • Russell S.
      • Wattawa J.L.
      • Goehle G.R.
      • Knight R.D.
      • Ahn N.G.
      Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics.
      ) and DBParser (
      • Yang X.
      • Dondeti V.
      • Dezube R.
      • Maynard D.M.
      • Geer L.Y.
      • Epstein J.
      • Chen X.
      • Markey S.P.
      • Kowalak J.A.
      DBParser: web-based software for shotgun proteomic data analyses.
      ) create protein summary lists containing all protein sequence database entries identified by at least one peptide with proteins that share a set of peptides placed adjacent to each other. In Isoform Resolver, the protein summary lists are presented in a text format, and a peptide-centric numbering scheme is used to specify what proteins are identified conclusively. DBParser outputs the results in an interactive web-based format that allows the user to view both the redundant and the minimal list of proteins.
      DTASelect (
      • Tabb D.L.
      • McDonald W.H.
      • Yates J.R.
      DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics.
      ) is another widely used tool for processing of shotgun proteomic data. However, it does not provide any statistical confidence measures for protein and peptide identifications, and its approach for assembling peptides into proteins in the presence of shared peptides has not been fully described. In addition, new tools are being developed at increasing speed, including commercial programs that combine the process of peptide identification and the subsequent assembly of peptides into proteins (
      • Allet N.
      • Barrillat N.
      • Baussant T.
      • Boiteau C.
      • Botti P.
      • Bougueleret L.
      • Budin N.
      • Canet D.
      • Carraud S.
      • Chiappe D.
      • Christmann N.
      • Colinge J.
      • Cusin I.
      • Dafflon N.
      • Depresle B.
      • Fasso I.
      • Frauchiger P.
      • Gaertner H.
      • Gleizes A.
      • Gonzalez-Couto E.
      • Jeandenans C.
      • Karmime A.
      • Kowall T.
      • Lagache S.
      • Mahe E.
      • Masselot A.
      • Mattou H.
      • Moniatte M.
      • Niknejad A.
      • Paolini M.
      • Perret F.
      • Pinaud N.
      • Ranno F.
      • Raimondi S.
      • Reffas S.
      • Regamey P.O.
      • Rey P.A.
      • Rodriguez-Tome P.
      • Rose K.
      • Rossellat G.
      • Saudrais C.
      • Schmidt C.
      • Villain M.
      • Zwahlen C.
      In vitro and in silico processes to identify differentially expressed proteins.
      ).
      SpectrumMill (www.chem.agilent.com).
      This diversity of computational tools, a positive development reflecting the increased use of shotgun proteomics, nevertheless presents a significant challenge for developing any kind of standards for the analysis and journal publication of proteomic datasets (
      • Nesvizhskii A.I.
      • Aebersold R.
      Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS.
      ,
      • Carr S.
      • Aebersold R.
      • Baldwin M.
      • Burlingame A.
      • Clauser K.
      • Nesvizhskii A.
      The need for guidelines in publication of peptide and protein identification data.
      ). It is thus essential that the computational tools are made transparent (published) and extensively tested and that the methods for assembling peptides into proteins and presenting the results all follow the same set of general guidelines such as those described in this article.

      PROTEIN SEQUENCE DATABASES

      Computational analysis and biological interpretation of shotgun proteomic data require selection of a reference protein sequence database. For some organisms, e.g. human, several different databases exist that vary in terms of completeness, degree of redundancy, and quality of sequence annotation (
      • Apweiler R.
      • Bairoch A.
      • Wu C.H.
      Protein sequence databases.
      ). Table I and Fig. 7 summarize some of the existing protein sequence databases that are commonly used with mass spectrometry data. The choice of a particular database should be based on the goals of the experiment.
      Table I. Summary of the protein sequence databases that are commonly used in shotgun proteomic analysis
      Database, date (version) | Number of sequences; size of file (human) | Description; source databases | Organisms | Release; update frequency; maintained by
      Uni-Prot/Swiss-Prot, 02/15/2005 | 11,898; 7.8 Mb | Expertly curated; high level of annotation; minimum level of redundancy; high level of integration with other databases. | Many | Release every 4 months; updates every 2 weeks; EBI, SIB, Georgetown University
      Uni-Prot/TrEMBL, 02/15/2005 | 52,052; 23.3 Mb | Computer-annotated supplement to Uni-Prot/Swiss-Prot. Contains translated coding sequences from GenBank™ nucleotide database, protein sequences extracted from the literature or submitted to Uni-Prot/Swiss-Prot but not yet manually curated. | Many | Release every 4 months; updates every 2 weeks; EBI, SIB, Georgetown University
      RefSeq, 08/26/2004 (R 9) | 27,960; 17.7 Mb | Ongoing curation by NCBI staff; non-redundant; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. | Many | Release every ∼3 months; NCBI
      Ensembl, 02/2005 (version 28-35a) | 33,860; 21.1 Mb | Created using automated genome annotation pipeline; eukaryotic genomes only; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. Peptides identified by MS/MS can be mapped to the genome via Ensembl Protein database and visualized using Ensembl Genome Browser. | 16 organisms | Every 1–2 months; EBI and Wellcome Trust Sanger Institute
      IPI, 02/2005 (version 3.03) | 48,953; 28.9 Mb | Good balance between degree of redundancy and completeness; references to the primary data sources; attempts to maintain stable identifiers (with incremental versioning), but still in flux. Assembled from Uni-Prot (Swiss-Prot + TrEMBL), RefSeq, Ensembl, H-Invitational database. | 5 organisms | Monthly; EBI
      Entrez Protein (NCBInr), 02/17/2005 | 115,926; 58.5 Mb | More complete with regard to sequence polymorphisms and splice forms; annotations extracted from curated databases; high degree of sequence redundancy makes interpretation difficult. Assembled from GenBank™ and RefSeq coding sequence translations, Protein Information Resource (PIR), Protein Data Bank (PDB), Uni-Prot/Swiss-Prot, Protein Research Foundation (PRF). | Many | Frequent updates; NCBI
      Fig. 7. Protein sequence databases differ in terms of their completeness and the degree of sequence redundancy. A, the total number of tryptic peptides with no missed cleavages (Ntot) and the number of unique sequences among them (Nunique), in the range of molecular weights between 600 and 3000, in each of the human protein sequence databases listed in Table I. B, a measure of the database sequence redundancy (average number of database entries containing each unique tryptic peptide sequence), estimated by taking the ratio Ntot/Nunique, plotted as a function of peptide molecular weight (bin size of 50 mass units) for the same databases. SP, Swiss-Prot.
      When peptides are assigned to MS/MS spectra using the database search approach, the universe of all potential peptide assignments is limited to the sequences present in the searched protein sequence database. The completeness of the sequence database thus can be a decisive factor in experiments where identification of sequence polymorphisms is crucial for the biological interpretation of the data. In those cases, a large database such as Entrez Protein (also known as the non-redundant NCBI database, NCBInr) (
      • Wheeler D.L.
      • Church D.M.
      • Edgar R.
      • Federhen S.
      • Helmberg W.
      • Madden T.L.
      • Pontius J.U.
      • Schuler G.D.
      • Schriml L.M.
      • Sequeira E.
      • Suzek T.O.
      • Tatusova T.A.
      • Wagner L.
      Database resources of the National Center for Biotechnology Information: update.
      ) would have an advantage over smaller databases such as Uni-Prot/Swiss-Prot (
      • Boeckmann B.
      • Bairoch A.
      • Apweiler R.
      • Blatter M.
      • Estreicher A.
      • Gasteiger E.
      • Martin M.J.
      • Michoud K.
      • O’Donovan C.
      • Phan I.
      • Pilbout S.
      • Schneider M.
      The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.
      ) or RefSeq (
      • Pruitt K.D.
      • Tatusova T.
      • Maglott D.R.
      NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.
      ). The Entrez Protein database, for example, contains twice as many unique tryptic peptide sequences as Uni-Prot/Swiss-Prot (Fig. 7A). At the same time, large sequence databases contain, in addition to true biologically significant sequence variants, numerous artificial redundancies arising e.g. from partial mRNAs or sequencing errors (see example in Fig. 4). Fig. 7B plots the average number of database entries containing each unique tryptic (with no missed cleavages) peptide sequence as a function of peptide molecular weight. For example, in the range of molecular weights around 1000, the majority of tryptic peptides in the Swiss-Prot database are distinct (Ntot/Nunique ∼ 1), whereas in the Entrez Protein database each peptide is present on average in three different entries (Ntot/Nunique ∼ 3). In the absence of good sequence annotation in large protein sequence databases such as Entrez Protein database, it becomes necessary to perform time-consuming manual analysis and elimination of database redundancies. Furthermore searching such large databases makes it more difficult to separate the correct from random (incorrect) peptide assignments to MS/MS spectra.
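      The redundancy measure Ntot/Nunique can be estimated with a short in silico digestion. The Python sketch below is a simplified illustration under stated assumptions: the function names and the dictionary input format are hypothetical, cleavage is assumed after every Lys or Arg with no missed cleavages and no exception for a following proline, peptide masses are monoisotopic, and peptides containing non-standard residue codes are skipped. It returns a single overall ratio rather than the per-mass-bin values plotted in Fig. 7B.

```python
import re
from collections import Counter

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056


def tryptic_peptides(sequence):
    """Cleave after every K or R; no missed cleavages (simplified rule)."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)


def peptide_mass(peptide):
    """Monoisotopic peptide mass, or None for non-standard residues."""
    try:
        return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS
    except KeyError:
        return None


def redundancy(database, min_mass=600.0, max_mass=3000.0):
    """Return (Ntot, Nunique, Ntot/Nunique) for a dict of accession -> sequence."""
    counts = Counter()
    for sequence in database.values():
        for peptide in tryptic_peptides(sequence):
            mass = peptide_mass(peptide)
            if mass is not None and min_mass <= mass <= max_mass:
                counts[peptide] += 1
    n_tot, n_unique = sum(counts.values()), len(counts)
    return n_tot, n_unique, (n_tot / n_unique if n_unique else float("nan"))


# Toy database: two entries share one peptide, a third adds a distinct one.
toy_db = {"a": "AAGTVFTTVEDLGSKGK", "b": "AAGTVFTTVEDLGSK", "c": "VPDFSEYR"}
print(redundancy(toy_db))
```

      Running the same calculation separately for each 50-unit mass bin would give a curve of the kind described for Fig. 7B.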
      When the quality of the sequence annotation and the ease of data interpretation are more important than the ability to identify sequence variants, it is more appropriate to use well curated databases such as Swiss-Prot or RefSeq. A good balance between the completeness and the level of redundancy is found in the International Protein Index (IPI) database (
      • Kersey P.J.
      • Duarte J.
      • Williams A.
      • Karavidopoulou Y.
      • Birney E.
      • Apweiler R.
      The International Protein Index: an integrated database for proteomics experiments.
      ), which is available for a number of organisms including human and mouse. The sequence- and identifier-based construction of this database significantly reduces the need for manual filtering while maintaining cross-references to all its source data, which include Ensembl (
      • Birney E.
      • Andrews D.
      • Bevan P.
      • Caccamo M.
      • Cameron G.
      • Chen Y.
      • Clarke L.
      • Coates G.
      • Cox T.
      • Cuff J.
      • Curwen V.
      • Cutts T.
      • Down T.
      • Durbin R.
      • Eyras E.
      • Fernandez-Suarez X.M.
      • Gane P.
      • Gibbins B.
      • Gilbert J.
      • Hammond M.
      • Hotz H.
      • Iyer V.
      • Kahari A.
      • Jekosch K.
      • Kasprzyk A.
      • Keefe D.
      • Keenan S.
      • Lehvaslaiho H.
      • McVicker G.
      • Melsopp C.
      • Meidl P.
      • Mongin E.
      • Pettett R.
      • Potter S.
      • Proctor G.
      • Rae M.
      • Searle S.
      • Slater G.
      • Smedley D.
      • Smith J.
      • Spooner W.
      • Stabenau A.
      • Stalker J.
      • Storey R.
      • Ureta-Vidal A.
      • Woodwark C.
      • Clamp M.
      • Hubbard T.
      Ensembl 2004.
      ), Uni-Prot (Swiss-Prot and its supplement TrEMBL) (
      • Boeckmann B.
      • Bairoch A.
      • Apweiler R.
      • Blatter M.
      • Estreicher A.
      • Gasteiger E.
      • Martin M.J.
      • Michoud K.
      • O’Donovan C.
      • Phan I.
      • Pilbout S.
      • Schneider M.
      The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.
      ), and RefSeq (
      • Pruitt K.D.
      • Tatusova T.
      • Maglott D.R.
      NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.
      ). Minor sequence variants, however, are not represented in the IPI database.
      Genomic databases can also be used for MS/MS database searching (
      • Kuster B.
      • Mortensen P.
      • Andersen J.S.
      • Mann M.
      Mass spectrometry allows direct identification of proteins in large genomes.
      ,
      • Choudhary J.S.
      • Blackstock W.P.
      • Creasy D.M.
      • Cottrell J.S.
      Interrogating the human genome using uninterpreted mass spectrometry data.
      ), which can lead to the identification of novel alternative splice forms and sequence polymorphisms not present in the protein sequence databases. However, this type of analysis can be computationally intensive because of the large size of those databases. It is also complicated by frameshifts, incorrectly predicted open reading frames, and the poor quality of many EST sequences. This combined with the poor quality of many experimental MS/MS spectra can lead to high numbers of false identifications. A more efficient strategy for the identification of novel alternative splice forms or sequence variants is to perform computational analysis in an iterative fashion. In this approach, the analysis would start with searching MS/MS spectra against a well annotated database (e.g. RefSeq or IPI). The high quality spectra left unassigned in the initial search are then reanalyzed more extensively, first searching for post-translationally modified peptides and only then against large genomic databases.
      A. I. Nesvizhskii et al., manuscript in preparation.
      An important caveat to keep in mind when interpreting shotgun proteomic data is that the protein sequence databases are constantly in flux especially with regard to minor sequence variants, alternative splice forms, and other less well characterized gene products. With each new database update, some protein sequences disappear, the annotation and accession numbers of the remaining sequences can change, and new sequences can be added. The instability of the current sequence databases is largely due to a substantial amount of work being carried out to improve their completeness and the quality of sequence annotation, a process that is likely to continue for a significant period of time. This has significant implications in that interpretation of the MS/MS-based proteomic data, e.g. assignment of peptides to entries in the protein sequence database and conclusions about the presence of a particular protein isoform in the sample, depends on the version of the protein sequence database used in the analysis. Frequent updating of the sequence databases by the database providers can complicate ongoing proteomic experiments. Researchers using these databases in the analysis of their data often have to reanalyze previously acquired and processed MS/MS spectra using a new version of the database or develop bioinformatic tools for automated mapping of peptide sequences, identified by searching MS/MS spectra against an older version of the database, to the latest version of that database. It is important to note that the data coming from MS-based proteomic experiments can itself be used to assist in the process of improving protein sequence databases provided a mechanism is developed for communicating the sequences of peptides identified by searching those databases back to the database developers and annotators (
      • Mann M.
      • Pandey A.
      Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases.
      ,
      • Desiere F.
      • Deutsch E.W.
      • Nesvizhskii A.I.
      • Mallick P.
      • King N.L.
      • Eng J.K.
      • Aderem A.
      • Boyle R.
      • Brunner E.
      • Donohoe S.
      • Fausto N.
      • Hafen E.
      • Hood L.
      • Katze M.G.
      • Kennedy K.A.
      • Kregenow F.
      • Lee H.
      • Lin B.
      • Martin D.
      • Ranish J.A.
      • Rawlings D.J.
      • Samelson L.E.
      • Shiio Y.
      • Watts J.D.
      • Wollscheid B.
      • Wright M.E.
      • Yan W.
      • Yang L.
      • Yi E.C.
      • Zhang H.
      • Aebersold R.
      Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry.
      ).
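      The automated mapping of previously identified peptides to a newer database release, mentioned above, can be as simple as an exact substring search. The sketch below is a minimal illustration under that assumption; the function name `remap_peptides`, the status labels, and the input formats are hypothetical, and exact matching will miss peptides affected by sequence corrections in the new release.

```python
def remap_peptides(old_assignments, new_database):
    """Map peptides identified with an old database release onto a new one.

    old_assignments: dict of peptide -> list of accessions in the old release.
    new_database: dict of accession -> protein sequence in the new release.
    Returns a dict of peptide -> (new accessions, status).
    """
    report = {}
    for peptide, old_accessions in old_assignments.items():
        # Exact substring match against every entry of the new release.
        new_accessions = sorted(
            acc for acc, seq in new_database.items() if peptide in seq
        )
        if not new_accessions:
            status = "lost"       # peptide no longer present in any entry
        elif new_accessions == sorted(old_accessions):
            status = "unchanged"  # same set of entries as before
        else:
            status = "changed"    # entries added, removed, or renamed
        report[peptide] = (new_accessions, status)
    return report


old = {"AAGTVFK": ["X1"], "DEGR": ["X1", "X2"], "LLSTR": ["X3"]}
new_db = {"X1": "MSSAAGTVFKDEGR", "X4": "MKDEGR"}
print(remap_peptides(old, new_db))
```

      Peptides reported as changed or lost are the ones whose protein-level interpretation needs to be revisited after a database update.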

      IDENTIFICATION OF MATURE FORMS OF PROTEINS

      The discussion so far has mostly focused on the problem of assigning peptides to proteins and distinguishing between different protein forms whose sequences are present in the protein sequence database. A closely related issue is the difficulty of using shotgun proteomic data to provide conclusive information regarding the mature form of the sample proteins. First, most existing protein sequence databases contain entries that are derived from full cDNAs encoding preprocessed forms. Thus, they do not typically contain the mature forms derived from various post-translational processing mechanisms, e.g. removal of the leading methionine, cleavage of the signal or transit peptide, etc. Second, even if all mature forms were annotated in the protein sequence database, distinguishing between different protein isoforms would be difficult. For example, a mere observation that none of the identified peptides are coming from the N-terminal region of the protein does not necessarily indicate the cleavage of the presequence. It can be explained by other factors, e.g. the absence of identifiable tryptic peptides in that region.
      In some cases, post-translational processing events can be inferred using the knowledge regarding the specificity of the proteolytic enzyme used to digest proteins into peptides. For example, the enzyme trypsin cleaves after arginine and lysine residues. A peptide resulting from trypsin digestion should contain Lys or Arg at its C terminus (unless it is located at the C terminus of the protein), and in the sequence of its corresponding protein the residue immediately preceding the peptide should also be Lys or Arg (unless the peptide is located at the N terminus). Thus, identification of a peptide whose sequence does not adhere to the enzymatic digestion constraint at one of its termini could indicate that the mature form of the protein is present in the sample. One such example is shown in Fig. 8A where identification of a “partially tryptic” peptide (not tryptic at its N terminus), assigned to the “Basigin precursor” database entry (Swiss-Prot accession number P35613), suggests that the mature form of that protein resulting from the proteolytic cleavage of the 22-residue-long signal peptide is present in the biological sample. In general, identifying peptides that are not tryptic (assuming no protein cleavage) and are located close to the N terminus of the protein can be a useful strategy for inferring signal peptide cleavage sites or other proteolytic cleavage events, thus confirming, refining, or adding to the annotations currently available in the protein databases such as Swiss-Prot. In some cases, it can also assist in discrimination between different protein isoforms resulting from alternative splicing. It should be noted, however, that some partially tryptic peptides can be observed due to in-source or in-solution fragmentation of the originally tryptic peptides. Thus, conclusions based on the observation of partially tryptic peptides require additional scrutiny.
      Fig. 8. Identification of mature forms of proteins in shotgun proteomics. A, identification of a partially tryptic peptide, AAGTVFTTVEDLGSK, indicates the removal of the 22-residue-long signal peptide. B, identification of two partially tryptic peptides, RPLVASVGLNVPASVCY and SHTDIKVPDFSEYR, located adjacent to each other in the protein sequence indicates the cleavage of the 78-amino acid presequence of the ubiquinol-cytochrome c reductase iron-sulfur subunit. The processed presequence remains as a subunit of a protein complex.
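      The tryptic digestion constraint described above translates directly into a check on the two termini of a peptide within its parent sequence. The following Python sketch is illustrative: the function name `tryptic_termini` and the toy protein sequence are hypothetical (the sequence is not a real database entry), and only the first occurrence of the peptide in the protein is examined.

```python
def tryptic_termini(protein, peptide):
    """Return (n_term_ok, c_term_ok) for a peptide within a protein sequence.

    A terminus is consistent with trypsin cleavage if it follows Lys or Arg
    or coincides with the corresponding terminus of the protein.
    """
    start = protein.find(peptide)  # first occurrence only
    if start < 0:
        raise ValueError("peptide not found in protein sequence")
    end = start + len(peptide)
    # N terminus: preceded by K or R, or located at the protein N terminus.
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    # C terminus: ends in K or R, or located at the protein C terminus.
    c_term_ok = end == len(protein) or peptide[-1] in "KR"
    return n_term_ok, c_term_ok


toy_protein = "MSSLLAGAAGTVFKDEGR"  # hypothetical sequence for illustration
print(tryptic_termini(toy_protein, "AAGTVFK"))  # not tryptic at its N terminus
print(tryptic_termini(toy_protein, "DEGR"))     # fully tryptic
```

      A peptide for which exactly one of the two values is false is partially tryptic in the sense used here; as noted in the text, such a peptide is a candidate indicator of a processing event and not proof of one.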
      Another example is shown in Fig. 8B where several peptides were identified and assigned to a single protein sequence database entry “ubiquinol-cytochrome c reductase iron-sulfur subunit, mitochondrial precursor” (Swiss-Prot accession number P47985, Rieske protein). Two of the peptides, RPLVASVGLNVPASVCY and SHTDIKVPDFSEYR, are partially tryptic and located adjacent to each other in the protein sequence, which suggests the cleavage of the 78-amino acid presequence (annotated in Swiss-Prot as “transit” peptide). Interestingly the identification of several peptides assigned to the N-terminal region of the protein indicates that the presequence has not been degraded. (Validation of the biological significance of this observation would require additional analysis to eliminate the possibility that the cleavage of the protein occurred at the digestion stage because of non-biological reasons, e.g. chymotrypsin-like secondary activity of trypsin.) This observation is consistent with the results of previous studies suggesting that the Rieske protein in the mammalian systems is processed in a single proteolytic step after it becomes associated with the cytochrome bc1 complex and that the processed presequence remains as a subunit of the complex (
      • Brandt U.
      • Yu L.
      • Yu C.A.
      • Trumpower B.L.
      The mitochondrial targeting presequence of the Rieske iron-sulfur protein is processed in a single step after insertion into the cytochrome bc1 complex in mammals and retained as a subunit in the complex.
      ).
      The strategy for the detection of proteolytic cleavage events, described above, relies on the identification of the N- and C-terminal peptides. However, in shotgun analysis of complex protein mixtures, the protein coverage (the number of identified peptides per protein) is typically low especially in the case of low abundance proteins. Thus, such events would only be detected for a fraction of all proteins, typically those of high abundance. The efficiency of the method can be improved by using targeted protein identification strategies designed to increase the likelihood of identifying N- and C-terminal peptides. One such strategy is based on isolation of N-terminal peptides from in vivo N-terminus-blocked proteins using fractional diagonal chromatography (
      • Gevaert K.
      • Goethals M.
      • Martens L.
      • Van Damme J.
      • Staes A.
      • Thomas G.R.
      • Vandekerckhove J.
      Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides.
      ). The method can be further improved by optimizing the computational MS/MS data interpretation strategies to specifically look for peptides indicative of the proteolytic cleavage (
      • Song H.
      • Hecimovic S.
      • Goate A.
      • Hsu F.F.
      • Bao S.
      • Vidavsky I.
      • Ramanadham S.
      • Turk J.
      Characterization of N-terminal processing of group VIA phospholipase A2 and of potential cleavage sites of amyloid precursor protein constructs by automated identification of signature peptides in LC/MS/MS analyses of proteolytic digests.
      ).

      QUANTITATIVE PROTEOMICS

      Mass spectrometry is increasingly used not only for the identification of proteins but also for their quantification (quantitative proteomics) (for recent reviews, see Refs.
      • Aebersold R.
      • Mann M.
      Mass spectrometry-based proteomics.
      and
      • Zhang H.
      • Yan W.
      • Aebersold R.
      Chemical probes and tandem mass spectrometry: a strategy for the quantitative analysis of proteomes and subproteomes.
      ,
      • Julka S.
      • Regnier F.
      Quantification in proteomics through stable isotope coding: a review.
      ,
      • Goshe M.B.
      • Smith R.D.
      Stable isotope-coded proteomic mass spectrometry.
      ). The two problems are interdependent and in fact complementary, e.g. the quantitative information can be used to resolve some of the peptide grouping ambiguities.
      Although methods are being developed for the determination of absolute protein abundance levels (
      • Gerber S.A.
      • Rush J.
      • Stemman O.
      • Kirschner M.W.
      • Gygi S.P.
      Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS.
      ,
      • Aebersold R.
      Constellations in a cellular universe.
      ,
      • Kuster B.
      • Schirle M.
      • Mallick P.
      • Aebersold R.
      ), most current quantitative proteomic experiments are based on the determination of relative protein expression levels between two or more different pools of proteins. In the most straightforward application, quantitative proteomics is used as the equivalent of the microarray gene expression profiling approach (
      • Schena M.
      Microarray Analysis.
      ) except that the measurement is performed at the protein, rather than mRNA, level. The shotgun proteomic approach can be made quantitative by applying stable isotope labeling of proteins or peptides. This is illustrated in Fig. 9A using the most common case of a two-sample comparison. The compared samples can represent two different cell states (e.g. before and after a perturbation) or cells grown under different conditions. The proteins are labeled separately with either light (sample 1) or heavy (sample 2) stable isotopes. The labeling can be done in a number of ways, e.g. chemically (ICAT, iTRAQ, etc.) or metabolically (e.g. SILAC) (for reviews, see Refs.
      • Aebersold R.
      • Mann M.
      Mass spectrometry-based proteomics.
      and
      • Zhang H.
      • Yan W.
      • Aebersold R.
      Chemical probes and tandem mass spectrometry: a strategy for the quantitative analysis of proteomes and subproteomes.
      ,
      • Julka S.
      • Regnier F.
      Quantification in proteomics through stable isotope coding: a review.
      ,
      • Goshe M.B.
      • Smith R.D.
      Stable isotope-coded proteomic mass spectrometry.
      ). Proteins from both samples are mixed and enzymatically digested into peptides. Labeled peptides are separated and subjected to sequencing and quantification using mass spectrometry. Peptides are identified from MS/MS spectra as described previously, and the quantitative information is extracted either from MS spectra (e.g. in ICAT- or SILAC-based quantitative methods) or directly from MS/MS spectra (iTRAQ) using software tools specifically developed for that purpose (
      • Han D.K.
      • Eng J.
      • Zhou H.
      • Aebersold R.
      Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry.
      ,
      • Li X.J.
      • Zhang H.
      • Ranish J.A.
      • Aebersold R.
      Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry.
      ,
      • MacCoss M.J.
      • Wu C.C.
      • Liu H.
      • Sadygov R.
      • Yates J.R.
      A correlation algorithm for the automated quantitative analysis of shotgun proteomics data.
      ,
      • Halligan B.D.
      • Slyper R.Y.
      • Twigger S.N.
      • Hicks W.
      • Olivier M.
      • Greene A.S.
      ZoomQuant: an application for the quantitation of stable isotope labeled peptides.
      ). Quantification is based on measuring relative ion intensity of heavy and light labeled peptide ions. Relative abundances of peptides between the two samples are then combined to compute the relative protein abundances. In addition to global protein profiling experiments, the same quantitative strategy can be used in a targeted way, e.g. for distinguishing members of macromolecular complexes or cell organelles from nonspecifically co-purifying proteins (
      • Ranish J.A.
      • Yi E.C.
      • Leslie D.M.
      • Purvine S.O.
      • Goodlett D.R.
      • Eng J.
      • Aebersold R.
      The study of macromolecular complexes by quantitative proteomics.
      ,
      • Foster L.J.
      • De Hoog C.L.
      • Mann M.
      Unbiased quantitative proteomics of lipid rafts reveals high specificity for signaling factors.
      ,
      • Marelli M.
      • Smith J.J.
      • Jung S.
      • Yi E.
      • Nesvizhskii A.I.
      • Christmas R.H.
      • Saleem R.A.
      • Tam Y.Y.C.
      • Fagarasanu A.
      • Goodlett D.R.
      • Aebersold R.
      • Rachubinski R.A.
      • Aitchison J.D.
      Quantitative mass spectrometry reveals a role for the GTPase Rho1p in actin organization on the peroxisome membrane.
      ,
      • Gingras A.C.
      • Aebersold R.
      • Raught B.
      Advances in protein complex analysis using mass spectrometry.
      ). It should also be mentioned that although the discussion here is centered on the quantitative proteomic approach based on isotopic labeling, it applies equally to semiquantitative methods based on simple peptide counts (
      • Liu H.
      • Sadygov R.G.
      • Yates J.R.
      A model for random sampling and estimation of relative protein abundances in shotgun proteomics.
      ,
      • Blondeau F.
      • Ritter B.
      • Allaire P.D.
      • Wasiak S.
      • Girard M.
      • Hussain N.K.
      • Angers A.
      • Legendre-Guillemin V.
      • Roy L.
      • Boismenu D.
      • Kearney R.E.
      • Bell A.W.
      • Bergeron J.J.
      • McPherson P.S.
      Tandem MS analysis of brain clathrin-coated vesicles reveals their critical involvement in synaptic vesicle recycling.
      ) or on peptide ion current profiling (
      • Chelius D.
      • Bondarenko P.V.
      Quantitative profiling of proteins in complex mixtures using liquid chromatography and mass spectrometry.
      ,
      • Wang W.
      • Zhou H.
      • Lin H.
      • Roy S.
      • Shaler T.A.
      • Hill L.R.
      • Norton S.
      • Kumar P.
      • Anderle M.
      • Becker C.H.
      Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards.
      ).
      The issues discussed here are relevant to non-mass spectrometry-based protein quantification methods as well. For example, confirmation by Western blots can be equally misleading especially if anti-peptide antibodies are used and the peptide is shared.
      Fig. 9. Quantitative shotgun proteomic analysis using stable isotopes. A, in one quantitative method (ICAT), proteins are labeled using light or heavy mass tags and then digested into peptides. Labeled peptides are captured and sequenced using tandem mass spectrometry. Peptides are identified from MS/MS spectra using database searching and used to infer which proteins are present in the sample. Relative abundances of peptides between the compared samples are extracted from MS data, and then the relative protein abundance ratios are computed based on the ratios of observed peptides. The relative abundance ratio of a distinct peptide is a direct measure of the abundance ratio of its protein (for peptide 1, protein A; for peptide 3, protein B; and for peptide 4, protein C), whereas it is a weighted average of the abundance ratios of all its corresponding proteins in the case of a shared peptide (peptides 2 and 5). B, connection between the relative quantification observed at peptide and protein levels. Distinct peptides 1 and 3 directly measure the relative protein abundance ratios of their corresponding proteins A and B, RA and RB. The relative abundance ratio of the shared peptide 2, r2, can be anywhere between the protein ratios RA and RB depending on the absolute abundances of A and B. Quantitative information can be used to resolve some cases of shared peptides. If peptides 4 and 5 have significantly different ratios r4 and r5, this can be explained by the presence of protein D in the sample.
      The relative protein abundance ratios between the compared samples are computed based on the ratios of observed peptides. For a distinct peptide, its relative abundance ratio is a direct measure of the abundance ratio of its corresponding protein. (This is not entirely correct because peptides can be differentially modified, e.g. phosphorylated, under different conditions.) In contrast, the relative abundance ratio in the case of a shared peptide is a weighted average of the abundance ratios of all its corresponding proteins with the weighting factors being determined by the absolute abundance of those proteins in the samples. This is illustrated in Fig. 9A where two differentiable proteins, A and B, are inferred to be present in the samples based on the identification of three peptides (proteins C and D are discussed later in this section). In this example, one of the peptides (peptide 2) is shared between the two proteins, and the other two peptides (peptides 1 and 3) are unique to protein A or B, respectively. The relative protein abundance ratios of these proteins, RA and RB, can be measured using the relative abundance ratios of their distinct peptides, r1 and r3, respectively (see Fig. 9B). The relative abundance ratio of the shared peptide 2, r2, can be anywhere between the protein ratios RA and RB depending on the absolute abundances of A and B in both samples that are being compared, NA, NB (sample 1) and NA′, NB′ (sample 2).
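      The weighted-average relationship can be made concrete with a short calculation. The sketch below rests on assumptions that go beyond the text: ratios are expressed as sample 2 over sample 1, a shared peptide is released and detected equally well from each parent protein, and abundances are in arbitrary units. Under these assumptions r2 = (NA′ + NB′)/(NA + NB) = (NA·RA + NB·RB)/(NA + NB), so the weights are the sample 1 abundances. The function name and all numbers are illustrative only.

```python
def shared_peptide_ratio(abundances_s1, protein_ratios):
    """Expected sample2/sample1 ratio of a peptide shared by several proteins.

    abundances_s1  -- absolute abundance of each parent protein in sample 1
    protein_ratios -- sample2/sample1 abundance ratio of each parent protein
    """
    if len(abundances_s1) != len(protein_ratios):
        raise ValueError("need exactly one ratio per protein")
    total_s1 = sum(abundances_s1)
    if total_s1 <= 0:
        raise ValueError("total sample 1 abundance must be positive")
    # The sample 2 abundance of each protein is N' = N * R.
    total_s2 = sum(n * r for n, r in zip(abundances_s1, protein_ratios))
    return total_s2 / total_s1


# Hypothetical case: protein A doubles (RA = 2), protein B is unchanged (RB = 1).
r2_when_b_dominates = shared_peptide_ratio([10.0, 90.0], [2.0, 1.0])  # near RB
r2_when_a_dominates = shared_peptide_ratio([90.0, 10.0], [2.0, 1.0])  # near RA
```

      With the same protein ratios, the shared-peptide ratio moves from close to RB to close to RA purely as a function of which protein is more abundant, which is why a shared peptide alone cannot be used to quantify either protein.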
      An example of this kind is shown in Fig. 10. In that experiment, lipid rafts were isolated from both control and stimulated Jurkat human T cells, and the protein samples were quantitatively compared using the ICAT method (
      • Von Haller P.D.
      • Yi E.
      • Donohoe S.
      • Vaughn K.
      • Keller A.
      • Nesvizhskii A.I.
      • Eng J.
      • Li X.J.
      • Goodlett D.R.
      • Aebersold R.
      • Watts J.D.
      The application of new software tools to quantitative protein profiling via ICAT and tandem mass spectrometry: II. Evaluation of tandem mass spectrometry methodologies for large-scale protein analysis and the application of statistical tools for data analysis and interpretation.
      ). A number of peptides were identified that are shared between several members of the guanine nucleotide-binding protein (G protein) family, including α inhibiting activity polypeptides 1, 2, and 3. Isotopically labeled peptides for which quantitative information is available (Cys-containing ICAT-labeled peptides) are shown in Fig. 10. The identification of Gi α3 and α2 proteins was also supported by several additional unlabeled distinct peptides for which no quantitative information is available (sequences not shown). The quantification of the protein Gi α3 was based on one distinct ICAT-labeled peptide that was found to be present at higher abundance in the stimulated sample compared with the control sample (relative peptide abundance ratio close to 2:1). At the same time, quantification of Gi α2 was based on five distinct ICAT-labeled peptides showing no significant difference in their abundances between the two samples (average relative abundance ratio close to 1:1). Thus, although protein quantification based on a single distinct peptide should be interpreted with caution, it appears likely that these two members of the same gene family exhibited a different response to the external stimulation with Gi α3 being up-regulated and Gi α2 not changing. Interestingly the relative abundance ratios of the other two ICAT-labeled peptides that were shared between Gi α3 and Gi α2 were much closer to that of peptides unique to Gi α2 (the shared peptides are also present in another member of the gene family, Gi α1, but no distinct peptides were identified that would suggest the presence of that protein in the sample). This indicates that the absolute abundance level of the protein Gi α2 was greater than that of Gi α3 in agreement with a rough protein abundance measure such as the number of matched MS/MS spectra, 79 versus 27, determined for Gi α2 and Gi α3 proteins, respectively.
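      Read in reverse, the weighted-average relationship gives a rough estimate of how the shared-peptide signal is split between two proteins. The sketch below assumes two parent proteins only, ratios expressed as stimulated over control, and equal peptide yield from both proteins; under these assumptions the control-sample fraction of protein A is w = (r_shared − RB)/(RA − RB). The function name is hypothetical, and the input values are round numbers chosen to resemble the pattern described above (one protein near 2:1, the other near 1:1); they are not the measured ratios from the study.

```python
def control_fraction_of_protein_a(r_shared, ratio_a, ratio_b):
    """Fraction of the shared peptide contributed by protein A in the control sample."""
    if ratio_a == ratio_b:
        raise ValueError("protein ratios are equal; the fraction is undetermined")
    fraction = (r_shared - ratio_b) / (ratio_a - ratio_b)
    if not 0.0 <= fraction <= 1.0:
        # A shared-peptide ratio outside [RB, RA] cannot be explained by these
        # two proteins alone (or reflects measurement error).
        raise ValueError("shared ratio lies outside the range of the protein ratios")
    return fraction


# Hypothetical inputs: protein A at 2:1, protein B at 1:1, shared peptide at 1.25:1.
fraction_a = control_fraction_of_protein_a(1.25, 2.0, 1.0)
fraction_b = 1.0 - fraction_a
```

      A shared-peptide ratio close to that of the unchanged protein yields a small fraction for the up-regulated one, in line with the qualitative conclusion drawn in the text from spectral counts.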
      Fig. 10. Identification and quantification of a group of peptides shared between several members of the guanine nucleotide-binding protein (G protein) family, α inhibiting activity polypeptides 1, 2, and 3 (Swiss-Prot accession numbers P04898, P04899, and P08754).
      Quantitative information can therefore be used to resolve some cases of shared peptides or suggest the presence of multiple protein isoforms having a different biological function. This is again illustrated in Fig. 9 where protein C is identified by peptides 4 and 5 having relative peptide abundance ratios r4 and r5, respectively. Peptide 5 is also present in protein D. Because there are no distinct peptides in the dataset that correspond to protein D, it is not possible to conclude that this protein is present in the sample given the sequences of identified peptides alone (subset protein identification). At the same time, if protein D is in the sample, its presence would be reflected in the relative abundance ratio of the shared peptide 5, whereas the relative abundance ratio of the distinct peptide 4 would always be determined solely by the relative abundance of protein C. Thus, significantly different ratios r4 and r5 would indicate the presence of protein D in the sample. Note that the reverse is not necessarily true, i.e. observation of consistent peptide ratios does not rule out the presence of protein D because it can simply reflect a significantly lower abundance level of protein D compared with protein C.
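      The argument about proteins C and D can be expressed as a simple check: compare the ratio of each shared peptide with the ratios of the peptides that are distinct to the reference protein, and report the other proteins sharing any peptide that disagrees. The sketch below is illustrative only. The function name, the fold-difference threshold of 1.5, and the example ratios are assumptions, and a real analysis would need a proper statistical test that accounts for measurement error.

```python
import math


def proteins_hinted_by_ratios(peptide_proteins, peptide_ratios, reference, min_fold=1.5):
    """Proteins other than `reference` suggested by discordant shared peptides.

    peptide_proteins -- dict: peptide name -> set of proteins containing it
    peptide_ratios   -- dict: peptide name -> relative abundance ratio (> 0)
    """
    distinct = [p for p, prots in peptide_proteins.items() if prots == {reference}]
    if not distinct:
        raise ValueError("reference protein has no distinct peptides")
    # Average the distinct-peptide ratios on a log2 scale.
    ref_log = sum(math.log2(peptide_ratios[p]) for p in distinct) / len(distinct)
    limit = math.log2(min_fold)
    hinted = set()
    for peptide, prots in peptide_proteins.items():
        if reference in prots and len(prots) > 1:
            if abs(math.log2(peptide_ratios[peptide]) - ref_log) > limit:
                hinted |= prots - {reference}
    return sorted(hinted)


# Hypothetical data: peptide 4 is distinct to C, peptide 5 is shared by C and D.
peptide_proteins = {"peptide_4": {"C"}, "peptide_5": {"C", "D"}}
discordant = proteins_hinted_by_ratios(
    peptide_proteins, {"peptide_4": 1.0, "peptide_5": 2.0}, "C")
concordant = proteins_hinted_by_ratios(
    peptide_proteins, {"peptide_4": 1.0, "peptide_5": 1.1}, "C")
```

      An empty result in the concordant case must not be read as evidence of absence: as stated above, consistent ratios are also expected when protein D is present at much lower abundance than protein C.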
      Furthermore observation of peptides with inconsistent relative abundance ratios when all peptides appear to be distinct (according to the protein sequence database used in the analysis) can point to the presence of a novel biologically significant protein form (e.g. novel splice variant, product of protein degradation, etc.). One such interesting example has been noticed recently in a quantitative proteomic study concerned with the identification of a human transcription factor using an ICAT proteomic approach (
      • Himeda C.L.
      • Ranish J.A.
      • Angello J.C.
      • Maire P.
      • Aebersold R.
      • Hauschka S.D.
      Quantitative proteomics identification of Six4 as the Trex-binding factor in the muscle creatine kinase enhancer.
      ). Among six identified peptides that were assigned to cellular nucleic acid-binding protein (CNBP), three peptides from the N terminus of the protein had an average relative abundance ratio (enriched sample versus control) of less than 3:1, whereas the other three peptides derived from the C-terminal portion of the protein had ratios of more than 7:1. Thus, it has been suggested that two different forms of CNBP (or CNBP and its homologue) are present in the sample. In other cases, inconsistencies in the relative peptide abundance ratio can be due to post-translational modification of the protein, e.g. if one of the peptides is phosphorylated and its abundance (in the unmodified form) is different in the compared samples.
      A close connection between the problem of assembling peptides into proteins and determining protein abundance ratios suggests a new integrated approach for dealing with quantitative proteomic data. At present, these two tasks are performed separately with the protein ratios computed using peptide ratios and the apportionment of shared peptides among their corresponding proteins performed independently of the quantitative data. Instead the apportionment of shared peptides and creation of the protein summary lists can be made dependent on the quantitative information observed at the peptide and protein level. This should enhance the interpretation of the data by resolving some of the ambiguities discussed above. However, such an approach would require high quality quantitative proteomic data. At present, the accuracy of relative peptide abundance ratios extracted from mass spectra using automated software tools often requires manual validation. This is especially true in the case of peptide “outliers,” i.e. peptides whose relative abundance ratios are significantly different from the ratios observed for other peptides assigned to the same protein, which are of utmost interest in the context of this discussion. The development of such integrated tools is an imminent task for shotgun proteomics.
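      As an illustration of what flagging peptide “outliers” might look like in practice, the sketch below compares each peptide ratio with the median ratio of the remaining peptides assigned to the same protein. It is a simple heuristic under stated assumptions, not an established tool: the function name, the 2-fold threshold, the minimum of three peptides, and the example values are all hypothetical.

```python
import math
from statistics import median


def flag_outlier_peptides(ratios, min_fold=2.0):
    """Names of peptides whose ratio differs from the median of the protein's
    other peptides by more than `min_fold` (compared on a log2 scale).

    ratios -- dict: peptide name -> relative abundance ratio (> 0)
    """
    if len(ratios) < 3:
        return []  # too few peptides to say which one is the odd one out
    logs = {name: math.log2(value) for name, value in ratios.items()}
    limit = math.log2(min_fold)
    flagged = []
    for name, value in logs.items():
        # Leave the peptide itself out so it cannot mask its own deviation.
        others = [v for other, v in logs.items() if other != name]
        if abs(value - median(others)) > limit:
            flagged.append(name)
    return flagged


# Hypothetical ratios for four peptides assigned to one protein.
example_ratios = {"pep_a": 1.0, "pep_b": 1.1, "pep_c": 0.9, "pep_d": 3.0}
outliers = flag_outlier_peptides(example_ratios)
```

      Flagged peptides are candidates for manual inspection of the underlying spectra before any biological conclusion is drawn.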

      INTEGRATION OF PROTEOMIC AND TRANSCRIPTIONAL DATA

      Quantitative MS/MS-based proteomic analysis and DNA microarray analysis are two complementary technologies that measure gene expression at the protein and RNA levels, respectively. Due to its technically more advanced stage, the microarray technology (
      • Schena M.
      Microarray Analysis.
      ) allows monitoring of RNA expression levels for a number of genes that is significantly larger than the number of proteins that can be accurately identified and quantified in a typical proteomic experiment, and it can be effectively used for the analysis of alternative splicing and genome annotation (
      • Lee C.
      • Roy M.
      Analysis of alternative splicing with microarrays: successes and challenges.
      ,
      • Johnson J.M.
      • Edwards S.
      • Shoemaker D.
      • Schadt E.E.
      Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments.
      ). However, due to post-transcriptional regulatory mechanisms such as protein translation, post-translational modifications, and degradation, the microarray measurements of mRNA expression patterns alone are not sufficient for understanding protein expression and function (
      • Gygi S.P.
      • Rochon Y.
      • Franza B.R.
      • Aebersold R.
      Correlation between protein and mRNA abundance in yeast.
      ,
      • Chen G.
      • Gharib T.G.
      • Huang C.C.
      • Taylor J.M.
      • Misek D.E.
      • Kardia S.L.R.
      • Giordano T.J.
      • Iannettoni M.D.
      • Orringer M.B.
      • Hanash S.M.
      • Beer D.G.
      Discordant protein and mRNA expression in lung adenocarcinomas.
      ). Thus, by combining transcriptional and proteomic analysis of the same samples, it becomes possible to achieve a better understanding of complex biological systems. A number of integrative proteomic and transcriptional analyses have been recently performed, including studies on model organisms and mammalian cells and tissues (
      • Griffin T.J.
      • Gygi S.P.
      • Ideker T.
      • Rist B.
      • Eng J.
      • Hood L.
      • Aebersold R.
      Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae.
      ,
      • Tian Q.
      • Stepaniants S.
      • Mao M.
      • Weng L.
      • Feetham M.C.
      • Doyle M.J.
      • Yi Y.C.
      • Dai H.
      • Thorsson V.
      • Eng J.
      • Goodlett D.
      • Berger J.P.
      • Gunter B.
      • Linsley P.S.
      • Stoughton R.B.
      • Aebersold R.
      • Collins S.J.
      • Hanlon W.A.
      • Hood L.E.
      Integrated genomic and proteomics analyses of gene expression in mammalian cells.
      ,
      • McRedmond J.P.
      • Park S.D.
      • Reilly D.F.
      • Coppinger J.A.
      • Maguire P.B.
      • Shields D.C.
      • Fitzgerald D.J.
      Integration of proteomics and genomics in platelets: a profile of platelet proteins and platelet-specific genes.
      ,
      • Maziarz M.
      • Chung C.
      • Drucker D.J.
      • Emili A.
      Integrating global proteomics and genomic expression profiles generated from islet α cells: opportunities and challenges to deriving reliable biological inferences.
      ). A recent review on the subject of integrating microarray and proteomic data can be found in Ref.
      • Cox B.
      • Kislinger T.
      • Emili A.
      Integrating gene and protein expression data: pattern analysis and profile mining.
      , and the discussion here will be limited to the issues related to the protein inference problem.
      Integration of different data types requires a good understanding of the underlying technologies and their limitations. Although a detailed review of the microarray technology goes beyond the scope of this article, it is interesting to note that many of the difficulties discussed here in the context of quantitative MS-based proteomic experiments are also present in the analysis of gene expression using microarrays. Unlike quantitative shotgun proteomics, in the case of oligonucleotide arrays the sequences of DNA probes present on the array are known in advance (the sequences of peptide “probes” in shotgun proteomics are determined from the spectra). Still ambiguities remain in connecting DNA probes to the target mRNAs (
      • Gautier L.
      • Møller M.
      • Friis-Hansen L.
      • Knudsen S.
      Alternative mapping of probes to genes for Affymetrix chips.
      ). For example, multiple probes can map to the same gene; the same probe can map to different products of the same gene or even to multiple genes. Multiple probes mapping to the same gene can produce significantly different expression ratios; outliers might indicate the presence of several alternative splice forms, but they could also be a result of inaccurate quantification (
      • Lee C.
      • Roy M.
      Analysis of alternative splicing with microarrays: successes and challenges.
      ). Furthermore cross-hybridization, i.e. binding of the labeled RNA to non-target homologous probe sequence, introduces additional errors (
      • Flikka K.
      • Yadetie F.
      • Laegreid A.
      • Jonassen I.
      XHM: a system for detection of potential cross-hybridizations in DNA microarrays.
      ).
      Integration of proteomic and transcriptional data is hindered by lack of relevant annotations and the use of different accessioning schemes. The information available for each probe present on an Affymetrix chip, for example, includes an arbitrary identification number, the GenBank™ accession number of the target RNA sequence, and brief functional annotation. In the case of MS-based proteomics, experimental MS/MS spectra are assigned peptides, and then peptides are assembled into proteins using a variety of protein sequence databases. Each protein sequence database has its unique accessioning scheme, and the degree of sequence annotation does not always allow easy cross-reference between different protein sequence databases or between protein and genomic sequence databases.
      Correlating mRNA and protein data can be facilitated by selecting a well annotated database, e.g. UniGene, as a common reference (
      • Pontius J.U.
      • Wagner L.
      • Schuler G.D.
      UniGene: a unified view of the transcriptome.
      ) (Fig. 11). The UniGene database is created by an automated partitioning of GenBank™ sequences into a non-redundant set of gene-oriented clusters with each cluster containing sequences representing a unique gene. A number of tools have been described recently that can link the probes from Affymetrix arrays to the UniGene cluster identifiers (
      • Liu G.
      • Loraine A.E.
      • Shigeta R.
      • Cline M.
      • Cheng J.
      • Valmeekam V.
      • Sun S.
      • Kulp D.
      • Siani-Rose M.A.
      NetAffx: Affymetrix probesets and annotations.
      ). In turn, MS-derived protein identification datasets can be related to the UniGene clusters using the known connection between the RefSeq protein sequence database and UniGene. A tool for direct mapping of Affymetrix probes to RefSeq sequences has also been described (
      • Gautier L.
      • Møller M.
      • Friis-Hansen L.
      • Knudsen S.
      Alternative mapping of probes to genes for Affymetrix chips.
      ).
      Fig. 11. Integration of proteomic and transcriptional data. mRNA and proteomic data can be linked using a common reference database such as UniGene.
      Although UniGene and RefSeq can provide a common reference for connecting proteomic and transcriptional data, a one-to-one correspondence will not always be possible. For example, some DNA probes cannot be linked to any UniGene clusters because their target sequences have been removed from the latest version of GenBank™ or deemed to be redundant and excluded from the UniGene build process. Furthermore in many proteomic studies, proteins are identified by searching MS/MS spectra against more complete protein sequence databases than RefSeq, e.g. IPI or Entrez Protein. Connecting protein sequences that are not annotated in RefSeq to the UniGene clusters is not straightforward. One of the main difficulties again comes from alternative splice forms. Without the ability to resolve different alternative splice forms, both on the part of proteomic and transcriptional analyses, the association between the two data types is not unique. As a result, the integration and correlation between proteomic and transcriptional data in some cases can be performed only at the gene level with mRNA and protein expression ratios averaged over multiple products of the same gene. Despite these difficulties, integrated analysis of mRNA and protein data can provide very valuable insights into complex biological systems.
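      Gene-level integration through a common reference can be sketched as two lookups followed by averaging. The example below assumes that probe-to-cluster and protein-to-cluster tables have already been obtained from the annotation resources; it does not query any real database. All identifiers are placeholders, the function names are hypothetical, and the choice of a geometric mean for averaging ratios is an assumption rather than something prescribed by the text.

```python
import math
from collections import defaultdict


def gene_level_ratios(ratios, to_cluster):
    """Geometric-mean ratio per cluster; items without a cluster are skipped."""
    logs = defaultdict(list)
    for item, ratio in ratios.items():
        cluster = to_cluster.get(item)
        if cluster is not None:
            logs[cluster].append(math.log(ratio))
    return {c: math.exp(sum(v) / len(v)) for c, v in logs.items()}


def pair_mrna_and_protein(mrna_ratios, probe_to_cluster,
                          protein_ratios, protein_to_cluster):
    """Clusters measured at both levels -> (mRNA ratio, protein ratio)."""
    mrna = gene_level_ratios(mrna_ratios, probe_to_cluster)
    protein = gene_level_ratios(protein_ratios, protein_to_cluster)
    shared = sorted(mrna.keys() & protein.keys())
    return {c: (mrna[c], protein[c]) for c in shared}


# Placeholder identifiers and ratios; probe_3 has no cluster assignment.
paired = pair_mrna_and_protein(
    mrna_ratios={"probe_1": 2.0, "probe_2": 8.0, "probe_3": 1.0},
    probe_to_cluster={"probe_1": "cluster_X", "probe_2": "cluster_X"},
    protein_ratios={"protein_1": 3.0, "protein_2": 1.5},
    protein_to_cluster={"protein_1": "cluster_X", "protein_2": "cluster_Y"},
)
```

      Only clusters observed at both the mRNA and protein level survive the join, and any isoform-specific differences within a cluster are averaged away, which is the loss of resolution described above.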

      INTEGRATION OF MULTIPLE SHOTGUN PROTEOMIC DATASETS AND GENE-CENTERED DATA INTERPRETATION

      The discussion so far has been limited to the analysis and interpretation of the data generated in a “single” experiment where all MS/MS data are acquired on a particular biological sample of interest. However, due to technical limitations of current proteomic technologies, in any given large scale proteomic experiment only a subset of the entire proteome is identified. In repeated analysis of the same type, the cumulative number of identified peptides and proteins quickly reaches a saturation point. A more comprehensive characterization of the entire proteome can be achieved by combining the data from multiple diverse experiments (different tissues or cell types, enrichment schemes, etc.) (
      • Mann M.
      • Pandey A.
      Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases.
      ,
      • Desiere F.
      • Deutsch E.W.
      • Nesvizhskii A.I.
      • Mallick P.
      • King N.L.
      • Eng J.K.
      • Aderem A.
      • Boyle R.
      • Brunner E.
      • Donohoe S.
      • Fausto N.
      • Hafen E.
      • Hood L.
      • Katze M.G.
      • Kennedy K.A.
      • Kregenow F.
      • Lee H.
      • Lin B.
      • Martin D.
      • Ranish J.A.
      • Rawlings D.J.
      • Samelson L.E.
      • Shiio Y.
      • Watts J.D.
      • Wollscheid B.
      • Wright M.E.
      • Yan W.
      • Yang L.
      • Yi E.C.
      • Zhang H.
      • Aebersold R.
      Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry.
      ,
      • McGowan S.J.
      • Terrett J.
      • Brown C.G.
      • Adam P.J.
      • Aldridge L.
      • Allen J.C.
      • Amess B.
      • Andrews K.A.
      • Barnes M.
      • Barnwell D.E.
      • Berry J.
      • Bird H.
      • Boyd R.S.
      • Broughton M.J.
      • Brown A.
      • Bruce J.A.
      • Brusten L.C.M.
      • Draper N.J.
      • Elsmore B.M.
      • Freeman C.D.
      • Giles D.M.
      • Gong H.
      • Gormley D.
      • Griffiths M.R.
      • Hawkes T.D.R.
      • Haynes P.S.
      • Heesom K.J.
      • Herath A.
      • Hollis K.
      • Hudsen L.J.
      • Inman J.
      • Jacobs M.
      • Jarman D.
      • Kibria I.
      • Kilgour J.J.
      • Kinuthia S.K.
      • Lane K.E.
      • Lees M.L.
      • Loader J.
      • Longmore A.
      • McEwan M.
      • Middleton A.
      • Moore S.
      • Murray C.
      • Murray H.M.
      • Myatt C.P.
      • Ng S.S.
      • O’Neil A.
      • Parekh R.B.
      • Patel A.
      • Patel K.B.
      • Patel S.
      • Patel T.P.
      • Philp R.J.
      • Platt A.E.
      • Poyser H.
      • Prendergast C.
      • Prime S.
      • Redpath N.
      • Reeves M.
      • Robinson A.W.
      • Rohlff C.
      • Rosenbaum J.M.
      • Schenker M.
      • Scrivener E.
      • Shipston N.
      • Siddiq S.
      • Southan C.
      • Spencer D.I.R.
      • Stamps A.
      • Steffens M.A.
      • Stevenson D.
      • Sweetman G.M.A.
      • Taylor S.
      • Townsend R.
      • Ventom A.M.
      • Waller M.N.H.
      • Weresch C.
      • Williams A.M.
      • Woolliscroft R.J.
      • Yu X.
      • Lyall A.
      Annotation of the human genome by high-throughput sequence analysis of naturally occurring proteins.
      ,
      • Rohlff C.
      New approaches towards integrated proteomic databases and depositories.
      ). Furthermore performing secondary, centralized analysis of the datasets previously analyzed and published by individual laboratories can uncover interesting global trends not apparent in the analysis of any single dataset alone (Fig. 12).
      Fig. 12. Submission of mass spectrometry data to public repositories allows extraction of additional valuable information that otherwise would be missed in the analysis of a single experiment by an individual laboratory. More comprehensive characterization of the entire proteome can be achieved by combining the data from multiple diverse experiments (different tissues or cell types, enrichment schemes, etc.). In one such example, the PeptideAtlas project, MS/MS datasets from different laboratories are processed using the same high throughput pipeline. Identified peptides are mapped to the genome via the Ensembl gene index. Peptide sequences along with the chromosomal locations, sample annotation, and other information are stored in a relational database. The data can be visualized in the Ensembl genome browser, and the database itself can be mined to study global trends of protein expression. Peptide identification data, if communicated back to the database developers and annotators, can also be used to improve the quality of the protein sequence databases. Reanalysis of high quality MS/MS spectra that are left unassigned in a typical database search against a protein sequence database can lead to the identification of new open reading frames, novel splice forms, and sequence polymorphisms.
      The task of combining and comparing multiple large scale datasets generated using different biological samples (e.g. different cell states or tissues) requires the development of new approaches and computational tools. Due to the peptide-centric nature of shotgun proteomics, diverse datasets (from the same organism) can be best combined at the peptide level by linking the sequences of the identified peptides to a common gene index. One such approach, based on the mapping of peptides observed in a large group of proteomic experiments to the Ensembl genome, has been described recently (
      • Desiere F.
      • Deutsch E.W.
      • Nesvizhskii A.I.
      • Mallick P.
      • King N.L.
      • Eng J.K.
      • Aderem A.
      • Boyle R.
      • Brunner E.
      • Donohoe S.
      • Fausto N.
      • Hafen E.
      • Hood L.
      • Katze M.G.
      • Kennedy K.A.
      • Kregenow F.
      • Lee H.
      • Lin B.
      • Martin D.
      • Ranish J.A.
      • Rawlings D.J.
      • Samelson L.E.
      • Shiio Y.
      • Watts J.D.
      • Wollscheid B.
      • Wright M.E.
      • Yan W.
      • Yang L.
      • Yi E.C.
      • Zhang H.
      • Aebersold R.
      Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry.
      ) and implemented in a public resource, PeptideAtlas (www.peptideatlas.org). In this approach, peptide identifications passing a certain probability threshold are matched to proteins in the Ensembl database. The chromosomal coordinates, or multiple sets of coordinates in the case of peptides matching to more than one gene, and Ensembl protein accession numbers are retrieved for all matched peptides. The results are stored in a relational database and can be visualized using the Ensembl genome browser Distributed Annotation System (DAS) (
      • Dowell R.D.
      • Jokerst R.M.
      • Day A.
      • Eddy S.R.
      • Stein L.
      The distributed annotation system.
      ).
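      The mapping step described above can be illustrated with a minimal sketch. It is not the PeptideAtlas implementation: the protein index, accessions, gene identifiers, sequences, and the 0.9 probability threshold are all hypothetical, and the retrieval of chromosomal coordinates is omitted. The sketch only shows the logic of filtering peptide identifications by probability, matching them to database entries, and separating peptides that point to a single gene from those shared between genes.

```python
from collections import defaultdict

# Hypothetical toy protein index: accession -> (gene id, protein sequence).
# A real index would be loaded from a sequence database release.
PROTEIN_INDEX = {
    "PROT_A1": ("GENE_A", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    "PROT_A2": ("GENE_A", "MKTAYIAKQRQISFVKSHFSRQ"),
    "PROT_B1": ("GENE_B", "MSLEERLGLIEVQAPILSR"),
}


def map_peptides_to_genes(identifications, min_probability=0.9):
    """Map confident peptide identifications to the genes that can explain them.

    identifications: iterable of (peptide_sequence, probability) pairs.
    Returns a dict: peptide -> sorted list of gene ids.
    Peptides that match no entry in the index are left out of the result.
    """
    peptide_to_genes = defaultdict(set)
    for peptide, probability in identifications:
        if probability < min_probability:
            continue  # below the (assumed) confidence threshold
        for gene_id, sequence in PROTEIN_INDEX.values():
            if peptide in sequence:
                peptide_to_genes[peptide].add(gene_id)
    return {pep: sorted(genes) for pep, genes in peptide_to_genes.items()}


def split_unique_and_shared(peptide_to_genes):
    """Separate peptides mapping to one gene from those mapping to several."""
    unique = {p: g[0] for p, g in peptide_to_genes.items() if len(g) == 1}
    shared = {p: g for p, g in peptide_to_genes.items() if len(g) > 1}
    return unique, shared


identifications = [
    ("QISFVK", 0.99),   # present in both isoforms of GENE_A
    ("LGLIEVQ", 0.95),  # present in products of GENE_A and GENE_B
    ("SHFSR", 0.50),    # low probability, discarded
    ("APILSR", 0.97),   # present only in the GENE_B product
]
mapping = map_peptides_to_genes(identifications)
unique, shared = split_unique_and_shared(mapping)
```

      Note that a peptide shared between two isoforms of the same gene still counts as unique at the gene level, which is what makes the gene index a convenient common denominator for combining datasets.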
      By connecting the sequences of the identified peptides with the genome, PeptideAtlas allows gene-centered interpretation of the results of shotgun proteomic experiments in line with previous suggestions (
      • Rappsilber J.
      • Mann M.
      What does it mean to identify a protein in proteomics?.
      ). Most of the identified peptides have a unique association with the genome, and in those cases it is possible to state with certainty that a product of a certain gene has been identified. Some peptides map to several different locations on the genome due to the presence of gene paralogues, repeated protein domains, or simple sequence redundancy. The PeptideAtlas database and other emerging repositories of this kind (
      • Craig R.
      • Cortens J.P.
      • Beavis R.C.
      Open source system for analyzing, validating, and storing protein identification data.
      ), should be useful for validation and improved annotation of the human genome by complementing other types of data currently used for that purpose, such as mRNA and EST data, with large scale proteomic data.
      In turn, peptide identification data can be used to improve the quality of the protein sequence databases by making them more complete and accurate. For example, the identification of a peptide from a certain protein by searching MS/MS spectra against a protein sequence database would ensure that the sequence of that protein does not disappear from the database in the future (a situation that is not uncommon at present). The use of mass spectrometry data can be further extended to go beyond genome validation (confirming the proteins already present in the current sequence databases) to the discovery of novel gene products and variants. For example, high quality MS/MS spectra that are left unassigned when searched against protein sequence databases such as IPI could be reanalyzed more comprehensively by searching genomic databases with the aim of discovering open reading frames missed by the current gene prediction programs, novel splice forms, or sequence polymorphisms.
      The information stored in PeptideAtlas, which includes experimental conditions and the type of cell or tissue analyzed, could also be used to statistically explore global trends of differential protein expression. Similar to the method of measuring gene expression using the number of corresponding expressed sequence tags in EST databases (
      • Skrabanek L.
      • Campagne F.
      TissueInfo: high-throughput identification of tissue expression profiles and specificity.
      ,
      • Mu X.
      • Zhao S.
      • Pershad R.
      • Hsieh T.F.
      • Scarpa A.
      • Wang S.W.
      • White R.A.
      • Beremand P.D.
      • Thomas T.L.
      • Gan L.
      • Klein W.H.
      Gene expression in the developing mouse retina by EST sequencing and microarray analysis.
      ,
      • Yeo G.
      • Holste D.
      • Kreiman G.
      • Burge C.B.
      Variation in alternative splicing across human tissues.
      ), the correlation between splice forms and disease states or tissue types can potentially be investigated at the level of proteins using, e.g., MS/MS spectrum counts as a rough measure of protein abundance. The same information can also be used to study the correlation between the physicochemical properties of peptides and the likelihood of their being detected by a mass spectrometer (
      • Kuster B.
      • Schirle M.
      • Mallick P.
      • Aebersold R.
      ). It can be anticipated that computational tools (e.g. MS/MS database search tools or tools for assembling peptides into proteins) will eventually not treat all peptides equally but will use a weighting scheme to account for the probability of detecting a peptide. This information will also be useful for selecting synthetic peptides for absolute protein quantification using mass spectrometry or peptide arrays. The ability to perform these different analyses could make a significant contribution to our understanding of complex biological systems, thus enhancing the overall value of shotgun proteomics.
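      As a rough illustration of using MS/MS spectrum counts as an abundance measure, the following sketch tallies confident spectra per sample and gene and normalizes them by the total count for that sample. The record layout, sample and gene names, and the probability threshold are assumptions made for the example. The normalization is deliberately naive: it ignores protein length and peptide detectability, which a real analysis would need to consider.

```python
from collections import Counter


def spectral_counts(psms, min_probability=0.9):
    """Count confident MS/MS spectra per (sample, gene).

    psms: iterable of (sample, gene, probability) records, one per spectrum.
    """
    counts = Counter()
    for sample, gene, probability in psms:
        if probability >= min_probability:
            counts[(sample, gene)] += 1
    return counts


def relative_abundance(counts, sample):
    """Fraction of a sample's confident spectra assigned to each gene."""
    total = sum(n for (s, _), n in counts.items() if s == sample)
    if total == 0:
        return {}
    return {g: n / total for (s, g), n in counts.items() if s == sample}


# Hypothetical spectrum-level records from two tissues.
psms = [
    ("liver", "GENE_A", 0.99),
    ("liver", "GENE_A", 0.95),
    ("liver", "GENE_B", 0.97),
    ("liver", "GENE_B", 0.40),  # discarded: below threshold
    ("brain", "GENE_A", 0.92),
]
counts = spectral_counts(psms)
liver = relative_abundance(counts, "liver")
brain = relative_abundance(counts, "brain")
```

      Comparing such fractions across tissues or disease states gives only a coarse picture, consistent with the description of spectrum counts as a rough measure.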

      CONCLUDING REMARKS

      Shotgun proteomic technology has matured to a point where it can be used for routine identification and, when coupled with stable isotope labeling, accurate relative quantification of thousands of peptides in a single experiment. A significant effort has been made in recent years to improve various aspects of the technology, including extensive work on developing computational tools for identifying peptides from MS/MS spectra. It has also been recognized that the analysis of large scale shotgun proteomic datasets requires the application of transparent and tested statistical tools to estimate the confidence measures of peptide and protein identifications and to estimate false identification error rates in the published data. At the same time, significant inconsistencies still exist in how the information derived at the peptide level can be used to draw conclusions regarding the identities and quantities of the sample proteins and how the resulting protein identifications are interpreted in a biological context and published in the literature.
      The peptide-centric nature of shotgun proteomics becomes apparent in the analysis of data acquired on higher eukaryote organisms where a significant fraction of identified peptides can be assigned to more than one entry in the protein sequence database. In the best case scenario, seldom observed in shotgun proteomics, the sequences of the identified peptides would allow fairly complete characterization of the corresponding mature protein form expressed in the sample. This necessarily requires very high protein sequence coverage, including identification of the N-terminal peptide, and determination of the type and location of any post-translational modification. However, the identification of N-terminal peptides is not always possible (especially without specific enrichment for those peptides), and identification of post-translational modifications or sequence polymorphisms is also difficult. Thus, more often, observed peptide data allow identification of a certain protein but not accurate characterization of its mature form. In many cases, the sequences of the identified peptides would not be sufficient to allow differentiation between two or more splice forms of a particular gene. Furthermore, in some cases it would only be possible to state with certainty that one or more members of a particular protein family are identified but not to single out any of them. The examples and discussion presented in this article should assist in the data interpretation process by providing general guidelines and a nomenclature for describing all these various protein identification scenarios. It is also hoped that this discussion will contribute to the development of more formal guidelines for publishing protein identification datasets obtained using the shotgun proteomic strategy in the literature. Furthermore, efforts are currently underway to develop common standards and schema for the representation, interchange, and storage of the results of proteomic experiments (
      • Taylor C.F.
      • Paton N.W.
      • Garwood K.L.
      • Kirby P.D.
      • Stead D.A.
      • Yin Z.
      • Deutsch E.W.
      • Selway L.
      • Walker J.
      • Riba-Garcia I.
      • Mohammed S.
      • Deery M.J.
      • Howard J.A.
      • Dunkley T.
      • Aebersold R.
      • Kell D.B.
      • Lilley K.S.
      • Roepstorff P.
      • Yates III, J.R.
      • Brass A.
      • Brown A.J.
      • Cash P.
      • Gaskell S.J.
      • Hubbard S.J.
      • Oliver S.G.
      A systematic approach to modeling, capturing, and disseminating proteomics experimental data.
      ,
      • Pedrioli P.G.
      • Eng J.K.
      • Hubley R.
      • Vogelzang M.
      • Deutsch E.W.
      • Raught B.
      • Pratt B.
      • Nilsson E.
      • Angeletti R.H.
      • Apweiler R.
      • Cheung K.
      • Costello C.E.
      • Hermjakob H.
      • Huang S.
      • Julian R.K.
      • Kapp E.
      • McComb M.E.
      • Oliver S.G.
      • Omenn G.
      • Paton N.W.
      • Simpson R.
      • Smith R.
      • Taylor C.F.
      • Zhu W.
      • Aebersold R.
      A common open representation of mass spectrometry data and its application to proteomics research.
      ,
      • Orchard S.
      • Zhu W.
      • Julian R.K.
      • Hermjakob H.
      • Apweiler R.
      Further advances in the development of a data interchange standard for proteomics data.
      ). The issues discussed here should be taken into consideration in developing such standards.
      Understanding of these data interpretation difficulties is helpful for deciding upon what experimental strategy is most appropriate given the aims of each particular experiment as well as for the development of new experimental and computational approaches. Higher protein coverage, which leads to improved ability to differentiate between protein isoforms and to identify sites of post-translational modifications, in some cases can be achieved by using multiple digestion enzymes (
      • MacCoss M.J.
      • McDonald W.H.
      • Saraf A.
      • Sadygov R.
      • Clark J.M.
      • Tasto J.J.
      • Gould K.L.
      • Wolters D.
      • Washburn M.
      • Weiss A.
      • Clark J.I.
      • Yates J.R.
      Shotgun identification of protein modifications from protein complexes and lens tissue.
      ,
      • Choudhary G.
      • Wu S.L.
      • Shieh P.
      • Hancock W.S.
      Multiple enzymatic digestion for enhanced sequence coverage of proteins in complex proteomic mixtures using capillary LC with ion trap MS/MS.
      ). Another strategy is based on generation of synthetic peptides that are unique to a particular isoform of interest (
      • Aebersold R.
      Constellations in a cellular universe.
      ,
      • Kuster B.
      • Schirle M.
      • Mallick P.
      • Aebersold R.
      ,
      • Pan S.
      • Zhang H.
      • Rush J.
      • Eng J.
      • Zhang N.
      • Patterson D.
      • Comb M.J.
      • Aebersold R.
      High throughput proteome-screening for biomarker detection. (2005).
      ). Such peptides can be selected (using computational approaches maximizing the likelihood of those peptides being detected by a mass spectrometer), synthesized, isotopically labeled, and spiked into an isotopically labeled biological sample. This should allow selective sequencing and quantification of peptides that are unique to the proteins of interest. It has also been suggested that peptide sequencing can be performed in two stages. After peptides are identified from the MS/MS spectra acquired at the first stage, indistinguishable sequence database entries are aligned, and peptides that discriminate between different isoforms are predicted from unique stretches. The second stage of the MS/MS sequencing process would then be directed toward the analysis of those predicted distinct peptides (
      • Rappsilber J.
      • Mann M.
      What does it mean to identify a protein in proteomics?.
      ).
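      The prediction of isoform-discriminating peptides for a directed second stage can be sketched as follows. This is a simplified stand-in for the alignment-based procedure described above: instead of aligning database entries, it digests each candidate sequence in silico and keeps the peptides that occur in only one digest. The cleavage rule (after K or R, not before P, no missed cleavages), the minimum peptide length, and the isoform sequences are assumptions chosen for illustration.

```python
def tryptic_peptides(sequence, min_length=5):
    """In silico digestion: cleave after K or R unless the next residue is P.

    Simplified rule with no missed cleavages; peptides shorter than
    min_length are dropped because they are rarely informative.
    """
    peptides, start = set(), 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            if i + 1 - start >= min_length:
                peptides.add(sequence[start:i + 1])
            start = i + 1
    if len(sequence) - start >= min_length:
        peptides.add(sequence[start:])  # C-terminal peptide
    return peptides


def discriminating_peptides(isoforms, min_length=5):
    """For each isoform, return predicted peptides found in no other isoform."""
    digests = {name: tryptic_peptides(seq, min_length)
               for name, seq in isoforms.items()}
    result = {}
    for name, peptides in digests.items():
        others = set().union(*(p for n, p in digests.items() if n != name))
        result[name] = peptides - others
    return result


# Hypothetical isoforms sharing an N-terminal region but differing at the C terminus.
isoforms = {
    "iso1": "MAGTLKSEQVDRLLNPEATK",
    "iso2": "MAGTLKSEQVDRWWCHFGYK",
}
targets = discriminating_peptides(isoforms)
```

      The resulting peptide lists could then serve as an inclusion list for the second acquisition stage, or as candidates for synthetic isoform-specific peptides, subject to their expected detectability.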
      The shotgun proteomic approach alone does not appear to be sufficient to comprehensively and unambiguously characterize the proteome. Complementary to shotgun proteomics (often called the “bottom-up” proteomic approach), the “top-down” proteomic approaches that deal with intact proteins (or involve extensive protein separation prior to digestion) offer certain advantages with regard to the discrimination between protein isoforms and characterization of post-translational modifications. The most established of these methods are based on 2D gels (
      • Gorg A.
      • Weiss W.
      • Dunn M.J.
      Current two-dimensional electrophoresis technology for proteomics.
      ,
      • Pedersen S.K.
      • Harry J.L.
      • Sebastian L.
      • Baker J.
      • Traini M.D.
      • McCarthy J.T.
      • Manoharan A.
      • Wilkins M.R.
      • Gooley A.A.
      • Righetti P.G.
      • Packer N.H.
      • Williams K.L.
      • Herbert B.R.
      Unseen proteome: mining below the tip of the iceberg to find low abundance and membrane proteins.
      ,
      • Fung K.Y.
      • Glode L.M.
      • Green S.
      • Duncan M.W.
      A comprehensive characterization of the peptide and protein constituents of human seminal fluid.
      ,
      • Godovac-Zimmermann J.
      • Kleiner O.
      • Brown L.L.
      • Drukier A.L.
      Perspectives in splicing up proteomics with splicing.
      ). However, gel-based methods have known limitations such as low detection sensitivity, bias toward high abundance proteins, and difficulty in resolving internal membrane or basic proteins. Non-gel-based multidimensional protein separation methods are being developed and can circumvent some of these limitations (
      • Wall D.B.
      • Kachman M.T.
      • Gong S.S.
      • Parus S.J.
      • Long M.W.
      • Lubman D.M.
      Isoelectric focusing nonporous silica reversed-phase high-performance liquid chromatography/electrospray ionization time-of-flight mass spectrometry: a three-dimensional liquid-phase protein separation method as applied to the human erythroleukemia cell-line.
      ,
      • Liu H.
      • Berger S.J.
      • Chakraborty A.B.
      • Plumb R.S.
      • Cohen S.A.
      Multidimensional chromatography coupled to electrospray ionization time-of-flight mass spectrometry as an alternative to two-dimensional gels for the identification and analysis of complex mixtures of intact proteins.
      ,
      • Wienkoop S.
      • Glinski M.
      • Tanaka N.
      • Tolstikov V.
      • Fiehn O.
      • Weckwerth W.
      Linking protein fractionation with multidimensional monolithic reversed-phase peptide chromatography/mass spectrometry enhances protein identification from complex mixtures even in the presence of abundant proteins.
      ,
      • Moritz R.L.
      • Ji H.
      • Schutz F.
      • Connolly L.M.
      • Kapp E.A.
      • Speed T.P.
      • Simpson R.J.
      A proteome strategy for fractionating proteins and peptides using continuous free-flow electrophoresis coupled off-line to reversed-phase high-performance liquid chromatography.
      ). Another promising top-down protein characterization technique is based on MS/MS sequencing of intact proteins (
      • Reid G.E.
      • McLuckey S.A.
      “Top down” protein characterization via tandem mass spectrometry.
      ,
      • Meng F.
      • Forbes A.J.
      • Miller L.M.
      • Kelleher N.L.
      Detection and localization of protein modifications by high resolution tandem mass spectrometry.
      ,
      • Lee S.W.
      • Berger S.J.
      • Martinovic S.
      • Pasa-Tolic L.
      • Anderson G.A.
      • Shen Y.
      • Zhao R.
      • Smith R.D.
      Direct mass spectrometric analysis of intact proteins of the yeast large ribosomal subunit using capillary LC/FTICR.
      ). Although still not at the level of automation and data throughput currently achievable in shotgun proteomics, this technology has experienced significant advances in the last few years. An attractive approach is to integrate the measurements performed on the same systems both at the level of peptides and intact proteins (
      • Wall D.B.
      • Kachman M.T.
      • Gong S.S.
      • Parus S.J.
      • Long M.W.
      • Lubman D.M.
      Isoelectric focusing nonporous silica reversed-phase high-performance liquid chromatography/electrospray ionization time-of-flight mass spectrometry: a three-dimensional liquid-phase protein separation method as applied to the human erythroleukemia cell-line.
      ,
      • Liu H.
      • Berger S.J.
      • Chakraborty A.B.
      • Plumb R.S.
      • Cohen S.A.
      Multidimensional chromatography coupled to electrospray ionization time-of-flight mass spectrometry as an alternative to two-dimensional gels for the identification and analysis of complex mixtures of intact proteins.
      ,
      • VerBerkmoes N.C.
      • Bundy J.L.
      • Hauser L.
      • Asano K.G.
      • Razumovskaya J.
      • Larimer F.
      • Hettich R.L.
      • Stephenson Jr., J.L.
      Integrating “top-down” and “bottom-up” mass spectrometric approaches for proteomic analysis of Shewanella oneidensis.
      ,
      • Strader M.B.
      • VerBerkmoes N.C.
      • Tabb D.L.
      • Connelly H.M.
      • Barton J.W.
      • Bruce B.D.
      • Pelletier D.A.
      • Davison B.H.
      • Hettich R.L.
      • Larimer F.W.
      • Hurst G.B.
      Characterization of the 70S ribosome from Rhodopseudomonas palustris using an integrated “top-down” and “bottom-up” mass spectrometric approach.
      ,
      • Nemeth-Cawley J.F.
      • Tangarone B.S.
      • Rouse J.C.
      “Top down” characterization is a complementary technique to peptide sequencing for identifying protein species in complex mixtures.
      ,
      • Wang H.
      • Kachman M.T.
      • Schwartz D.R.
      • Cho K.R.
      • Lubman D.M.
      Comprehensive proteome analysis of ovarian cancers using liquid phase separation, mass mapping and tandem mass spectrometry: a strategy for identification of candidate cancer biomarkers.
      ). In this method, the molecular weights of intact proteins are measured using high mass accuracy instruments such as ESI-FTMS or ESI-TOF. In parallel, proteins are digested, and peptides are sequenced using a typical shotgun proteomics set-up and/or using a peptide fingerprinting method. The advantage of this approach (in the context of the protein inference problem) is that the process of assembling peptides into proteins and discrimination between protein isoforms can be assisted by the knowledge of the molecular weights of the sample proteins. Other proposed approaches include generation of isoform-specific affinity ligands such as antibodies or peptides for selective targeting of proteins of interest (
      • Humphery-Smith I.
      A human proteome project with a beginning and an end.
      ,
      • Uhlen M.
      • Ponten F.
      Antibody-based proteomics for human tissue profiling.
      ). Additional insights can be obtained by integrating measurements performed on the same biological systems but at different levels, e.g. proteomic and transcriptional measurements. The knowledge available from microarray experiments regarding the presence or absence of a certain mRNA transcript can assist in the process of assigning peptides to the corresponding protein isoforms observed in proteomic experiments.
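      The idea of using a measured intact protein mass to help discriminate between candidate isoforms can be sketched in a few lines. The example computes the theoretical average mass of each unmodified candidate sequence from standard average residue masses and keeps the candidates that agree with the measured mass within a tolerance. The candidate sequences and the 100 ppm tolerance are hypothetical, and the sketch ignores post-translational modifications, N-terminal processing, and other events that shift the mass of a mature protein.

```python
# Standard average residue masses (Da) of the 20 common amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # average mass of H2O added for the free termini


def average_mass(sequence):
    """Theoretical average mass of an unmodified polypeptide chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER


def isoforms_consistent_with_mass(candidates, measured_mass, tolerance_ppm=100.0):
    """Return names of candidate isoforms whose mass matches the measurement.

    tolerance_ppm is an assumed instrument-dependent parameter.
    """
    matches = []
    for name, sequence in candidates.items():
        theoretical = average_mass(sequence)
        error_ppm = abs(measured_mass - theoretical) / theoretical * 1e6
        if error_ppm <= tolerance_ppm:
            matches.append(name)
    return sorted(matches)


# Hypothetical candidates that share peptides but differ in total mass.
candidates = {
    "iso1": "MAGTLKSEQVDRLLNPEATK",
    "iso2": "MAGTLKSEQVDRWWCHFGYK",
}
measured = average_mass(candidates["iso1"]) + 0.01  # simulated measurement
consistent = isoforms_consistent_with_mass(candidates, measured)
```

      In this toy case, peptides from the shared region alone could not separate the two isoforms, whereas the intact mass is compatible with only one of them.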
      It has been stressed already that shotgun proteomic datasets should be analyzed using transparent computational tools that are well documented and made generally available to the scientific community. Even when this is the case, however, publication of long lists of protein and peptide identifications by itself has only a limited value. As databases are updated (new protein sequences are added, some sequences are removed, and annotations or accession schemes change), those lists become obsolete and can no longer be easily interpreted or correlated with other data. Thus, authors should be encouraged to provide access to all raw data, or at least to MS/MS spectra, as a part of the publication. This would allow re-evaluation of the primary MS data using the most up-to-date protein sequence databases. Coupled with the development of open MS data formats (
      • Pedrioli P.G.
      • Eng J.K.
      • Hubley R.
      • Vogelzang M.
      • Deutsch E.W.
      • Raught B.
      • Pratt B.
      • Nilsson E.
      • Angeletti R.H.
      • Apweiler R.
      • Cheung K.
      • Costello C.E.
      • Hermjakob H.
      • Huang S.
      • Julian R.K.
      • Kapp E.
      • McComb M.E.
      • Oliver S.G.
      • Omenn G.
      • Paton N.W.
      • Simpson R.
      • Smith R.
      • Taylor C.F.
      • Zhu W.
      • Aebersold R.
      A common open representation of mass spectrometry data and its application to proteomics research.
      ,
      • Orchard S.
      • Zhu W.
      • Julian R.K.
      • Hermjakob H.
      • Apweiler R.
      Further advances in the development of a data interchange standard for proteomics data.
      ), centralized data repositories (
      • Martens L.
      • Hermjakob H.
      • Jones P.
      • Adamski M.
      • Taylor C.F.
      • States D.
      • Gevaert K.
      • Vandekerckhove J.
      • Apweiler R.
      PRIDE: the proteomics identifications database.
      ,
      • Desiere F.
      • Deutsch E.W.
      • Nesvizhskii A.I.
      • Mallick P.
      • King N.L.
      • Eng J.K.
      • Aderem A.
      • Boyle R.
      • Brunner E.
      • Donohoe S.
      • Fausto N.
      • Hafen E.
      • Hood L.
      • Katze M.G.
      • Kennedy K.A.
      • Kregenow F.
      • Lee H.
      • Lin B.
      • Martin D.
      • Ranish J.A.
      • Rawlings D.J.
      • Samelson L.E.
      • Shiio Y.
      • Watts J.D.
      • Wollscheid B.
      • Wright M.E.
      • Yan W.
      • Yang L.
      • Yi E.C.
      • Zhang H.
      • Aebersold R.
      Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry.
      ,
      • Craig R.
      • Cortens J.P.
      • Beavis R.C.
      Open source system for analyzing, validating, and storing protein identification data.
      ,
      • Prince J.T.
      • Carlson M.W.
      • Wang R.
      • Lu P.
      • Marcotte E.M.
      The need for a public proteomics repository.
      ), and infrastructure for processing and integrating datasets from different experiments (
      • Martens L.
      • Hermjakob H.
      • Jones P.
      • Adamski M.
      • Taylor C.F.
      • States D.
      • Gevaert K.
      • Vandekerckhove J.
      • Apweiler R.
      PRIDE: the proteomics identifications database.
      ,
      • Desiere F.
      • Deutsch E.W.
      • Nesvizhskii A.I.
      • Mallick P.
      • King N.L.
      • Eng J.K.
      • Aderem A.
      • Boyle R.
      • Brunner E.
      • Donohoe S.
      • Fausto N.
      • Hafen E.
      • Hood L.
      • Katze M.G.
      • Kennedy K.A.
      • Kregenow F.
      • Lee H.
      • Lin B.
      • Martin D.
      • Ranish J.A.
      • Rawlings D.J.
      • Samelson L.E.
      • Shiio Y.
      • Watts J.D.
      • Wollscheid B.
      • Wright M.E.
      • Yan W.
      • Yang L.
      • Yi E.C.
      • Zhang H.
      • Aebersold R.
      Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry.
      ,
      • Craig R.
      • Cortens J.P.
      • Beavis R.C.
      Open source system for analyzing, validating, and storing protein identification data.
      ), this would allow new uses of proteomic data such as validation of genes that are expressed at the protein level or elucidation of global protein expression patterns that would otherwise be missed in an analysis of a single experiment. Finally, if communicated back to the database developers and annotators, MS-derived proteomic data could become a useful resource in the process of annotating the genomes of the corresponding organisms.

      Acknowledgments

      We acknowledge fruitful discussions with Karl Clauser, Frank Desiere, Eric Deutsch, Jimmy Eng, Anne-Claude Gingras, Andrew Keller, Jeff Kowalak, Xiao-jun Li, Parag Mallick, Jeff Ranish, Katheryn Resing, Julian Watts, and Bernd Wollscheid. We are particularly grateful to Anne-Claude Gingras, Jeff Ranish, and Bernd Wollscheid for reading the manuscript and to Nichole King and James Eddes for help with Table I and Fig. 7.

      REFERENCES

        • Aebersold R.
        • Mann M.
        Mass spectrometry-based proteomics.
        Nature. 2003; 422: 198-207
        • Link A.J.
        • Eng J.
        • Schieltz D.M.
        • Carmack E.
        • Mize G.J.
        • Morris D.R.
        • Garvik B.M.
        • Yates J.R.
        Direct analysis of protein complexes using mass spectrometry.
        Nat. Biotechnol. 1999; 17: 676-682
        • Gygi S.P.
        • Rist B.
        • Gerber S.A.
        • Turecek F.
        • Gelb M.H.
        • Aebersold R.
        Quantitative analysis of complex protein mixtures using isotope-coded affinity tags.
        Nat. Biotechnol. 1999; 17: 994-999
        • Washburn M.P.
        • Wolters D.
        • Yates J.R.
        Large-scale analysis of the yeast proteome by multidimensional protein identification technology.
        Nat. Biotechnol. 2001; 19: 242-247
        • Reid G.E.
        • McLuckey S.A.
        “Top down” protein characterization via tandem mass spectrometry.
        J. Mass Spectrom. 2002; 37: 663-675
        • Meng F.
        • Forbes A.J.
        • Miller L.M.
        • Kelleher N.L.
        Detection and localization of protein modifications by high resolution tandem mass spectrometry.
        Mass Spectrom. Rev. 2005; 24: 126-134
        • Gorg A.
        • Weiss W.
        • Dunn M.J.
        Current two-dimensional electrophoresis technology for proteomics.
        Proteomics. 2004; 4: 3665-3685
        • Patterson S.D.
        Data analysis—the Achilles heel of proteomics.
        Nat. Biotechnol. 2003; 21: 221-222
        • Boguski M.S.
        • McIntosh M.W.
        Biomedical informatics for proteomics.
        Nature. 2003; 422: 233-237
        • Nesvizhskii A.I.
        • Aebersold R.
        Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS.
        Drug Discov. Today. 2004; 9: 173-181
        • Johnson R.S.
        • Davis M.T.
        • Taylor J.A.
        • Patterson S.D.
        Informatics for protein identification by mass spectrometry.
        Methods. 2005; 35: 223-236
        • Russell S.A.
        • Old W.
        • Resing K.A.
        • Hunter L.
        Proteomic informatics.
        Int. Rev. Neurobiol. 2004; 61: 129-157
        • Baldwin M.A.
        Protein identification by mass spectrometry: issues to be considered.
        Mol. Cell. Proteomics. 2004; 3: 1-9
        • Eng J.K.
        • McCormack A.L.
        • Yates J.R.
        An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
        J. Am. Soc. Mass Spectrom. 1994; 5: 976-989
        • Mann M.
        • Wilm M.
        Error-tolerant identification of peptides in sequence databases by peptide sequence tags.
        Anal. Chem. 1994; 66: 4390-4399
        • Perkins D.N.
        • Pappin D.J.
        • Creasy D.M.
        • Cottrell J.S.
        Probability-based protein identification by searching sequence databases using mass spectrometry data.
        Electrophoresis. 1999; 20: 3551-3567
        • Clauser K.R.
        • Baker P.
        • Burlingame A.L.
        Role of accurate mass measurement (±10 ppm) in protein identification strategies employing MS or MS/MS and database searching.
        Anal. Chem. 1999; 71: 2871-2882
        • Field H.I.
        • Fenyo D.
        • Beavis R.C.
        RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimizes protein identification, and archives data in a relational database.
        Proteomics. 2002; 2: 36-47
        • Craig R.
        • Beavis R.C.
        TANDEM: matching proteins with tandem mass spectra.
        Bioinformatics. 2004; 20: 1466-1467
        • Geer L.Y.
        • Markey S.P.
        • Kowalak J.A.
        • Wagner L.
        • Xu M.
        • Maynard D.M.
        • Yang X.
        • Shi W.
        • Bryant S.H.
        Open mass spectrometry search algorithm.
        J. Proteome Res. 2004; 3: 958-964
        • Keller A.
        • Nesvizhskii A.I.
        • Kolker E.
        • Aebersold R.
        Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.
        Anal. Chem. 2002; 74: 5383-5392
        • Nesvizhskii A.I.
        • Keller A.
        • Kolker E.
        • Aebersold R.
        A statistical model for identifying proteins by tandem mass spectrometry.
        Anal. Chem. 2003; 75: 4646-4658
        • Fenyo D.
        • Beavis R.C.
        A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes.
        Anal. Chem. 2003; 75: 768-774
        • Peng J.
        • Elias J.E.
        • Thoreen C.C.
        • Licklider L.J.
        • Gygi S.P.
        Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large scale protein analysis: the yeast proteome.
        J. Proteome Res. 2003; 2: 43-50
        • Sadygov R.G.
        • Yates J.R.
        A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases.
        Anal. Chem. 2003; 75: 3792-3798
        • Rappsilber J.
        • Mann M.
        What does it mean to identify a protein in proteomics?.
        Trends Biochem. Sci. 2002; 27: 74-78
        • Von Haller P.D.
        • Yi E.
        • Donohoe S.
        • Vaughn K.
        • Keller A.
        • Nesvizhskii A.I.
        • Eng J.
        • Li X.J.
        • Goodlett D.R.
        • Aebersold R.
        • Watts J.D.
        The application of new software tools to quantitative protein profiling via ICAT and tandem mass spectrometry: II. Evaluation of tandem mass spectrometry methodologies for large-scale protein analysis and the application of statistical tools for data analysis and interpretation.
        Mol. Cell. Proteomics. 2003; 2: 428-442
        • Resing K.A.
        • Meyer-Arendt K.
        • Mendoza A.M.
        • Aveline-Wolf L.D.
        • Jonscher K.R.
        • Pierce K.G.
        • Old W.M.
        • Cheung H.T.
        • Russell S.
        • Wattawa J.L.
        • Goehle G.R.
        • Knight R.D.
        • Ahn N.G.
        Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics.
        Anal. Chem. 2004; 76: 3556-3568
        • Carr S.
        • Aebersold R.
        • Baldwin M.
        • Burlingame A.
        • Clauser K.
        • Nesvizhskii A.
        The need for guidelines in publication of peptide and protein identification data.
        Mol. Cell. Proteomics. 2004; 3: 531-533
        • Yang X.
        • Dondeti V.
        • Dezube R.
        • Maynard D.M.
        • Geer L.Y.
        • Epstein J.
        • Chen X.
        • Markey S.P.
        • Kowalak J.A.
        DBParser: web-based software for shotgun proteomic data analyses.
        J. Proteome Res. 2004; 3: 1002-1008
        • Pedersen S.K.
        • Harry J.L.
        • Sebastian L.
        • Baker J.
        • Traini M.D.
        • McCarthy J.T.
        • Manoharan A.
        • Wilkins M.R.
        • Gooley A.A.
        • Righetti P.G.
        • Packer N.H.
        • Williams K.L.
        • Herbert B.R.
        Unseen proteome: mining below the tip of the iceberg to find low abundance and membrane proteins.
        J. Proteome Res. 2003; 2: 303-311
        • Fung K.Y.
        • Glode L.M.
        • Green S.
        • Duncan M.W.
        A comprehensive characterization of the peptide and protein constituents of human seminal fluid.
        Prostate. 2004; 61: 171-181
        • Godovac-Zimmermann J.
        • Kleiner O.
        • Brown L.L.
        • Drukier A.L.
        Perspectives in spicing up proteomics with splicing.
        Proteomics. 2005; 5: 699-709
        • Black D.L.
        Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology.
        Cell. 2000; 103: 367-370
        • Delalande F.
        • Carapito C.
        • Brizard J.P.
        • Brigidou C.
        • Dorsselaer A.V.
        Multigenic families and proteomics: Extended protein characterization as a tool for paralog gene identification.
        Proteomics. 2005; 5: 450-460
        • Sam-Yellowe T.Y.
        • Florens L.
        • Johnson J.R.
        • Wang T.
        • Drazba J.A.
        • Le Roch K.G.
        • Zhou Y.
        • Batalov S.
        • Carucci D.J.
        • Winzeler E.A.
        • Yates J.R.
        A Plasmodium gene family encoding Maurer’s cleft membrane proteins: structural properties and expression profiling.
        Genome Res. 2004; 14: 1052-1059
        • Kislinger T.
        • Rahman K.
        • Radulovic D.
        • Cox B.
        • Rossant J.
        • Emili A.
        PRISM, a generic large scale proteomic investigation strategy for mammals.
        Mol. Cell. Proteomics. 2003; 2: 96-106
        • Kristensen D.B.
        • Brond J.C.
        • Nielsen P.A.
        • Andersen J.R.
        • Sorensen O.T.
        • Jorgensen V.
        • Budin K.
        • Matthiesen J.
        • Veno P.
        • Jespersen H.M.
        • Ahrens C.H.
        • Schandorff S.
        • Ruhoff P.T.
        • Wisniewski J.R.
        • Bennett K.L.
        • Podtelejnikov A.V.
        Experimental Peptide Identification Repository (EPIR): an integrated peptide-centric platform for validation and mining of tandem mass spectrometry data.
        Mol. Cell. Proteomics. 2004; 3: 1023-1038
        • Tabb D.L.
        • McDonald W.H.
        • Yates J.R.
        DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics.
        J. Proteome Res. 2002; 1: 21-26
        • Allet N.
        • Barrillat N.
        • Baussant T.
        • Boiteau C.
        • Botti P.
        • Bougueleret L.
        • Budin N.
        • Canet D.
        • Carraud S.
        • Chiappe D.
        • Christmann N.
        • Colinge J.
        • Cusin I.
        • Dafflon N.
        • Depresle B.
        • Fasso I.
        • Frauchiger P.
        • Gaertner H.
        • Gleizes A.
        • Gonzalez-Couto E.
        • Jeandenans C.
        • Karmime A.
        • Kowall T.
        • Lagache S.
        • Mahe E.
        • Masselot A.
        • Mattou H.
        • Moniatte M.
        • Niknejad A.
        • Paolini M.
        • Perret F.
        • Pinaud N.
        • Ranno F.
        • Raimondi S.
        • Reffas S.
        • Regamey P.O.
        • Rey P.A.
        • Rodriguez-Tome P.
        • Rose K.
        • Rossellat G.
        • Saudrais C.
        • Schmidt C.
        • Villain M.
        • Zwahlen C.
        In vitro and in silico processes to identify differentially expressed proteins.
        Proteomics. 2004; 4: 2333-2351
        • Martens L.
        • Hermjakob H.
        • Jones P.
        • Adamski M.
        • Taylor C.F.
        • States D.
        • Gevaert K.
        • Vandekerckhove J.
        • Apweiler R.
        PRIDE: the proteomics identifications database.
        Proteomics. 2005; (in press)
        • Apweiler R.
        • Bairoch A.
        • Wu C.H.
        Protein sequence databases.
        Curr. Opin. Chem. Biol. 2004; 8: 76-80
        • Wheeler D.L.
        • Church D.M.
        • Edgar R.
        • Federhen S.
        • Helmberg W.
        • Madden T.L.
        • Pontius J.U.
        • Schuler G.D.
        • Schriml L.M.
        • Sequeira E.
        • Suzek T.O.
        • Tatusova T.A.
        • Wagner L.
        Database resources of the National Center for Biotechnology Information: update.
        Nucleic Acids Res. 2004; 32: D35-D40
        • Boeckmann B.
        • Bairoch A.
        • Apweiler R.
        • Blatter M.
        • Estreicher A.
        • Gasteiger E.
        • Martin M.J.
        • Michoud K.
        • O’Donovan C.
        • Phan I.
        • Pilbout S.
        • Schneider M.
        The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.
        Nucleic Acids Res. 2003; 31: 365-370
        • Pruitt K.D.
        • Tatusova T.
        • Maglott D.R.
        NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.
        Nucleic Acids Res. 2005; 33: D501-D504
        • Kersey P.J.
        • Duarte J.
        • Williams A.
        • Karavidopoulou Y.
        • Birney E.
        • Apweiler R.
        The International Protein Index: an integrated database for proteomics experiments.
        Proteomics. 2004; 4: 1985-1988
        • Birney E.
        • Andrews D.
        • Bevan P.
        • Caccamo M.
        • Cameron G.
        • Chen Y.
        • Clarke L.
        • Coates G.
        • Cox T.
        • Cuff J.
        • Curwen V.
        • Cutts T.
        • Down T.
        • Durbin R.
        • Eyras E.
        • Fernandez-Suarez X.M.
        • Gane P.
        • Gibbins B.
        • Gilbert J.
        • Hammond M.
        • Hotz H.
        • Iyer V.
        • Kahari A.
        • Jekosch K.
        • Kasprzyk A.
        • Keefe D.
        • Keenan S.
        • Lehvaslaiho H.
        • McVicker G.
        • Melsopp C.
        • Meidl P.
        • Mongin E.
        • Pettett R.
        • Potter S.
        • Proctor G.
        • Rae M.
        • Searle S.
        • Slater G.
        • Smedley D.
        • Smith J.
        • Spooner W.
        • Stabenau A.
        • Stalker J.
        • Storey R.
        • Ureta-Vidal A.
        • Woodwark C.
        • Clamp M.
        • Hubbard T.
        Ensembl 2004.
        Nucleic Acids Res. 2004; 32: D468-D470
        • Kuster B.
        • Mortensen P.
        • Andersen J.S.
        • Mann M.
        Mass spectrometry allows direct identification of proteins in large genomes.
        Proteomics. 2001; 1: 641-650
        • Choudhary J.S.
        • Blackstock W.P.
        • Creasy D.M.
        • Cottrell J.S.
        Interrogating the human genome using uninterpreted mass spectrometry data.
        Proteomics. 2001; 1: 651-667
        • Mann M.
        • Pandey A.
        Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases.
        Trends Biochem. Sci. 2001; 26: 54-60
        • Desiere F.
        • Deutsch E.W.
        • Nesvizhskii A.I.
        • Mallick P.
        • King N.L.
        • Eng J.K.
        • Aderem A.
        • Boyle R.
        • Brunner E.
        • Donohoe S.
        • Fausto N.
        • Hafen E.
        • Hood L.
        • Katze M.G.
        • Kennedy K.A.
        • Kregenow F.
        • Lee H.
        • Lin B.
        • Martin D.
        • Ranish J.A.
        • Rawlings D.J.
        • Samelson L.E.
        • Shiio Y.
        • Watts J.D.
        • Wollscheid B.
        • Wright M.E.
        • Yan W.
        • Yang L.
        • Yi E.C.
        • Zhang H.
        • Aebersold R.
        Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry.
        Genome Biol. 2005; 6: R5
        • Brandt U.
        • Yu L.
        • Yu C.A.
        • Trumpower B.L.
        The mitochondrial targeting presequence of the Rieske iron-sulfur protein is processed in a single step after insertion into the cytochrome bc1 complex in mammals and retained as a subunit in the complex.
        J. Biol. Chem. 1993; 268: 8387-8390
        • Gevaert K.
        • Goethals M.
        • Martens L.
        • Van Damme J.
        • Staes A.
        • Thomas G.R.
        • Vandekerckhove J.
        Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides.
        Nat. Biotechnol. 2003; 21: 566-569
        • Song H.
        • Hecimovic S.
        • Goate A.
        • Hsu F.F.
        • Bao S.
        • Vidavsky I.
        • Ramanadham S.
        • Turk J.
        Characterization of N-terminal processing of group VIA phospholipase A2 and of potential cleavage sites of amyloid precursor protein constructs by automated identification of signature peptides in LC/MS/MS analyses of proteolytic digests.
        J. Am. Soc. Mass Spectrom. 2004; 15: 1780-1793
        • Zhang H.
        • Yan W.
        • Aebersold R.
        Chemical probes and tandem mass spectrometry: a strategy for the quantitative analysis of proteomes and subproteomes.
        Curr. Opin. Chem. Biol. 2004; 8: 66-75
        • Julka S.
        • Regnier F.
        Quantification in proteomics through stable isotope coding: a review.
        J. Proteome Res. 2004; 3: 350-363
        • Goshe M.B.
        • Smith R.D.
        Stable isotope-coded proteomic mass spectrometry.
        Curr. Opin. Biotechnol. 2003; 14: 101-109
        • Gerber S.A.
        • Rush J.
        • Stemman O.
        • Kirschner M.W.
        • Gygi S.P.
        Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS.
        Proc. Natl. Acad. Sci. U. S. A. 2003; 100: 6940-6945
        • Aebersold R.
        Constellations in a cellular universe.
        Nature. 2003; 422: 115-116
        • Kuster B.
        • Schirle M.
        • Mallick P.
        • Aebersold R.
        Scoring proteomes with proteotypic peptide probes.
        Nat. Rev. Mol. Cell. Biol. 2005; 6: 577-583
        • Schena M.
        Microarray Analysis.
        Wiley-Liss, Hoboken, NJ 2003
        • Han D.K.
        • Eng J.
        • Zhou H.
        • Aebersold R.
        Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry.
        Nat. Biotechnol. 2001; 19: 946-951
        • Li X.J.
        • Zhang H.
        • Ranish J.A.
        • Aebersold R.
        Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry.
        Anal. Chem. 2003; 75: 6648-6657
        • MacCoss M.J.
        • Wu C.C.
        • Liu H.
        • Sadygov R.
        • Yates J.R.
        A correlation algorithm for the automated quantitative analysis of shotgun proteomics data.
        Anal. Chem. 2003; 75: 6912-6921
        • Halligan B.D.
        • Slyper R.Y.
        • Twigger S.N.
        • Hicks W.
        • Olivier M.
        • Greene A.S.
        ZoomQuant: an application for the quantitation of stable isotope labeled peptides.
        J. Am. Soc. Mass Spectrom. 2005; 16: 302-306
        • Ranish J.A.
        • Yi E.C.
        • Leslie D.M.
        • Purvine S.O.
        • Goodlett D.R.
        • Eng J.
        • Aebersold R.
        The study of macromolecular complexes by quantitative proteomics.
        Nat. Genet. 2003; 33: 349-355
        • Foster L.J.
        • De Hoog C.L.
        • Mann M.
        Unbiased quantitative proteomics of lipid rafts reveals high specificity for signaling factors.
        Proc. Natl. Acad. Sci. U. S. A. 2003; 100: 5813-5818
        • Marelli M.
        • Smith J.J.
        • Jung S.
        • Yi E.
        • Nesvizhskii A.I.
        • Christmas R.H.
        • Saleem R.A.
        • Tam Y.Y.C.
        • Faragasanu A.
        • Goodlett D.R.
        • Aebersold R.
        • Rachubinski R.A.
        • Aitchison J.D.
        Quantitative mass spectrometry reveals a role for the GTPase Rho1p in actin organization on the peroxisome membrane.
        J. Cell Biol. 2004; 167: 1099-1112
        • Gingras A.C.
        • Aebersold R.
        • Raught B.
        Advances in protein complex analysis using mass spectrometry.
        J. Physiol. 2005; 563: 11-21
        • Liu H.
        • Sadygov R.G.
        • Yates J.R.
        A model for random sampling and estimation of relative protein abundances in shotgun proteomics.
        Anal. Chem. 2004; 76: 4193-4201
        • Blondeau F.
        • Ritter B.
        • Allaire P.D.
        • Wasiak S.
        • Girard M.
        • Hussain N.K.
        • Angers A.
        • Legendre-Guillemin V.
        • Roy L.
        • Boismenu D.
        • Kearney R.E.
        • Bell A.W.
        • Bergeron J.J.
        • McPherson P.S.
        Tandem MS analysis of brain clathrin-coated vesicles reveals their critical involvement in synaptic vesicle recycling.
        Proc. Natl. Acad. Sci. U. S. A. 2004; 101: 3833-3838
        • Chelius D.
        • Bondarenko P.V.
        Quantitative profiling of proteins in complex mixtures using liquid chromatography and mass spectrometry.
        J. Proteome Res. 2002; 1: 317-323
        • Wang W.
        • Zhou H.
        • Lin H.
        • Roy S.
        • Shaler T.A.
        • Hill L.R.
        • Norton S.
        • Kumar P.
        • Anderle M.
        • Becker C.H.
        Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards.
        Anal. Chem. 2003; 75: 4818-4826
        • Himeda C.L.
        • Ranish J.A.
        • Angello J.C.
        • Maire P.
        • Aebersold R.
        • Hauschka S.D.
        Quantitative proteomics identification of Six4 as the Trex-binding factor in the muscle creatine kinase enhancer.
        Mol. Cell. Biol. 2004; 24: 2132-2143
        • Lee C.
        • Roy M.
        Analysis of alternative splicing with microarrays: successes and challenges.
        Genome Biol. 2004; 5: 231
        • Johnson J.M.
        • Edwards S.
        • Shoemaker D.
        • Schadt E.E.
        Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments.
        Trends Genet. 2005; 21: 93-102
        • Gygi S.P.
        • Rochon Y.
        • Franza B.R.
        • Aebersold R.
        Correlation between protein and mRNA abundance in yeast.
        Mol. Cell. Biol. 1999; 19: 1720-1730
        • Chen G.
        • Gharib T.G.
        • Huang C.C.
        • Taylor J.M.
        • Misek D.E.
        • Kardia S.L.R.
        • Giordano T.J.
        • Iannettoni M.D.
        • Orringer M.B.
        • Hanash S.M.
        • Beer D.G.
        Discordant protein and mRNA expression in lung adenocarcinomas.
        Mol. Cell. Proteomics. 2002; 1: 304-313
        • Griffin T.J.
        • Gygi S.P.
        • Ideker T.
        • Rist B.
        • Eng J.
        • Hood L.
        • Aebersold R.
        Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae.
        Mol. Cell. Proteomics. 2002; 1: 323-333
        • Tian Q.
        • Stepaniants S.
        • Mao M.
        • Weng L.
        • Feetham M.C.
        • Doyle M.J.
        • Yi E.C.
        • Dai H.
        • Thorsson V.
        • Eng J.
        • Goodlett D.
        • Berger J.P.
        • Gunter B.
        • Linsley P.S.
        • Stoughton R.B.
        • Aebersold R.
        • Collins S.J.
        • Hanlon W.A.
        • Hood L.E.
        Integrated genomic and proteomic analyses of gene expression in mammalian cells.
        Mol. Cell. Proteomics. 2004; 3: 960-969
        • McRedmond J.P.
        • Park S.D.
        • Reilly D.F.
        • Coppinger J.A.
        • Maguire P.B.
        • Shields D.C.
        • Fitzgerald D.J.
        Integration of proteomics and genomics in platelets: a profile of platelet proteins and platelet-specific genes.
        Mol. Cell. Proteomics. 2004; 3: 133-144
        • Maziarz M.
        • Chung C.
        • Drucker D.J.
        • Emili A.
        Integrating global proteomic and genomic expression profiles generated from islet α cells: opportunities and challenges to deriving reliable biological inferences.
        Mol. Cell. Proteomics. 2005; 4: 458-474
        • Cox B.
        • Kislinger T.
        • Emili A.
        Integrating gene and protein expression data: pattern analysis and profile mining.
        Methods. 2005; 35: 303-314
        • Gautier L.
        • Møller M.
        • Friis-Hansen L.
        • Knudsen S.
        Alternative mapping of probes to genes for Affymetrix chips.
        BMC Bioinformatics. 2004; 5: 111
        • Flikka K.
        • Yadetie F.
        • Laegreid A.
        • Jonassen I.
        XHM: a system for detection of potential cross-hybridizations in DNA microarrays.
        BMC Bioinformatics. 2004; 5: 117
        • Pontius J.U.
        • Wagner L.
        • Schuler G.D.
        UniGene: a unified view of the transcriptome.
        in: The NCBI Handbook. National Center for Biotechnology Information, Bethesda, MD 2003: 1-12
        • Liu G.
        • Loraine A.E.
        • Shigeta R.
        • Cline M.
        • Cheng J.
        • Valmeekam V.
        • Sun S.
        • Kulp D.
        • Siani-Rose M.A.
        NetAffx: Affymetrix probesets and annotations.
        Nucleic Acids Res. 2003; 31: 82-86
        • McGowan S.J.
        • Terrett J.
        • Brown C.G.
        • Adam P.J.
        • Aldridge L.
        • Allen J.C.
        • Amess B.
        • Andrews K.A.
        • Barnes M.
        • Barnwell D.E.
        • Berry J.
        • Bird H.
        • Boyd R.S.
        • Broughton M.J.
        • Brown A.
        • Bruce J.A.
        • Brusten L.C.M.
        • Draper N.J.
        • Elsmore B.M.
        • Freeman C.D.
        • Giles D.M.
        • Gong H.
        • Gormley D.
        • Griffiths M.R.
        • Hawkes T.D.R.
        • Haynes P.S.
        • Heesom K.J.
        • Herath A.
        • Hollis K.
        • Hudsen L.J.
        • Inman J.
        • Jacobs M.
        • Jarman D.
        • Kibria I.
        • Kilgour J.J.
        • Kinuthia S.K.
        • Lane K.E.
        • Lees M.L.
        • Loader J.
        • Longmore A.
        • McEwan M.
        • Middleton A.
        • Moore S.
        • Murray C.
        • Murray H.M.
        • Myatt C.P.
        • Ng S.S.
        • O’Neil A.
        • Parekh R.B.
        • Patel A.
        • Patel K.B.
        • Patel S.
        • Patel T.P.
        • Philp R.J.
        • Platt A.E.
        • Poyser H.
        • Prendergast C.
        • Prime S.
        • Redpath N.
        • Reeves M.
        • Robinson A.W.
        • Rohlff C.
        • Rosenbaum J.M.
        • Schenker M.
        • Scrivener E.
        • Shipston N.
        • Siddiq S.
        • Southan C.
        • Spencer D.I.R.
        • Stamps A.
        • Steffens M.A.
        • Stevenson D.
        • Sweetman G.M.A.
        • Taylor S.
        • Townsend R.
        • Ventom A.M.
        • Waller M.N.H.
        • Weresch C.
        • Williams A.M.
        • Woolliscroft R.J.
        • Yu X.
        • Lyall A.
        Annotation of the human genome by high-throughput sequence analysis of naturally occurring proteins.
        Curr. Proteomics. 2004; 1: 41-48
        • Rohlff C.
        New approaches towards integrated proteomic databases and depositories.
        Expert Rev. Proteomics. 2004; 1: 267-274
        • Dowell R.D.
        • Jokerst R.M.
        • Day A.
        • Eddy S.R.
        • Stein L.
        The distributed annotation system.
        BMC Bioinformatics. 2001; 2: 7
        • Craig R.
        • Cortens J.P.
        • Beavis R.C.
        Open source system for analyzing, validating, and storing protein identification data.
        J. Proteome Res. 2004; 3: 1234-1242
        • Skrabanek L.
        • Campagne F.
        TissueInfo: high-throughput identification of tissue expression profiles and specificity.
        Nucleic Acids Res. 2001; 29: e102
        • Mu X.
        • Zhao S.
        • Pershad R.
        • Hsieh T.F.
        • Scarpa A.
        • Wang S.W.
        • White R.A.
        • Beremand P.D.
        • Thomas T.L.
        • Gan L.
        • Klein W.H.
        Gene expression in the developing mouse retina by EST sequencing and microarray analysis.
        Nucleic Acids Res. 2001; 29: 4983-4993
        • Yeo G.
        • Holste D.
        • Kreiman G.
        • Burge C.B.
        Variation in alternative splicing across human tissues.
        Genome Biol. 2004; 5: R74
        • Taylor C.F.
        • Paton N.W.
        • Garwood K.L.
        • Kirby P.D.
        • Stead D.A.
        • Yin Z.
        • Deutsch E.W.
        • Selway L.
        • Walker J.
        • Riba-Garcia I.
        • Mohammed S.
        • Deery M.J.
        • Howard J.A.
        • Dunkley T.
        • Aebersold R.
        • Kell D.B.
        • Lilley K.S.
        • Roepstorff P.
        • Yates III, J.R.
        • Brass A.
        • Brown A.J.
        • Cash P.
        • Gaskell S.J.
        • Hubbard S.J.
        • Oliver S.G.
        A systematic approach to modeling, capturing, and disseminating proteomics experimental data.
        Nat. Biotechnol. 2003; 21: 247-254
        • Pedrioli P.G.
        • Eng J.K.
        • Hubley R.
        • Vogelzang M.
        • Deutsch E.W.
        • Raught B.
        • Pratt B.
        • Nilsson E.
        • Angeletti R.H.
        • Apweiler R.
        • Cheung K.
        • Costello C.E.
        • Hermjakob H.
        • Huang S.
        • Julian R.K.
        • Kapp E.
        • McComb M.E.
        • Oliver S.G.
        • Omenn G.
        • Paton N.W.
        • Simpson R.
        • Smith R.
        • Taylor C.F.
        • Zhu W.
        • Aebersold R.
        A common open representation of mass spectrometry data and its application to proteomics research.
        Nat. Biotechnol. 2004; 22: 1459-1466
        • Orchard S.
        • Zhu W.
        • Julian R.K.
        • Hermjakob H.
        • Apweiler R.
        Further advances in the development of a data interchange standard for proteomics data.
        Proteomics. 2003; 3: 2065-2066
        • MacCoss M.J.
        • McDonald W.H.
        • Saraf A.
        • Sadygov R.
        • Clark J.M.
        • Tasto J.J.
        • Gould K.L.
        • Wolters D.
        • Washburn M.
        • Weiss A.
        • Clark J.I.
        • Yates J.R.
        Shotgun identification of protein modifications from protein complexes and lens tissue.
        Proc. Natl. Acad. Sci. U. S. A. 2002; 99: 7900-7905
        • Choudhary G.
        • Wu S.L.
        • Shieh P.
        • Hancock W.S.
        Multiple enzymatic digestion for enhanced sequence coverage of proteins in complex proteomic mixtures using capillary LC with ion trap MS/MS.
        J. Proteome Res. 2003; 2: 59-67
        • Pan S.
        • Zhang H.
        • Rush J.
        • Eng J.
        • Zhang N.
        • Patterson D.
        • Comb M.J.
        • Aebersold R.
        High throughput proteome-screening for biomarker detection.
        Mol. Cell. Proteomics. 2005; 4: 182-190
        • Wall D.B.
        • Kachman M.T.
        • Gong S.S.
        • Parus S.J.
        • Long M.W.
        • Lubman D.M.
        Isoelectric focusing nonporous silica reversed-phase high-performance liquid chromatography/electrospray ionization time-of-flight mass spectrometry: a three-dimensional liquid-phase protein separation method as applied to the human erythroleukemia cell-line.
        Rapid Commun. Mass Spectrom. 2001; 15: 1649-1661
        • Liu H.
        • Berger S.J.
        • Chakraborty A.B.
        • Plumb R.S.
        • Cohen S.A.
        Multidimensional chromatography coupled to electrospray ionization time-of-flight mass spectrometry as an alternative to two-dimensional gels for the identification and analysis of complex mixtures of intact proteins.
        J. Chromatogr. B Anal. Technol. Biomed. Life Sci. 2002; 782: 267-289
        • Wienkoop S.
        • Glinski M.
        • Tanaka N.
        • Tolstikov V.
        • Fiehn O.
        • Weckwerth W.
        Linking protein fractionation with multidimensional monolithic reversed-phase peptide chromatography/mass spectrometry enhances protein identification from complex mixtures even in the presence of abundant proteins.
        Rapid Commun. Mass Spectrom. 2004; 18: 643-650
        • Moritz R.L.
        • Ji H.
        • Schutz F.
        • Connolly L.M.
        • Kapp E.A.
        • Speed T.P.
        • Simpson R.J.
        A proteome strategy for fractionating proteins and peptides using continuous free-flow electrophoresis coupled off-line to reversed-phase high-performance liquid chromatography.
        Anal. Chem. 2004; 76: 4811-4824
        • Lee S.W.
        • Berger S.J.
        • Martinovic S.
        • Pasa-Tolic L.
        • Anderson G.A.
        • Shen Y.
        • Zhao R.
        • Smith R.D.
        Direct mass spectrometric analysis of intact proteins of the yeast large ribosomal subunit using capillary LC/FTICR.
        Proc. Natl. Acad. Sci. U. S. A. 2002; 99: 5942-5947
        • VerBerkmoes N.C.
        • Bundy J.L.
        • Hauser L.
        • Asano K.G.
        • Razumovskaya J.
        • Larimer F.
        • Hettich R.L.
        • Stephenson Jr., J.L.
        Integrating “top-down” and “bottom-up” mass spectrometric approaches for proteomic analysis of Shewanella oneidensis.
        J. Proteome Res. 2002; 1: 239-252
        • Strader M.B.
        • VerBerkmoes N.C.
        • Tabb D.L.
        • Connelly H.M.
        • Barton J.W.
        • Bruce B.D.
        • Pelletier D.A.
        • Davison B.H.
        • Hettich R.L.
        • Larimer F.W.
        • Hurst G.B.
        Characterization of the 70S ribosome from Rhodopseudomonas palustris using an integrated “top-down” and “bottom-up” mass spectrometric approach.
        J. Proteome Res. 2004; 3: 965-978
        • Nemeth-Cawley J.F.
        • Tangarone B.S.
        • Rouse J.C.
        “Top down” characterization is a complementary technique to peptide sequencing for identifying protein species in complex mixtures.
        J. Proteome Res. 2003; 2: 495-505
        • Wang H.
        • Kachman M.T.
        • Schwartz D.R.
        • Cho K.R.
        • Lubman D.M.
        Comprehensive proteome analysis of ovarian cancers using liquid phase separation, mass mapping and tandem mass spectrometry: a strategy for identification of candidate cancer biomarkers.
        Proteomics. 2004; 4: 2476-2495
        • Humphery-Smith I.
        A human proteome project with a beginning and an end.
        Proteomics. 2004; 4: 2519-2521
        • Uhlen M.
        • Ponten F.
        Antibody-based proteomics for human tissue profiling.
        Mol. Cell. Proteomics. 2005; 4: 384-393
        • Prince J.T.
        • Carlson M.W.
        • Wang R.
        • Lu P.
        • Marcotte E.M.
        The need for a public proteomics repository.
        Nat. Biotechnol. 2004; 22: 471-472