Originally published In Press as doi:10.1074/mcp.R500012-MCP200 on July 11, 2005.
Molecular & Cellular Proteomics 4:1419-1440, 2005.
© 2005 by The American Society for Biochemistry and Molecular Biology, Inc.


Tutorial

Interpretation of Shotgun Proteomic Data

The Protein Inference Problem*

Alexey I. Nesvizhskii‡,§ and Ruedi Aebersold‡

From the ‡Institute for Systems Biology, Seattle, Washington 98103 and Institute for Molecular Systems Biology, ETH-Zurich, CH-8093 Zurich, Switzerland


    ABSTRACT
 TOP
 ABSTRACT
 THE PROTEIN INFERENCE PROBLEM:...
 ASSEMBLING PEPTIDES INTO...
 COMPUTATIONAL TOOLS
 PROTEIN SEQUENCE DATABASES
 IDENTIFICATION OF MATURE FORMS...
 QUANTITATIVE PROTEOMICS
 INTEGRATION OF PROTEOMIC AND...
 INTEGRATION OF MULTIPLE SHOTGUN...
 CONCLUDING REMARKS
 REFERENCES
 
The shotgun proteomic strategy based on digesting proteins into peptides and sequencing them using tandem mass spectrometry and automated database searching has become the method of choice for identifying proteins in most large scale studies. However, the peptide-centric nature of shotgun proteomics complicates the analysis and biological interpretation of the data especially in the case of higher eukaryote organisms. The same peptide sequence can be present in multiple different proteins or protein isoforms. Such shared peptides therefore can lead to ambiguities in determining the identities of sample proteins. In this article we illustrate the difficulties of interpreting shotgun proteomic data and discuss the need for common nomenclature and transparent informatic approaches. We also discuss related issues such as the state of protein sequence databases and their role in shotgun proteomic analysis, interpretation of relative peptide quantification data in the presence of multiple protein isoforms, the integration of proteomic and transcriptional data, and the development of a computational infrastructure for the integration of multiple diverse datasets.


An explicit goal of proteomics is the identification and quantification of all the proteins expressed in a cell or tissue (1). Although not yet at the levels of data throughput and automation achieved in other genomic analyses such as DNA sequencing or microarray gene expression analysis, global protein profiling methods are rapidly evolving. This has been possible because of recent improvements in MS instrumentation, protein and peptide separation techniques, computational data analysis tools, and the availability of complete sequence databases for many species. As a result, analysis of complex protein mixtures using shotgun proteomics, a strategy based on the combination of protein digestion and MS/MS-based peptide sequencing (2–4), has become widely adopted. The method allows protein identifications and, when combined with stable isotope labeling, quantification of the changes in the protein expression levels for hundreds of proteins in a single experiment (1).

Compared with other MS-based proteomic technologies such as intact protein sequencing (5, 6) or 2D¹ gel-based protein analysis (7), shotgun proteomic analysis has achieved a relatively high throughput. This is the result of a combination of several factors. Proteolytic digestion of proteins into shorter peptides simplifies MS/MS sequencing (peptides are easier to fragment in the mass spectrometer than intact proteins), whereas elimination of the 2D gel-based separation at the protein level simplifies sample handling and increases the overall data throughput. At the same time, computational analysis and interpretation of the data become more challenging (8–13). The first and foremost computational challenge is the need to process large volumes of acquired MS/MS data with the purpose of identifying peptides that gave rise to observed spectra. This challenge is now well understood, and a number of computational methods and software tools, including programs for assigning peptides to MS/MS spectra (14–20) and for statistical validation of those assignments (21–25), have been developed. However, identification of peptides resulting from proteolytic digestion of sample proteins represents only an intermediate step because the ultimate goal of most experiments is to identify (and quantify when appropriate) the proteins that are present in the original sample. Increasingly it has been realized that the protein inference problem, i.e. the task of assembling the sequences of identified peptides to infer the protein content of the sample, is far from being trivial and requires special attention (22, 26–30).

The difficulty of assembling peptide identifications back to the protein level results from the same factors that made the shotgun proteomic approach so successful in the first place, i.e. protein digestion at an early stage of the process and elimination of extensive separation at the protein level. Protein digestion makes peptides, and not the proteins, the currency of the method, and the connectivity between peptides and proteins is lost at the digestion stage. This loss of connectivity complicates computational analysis and biological interpretation of the data especially in the case of higher eukaryote organisms. The same peptide sequence can be present in multiple different proteins. Therefore, the identification of such shared² peptides can lead to ambiguities in the determination of the identities of the sample proteins (see Fig. 1). In general, protein identification is a less complex issue³ if proteins are first separated using a multidimensional protein separation technique (e.g. 2D gels) where additional information, such as the protein molecular weight and isoelectric point, can assist in determination of the protein identities (see e.g. Refs. 7 and 31–33).



FIG. 1. Protein identification using MS/MS. On the left is a depiction of the process in a typical 2D gel-based approach: the proteins are separated, visualized, and excised from the gel. On the right is the process involved for shotgun proteomics. In both approaches, sample proteins are proteolytically digested into peptides, and the resulting peptide mixtures are separated using liquid chromatography. Peptides are ionized, and selected peptide ions are subjected to MS/MS sequencing. Peptide sequences are determined from MS/MS spectra using a database search approach. Any given peptide may be a part of the sequences of several different proteins. The protein inference problem involves figuring out which proteins are present in the sample given the sequences of identified peptides. In this example, the sample contains two proteins, A and B, which share extensive sequence homology. All three identified peptides, AEMK, GAGGLR, and HYFEDR, are present in the sequence of protein B, and the last two peptides are also in the sequence of protein A. In the shotgun proteomic approach, the connectivity between peptides and proteins is lost; no information on the number of proteins in the sample or their properties (e.g. molecular weight) is available. It is not possible to conclude that protein A is present in the sample because protein B can account for all observed peptides. This is less of a problem in the case of the 2D gel-based approach where proteins are separated prior to digestion and MS/MS analysis.
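The ambiguity described in Fig. 1 can be reproduced with a few lines of Python. This is only an illustrative sketch: the three peptide sequences are taken from the figure, but the two protein sequences and the helper name `map_peptides` are invented here, since the actual sequences of proteins A and B are not given.

```python
# Toy sequences (hypothetical): only the three peptides come from Fig. 1.
proteins = {
    "A": "MSTKGAGGLRHYFEDRLLK",
    "B": "AEMKGAGGLRHYFEDRPQK",
}
peptides = ["AEMK", "GAGGLR", "HYFEDR"]


def map_peptides(peptides, proteins):
    """Return {peptide: sorted list of proteins whose sequence contains it}."""
    return {
        pep: sorted(name for name, seq in proteins.items() if pep in seq)
        for pep in peptides
    }


mapping = map_peptides(peptides, proteins)

# Protein B alone accounts for every observed peptide ...
explained_by_b = all("B" in prots for prots in mapping.values())
# ... and no observed peptide points to protein A exclusively.
unique_to_a = [pep for pep, prots in mapping.items() if prots == ["A"]]

for pep, prots in mapping.items():
    print(pep, "->", ", ".join(prots))
```

With these toy sequences, every peptide maps to B and none maps to A alone, so the peptide evidence by itself cannot establish the presence of A.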

 
In this article we illustrate the protein inference problem of shotgun proteomics using a set of examples and discuss the need for common nomenclature and transparent informatic approaches for assembling peptides into proteins and presenting the results of shotgun proteomic experiments to the user. We also discuss related issues such as the state of protein sequence databases and their role in shotgun proteomic analysis, interpretation of quantitative proteomic data in the presence of multiple protein isoforms, correlation of proteomic and transcriptional data, and comparison and integration of shotgun proteomic data generated in different experiments.


    THE PROTEIN INFERENCE PROBLEM: CASE STUDIES
Although the shotgun proteomic approach is peptide-centric, in most cases researchers are ultimately interested in knowing what proteins are present in the analyzed sample. Inferring protein identities given a set of identified peptides becomes difficult in the case of higher eukaryote organisms. This is due to sequence redundancy, i.e. the presence of distinct proteins having a high degree of sequence homology, as is the case in protein families, alternative splice forms of the same gene, differentially processed proteins, and more (26, 34).

Although the identification of a single peptide is often sufficient to conclude that a product of a certain gene is present in the sample, it is often not possible to discriminate between different proteins that share extensive homology or are isoforms arising from alternatively spliced genes as illustrated in Fig. 2. All examples used in this work, except where noted, were found in the dataset from an experiment on lipid raft plasma membrane domains from human Jurkat T cells (27).⁴ In the first example, Fig. 2A, several peptides were identified that are present in two different splice forms of the F-actin capping protein β subunit, CAPB_HUMAN, P47756-1 and P47756-2. Because there are no peptides identified in the experiment that would correspond to one of the isoforms only, both isoforms are equally likely; any one of them, or both, could be present in the sample. In this particular case, the sequences of the two isoforms differ significantly at the C terminus. Thus, discrimination between these two isoforms using sequence information alone would be possible only with the identification of peptides spanning the areas where the sequences diverge.



FIG. 2. Sequences of identified peptides often do not allow discrimination between different protein isoforms. A, multiple peptides are identified that are present in two different splice forms, P47756-1 and P47756-2, of the F-actin capping protein β subunit. The alignment of the sequences of the two isoforms is shown, and the sequences of the identified peptides are shown in bold. The isoforms are indistinguishable given the available data. The discrimination would be possible if peptides spanning the areas where the sequences diverge, e.g. SIDAIPDNQK (unique to P47756-1) and SVQTFADK (unique to P47756-2), were identified. B, protein isoforms of epithelial protein lost in neoplasm. Three isoforms result from splicing of several consecutive exons located at the 5'-end in the gene sequence. Given the sequences of the identified peptides (shown in bold), it is not possible to determine precisely which isoform is present. The sequences of the two shorter isoforms (Q9UHB6-2 and Q9UHB6-3) are included in the sequence of the longer isoform (Q9UHB6-1). Identification of a peptide from the region present in the isoform β only (sequence shown in a box) would allow conclusive identification of this isoform (no such peptides were actually observed in the experiment). Conclusive identification of the shorter isoforms would be difficult because they do not contain any unique sequence.

 
Conclusive identification of alternative splice forms arising from skipping of one or more consecutive exons at the 5'- or 3'-end of the gene sequence (without introduction of any divergent sequence) is more challenging. One such example is shown in Fig. 2B. The sequences of the two shorter isoforms of the epithelial protein lost in neoplasm, EPLI_HUMAN (Q9UHB6-2 and Q9UHB6-3, isoforms α and 3, respectively) are included in the sequence of the longer isoform (Q9UHB6-1, isoform β). No conclusive evidence for the presence of the shorter isoforms in the sample can be obtained without any additional information, e.g. molecular weight of the sample proteins.⁵

Furthermore in some cases it is not possible to discriminate between proteins that are products of different genes from the same gene family (gene paralogues) (22, 26, 35, 36). This is illustrated in Fig. 3. A total of 11 peptides were identified in the dataset that are shared between more than a dozen different members of the α-tubulin family. None of the identified peptides is unique to any of the proteins. Thus, although those peptides clearly indicate the presence of one or more α-tubulin proteins, it is not possible to determine which particular member(s) of that family is present in the sample.



FIG. 3. An example of a protein family. Eleven tryptic peptides are identified that are shared between the members of the α-tubulin family. None of the proteins is identified by a peptide that is unique to it, thus making it impossible to determine which particular member(s) of the family is present in the sample.

 
Interpretation of the data is further complicated by artificial redundancies, e.g. truncated sequences, sequence alternatives arising from sequencing errors, and existence of essentially the same sequences under different gene names, features that exist in many protein sequence databases. This is illustrated in Fig. 4 where a group of peptides was identified that are shared between four different entries in the Entrez Protein sequence database maintained by the National Center for Biotechnology Information (NCBI).⁶ Manual examination revealed that all four database entries represent the same protein, heat shock 70-kDa protein 9B (HSPA7B). Three of those entries are derived from mRNA sequences containing small sequence variations. In this particular case, the sequence variations are likely due to sequencing errors. However, these variations could also be polymorphisms, i.e. real sequence variants of the same protein from a different individual. In many cases, such database redundancies can only be resolved on a case by case basis by researchers analyzing the data. This presents an additional challenge for the development of automated informatic tools for dealing with large scale proteomic datasets.



FIG. 4. Sequence database redundancies complicate the analysis of shotgun proteomic data. Four separate entries in the Entrez Protein database represent the same protein, heat shock 70-kDa protein 9B. Three of them (entries 2–4) are derived from mRNA sequences containing small sequence variations.

 
Another problem encountered in shotgun proteomics is the difficulty of assigning the correct peptide sequence to an MS/MS spectrum. Two of the amino acids (Ile/Leu) have identical masses, and the difference between several other amino acid combinations (e.g. Asp/Asn and Glu/Gln/Lys) cannot be resolved using low mass accuracy instruments such as the commonly used ion traps. If the database contains several peptides with a similar molecular weight and having a high degree of sequence homology, determination of the correct peptide sequence among the alternatives becomes difficult or even impossible (in the case of Ile/Leu substitutions). This can result in the assignment of incorrect (although homologous) peptide sequences to MS/MS spectra, which in turn can result in incorrect protein identifications. Some but not all ambiguities can be resolved when using high mass accuracy instruments such as LTQ-FT or even Q-TOF.
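The mass argument above can be made concrete with standard monoisotopic residue masses. The sketch below is illustrative only: the function names and the tolerance values are assumptions chosen for the example, not instrument specifications from this article, and a single-residue mass difference is a simplification of what a search engine actually compares.

```python
# Standard monoisotopic residue masses (Da) for the residues discussed above.
RESIDUE_MASS = {
    "I": 113.08406, "L": 113.08406,
    "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259,
}


def mass_difference(a, b):
    """Absolute mass difference (Da) between two residues."""
    return abs(RESIDUE_MASS[a] - RESIDUE_MASS[b])


def resolvable(a, b, tolerance_da):
    """True if substituting a for b shifts the mass by more than the tolerance."""
    return mass_difference(a, b) > tolerance_da


# Hypothetical tolerances: a loose one and a tight one.
for pair in [("I", "L"), ("D", "N"), ("Q", "K"), ("E", "Q")]:
    diff = mass_difference(*pair)
    print(pair, f"{diff:.5f} Da",
          "loose:", resolvable(*pair, tolerance_da=1.5),
          "tight:", resolvable(*pair, tolerance_da=0.01))
```

Under these assumed tolerances, Ile/Leu remains unresolvable at any mass accuracy, whereas Gln/Lys (about 0.036 Da apart) becomes resolvable only with the tight tolerance.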

Distinguishing between different proteins having a high degree of sequence similarity becomes increasingly difficult with decreasing protein sequence coverage, i.e. the fraction of the protein sequence covered by the identified peptides. The protein coverage observed in shotgun proteomic experiments is typically low. The number of "identifiable" peptides is obviously limited by the size of the protein but also by such factors as the enzymatic digestion constraint and the detection mass range of the mass spectrometer. In some cases, the pool of potential peptide identifications is further reduced as a result of selective enrichment for a particular class of peptides, e.g. cysteine-containing peptides in quantitative proteomic experiments based on ICAT reagents (3). The identification of some peptides can be prevented by unexpected post-translational modifications. Furthermore because multiple different peptide ions are injected in the mass spectrometer (operated in top-down ion selection mode) at any given time, low intensity ions, produced by low abundance or poorly ionizing peptides, are less likely to be selected for MS/MS sequencing (1). Finally some peptides, due to their physical-chemical properties, cannot be efficiently ionized or fragment in an atypical way producing MS/MS spectra unidentifiable by the current database search tools. As a result, more than 30% of all proteins that are detected in a typical shotgun proteomic experiment, including many low molecular weight or low abundance proteins, are identified by a single peptide.
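Sequence coverage as defined above, i.e. the fraction of the protein sequence covered by the identified peptides, is straightforward to compute. The following is a minimal sketch using exact substring matching; the function name and the toy protein sequence are hypothetical.

```python
def sequence_coverage(protein_seq, peptides):
    """Fraction of residues in protein_seq covered by at least one peptide."""
    covered = [False] * len(protein_seq)
    for pep in peptides:
        start = protein_seq.find(pep)
        while start != -1:
            # Mark every residue spanned by this occurrence of the peptide.
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein_seq.find(pep, start + 1)
    return sum(covered) / len(protein_seq) if protein_seq else 0.0


# Toy 19-residue protein (hypothetical) identified by two peptides.
coverage = sequence_coverage("AEMKGAGGLRHYFEDRPQK", ["GAGGLR", "HYFEDR"])
print(f"coverage = {coverage:.2f}")
```

Overlapping peptides are counted only once because coverage is tracked per residue rather than per peptide.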


    ASSEMBLING PEPTIDES INTO PROTEINS
Results of large scale proteomic experiments are often presented as lists of protein identifications. At present, significant inconsistencies exist in the way different research groups assign peptides to proteins and deal with biological and database redundancies. The criteria for calling a protein "identified" are not always described, and there is no generally accepted way to do it. Shared peptides (peptides present in more than one sequence database entry) are sometimes assigned to a particular protein among several possibilities in a random fashion. Different sequence database entries could be counted as separate protein identifications when in fact all of them share the same set of peptides and, therefore, are indistinguishable. In most cases not only do these redundancies inflate the total number of proteins reported as identified, but they can also lead to incorrect biological interpretation of the data. The problem is further complicated when no statistical analysis is performed to determine the validity of peptide and protein identifications (10, 29). Thus, there is a need to develop a common nomenclature and a set of guidelines for assigning peptides to proteins and for interpreting resulting protein identification datasets.

The nomenclature described below provides a consistent way for presenting the results of large scale proteomic experiments. In creating a protein summary list that accurately represents the data, various peptide grouping scenarios have to be considered that are schematically illustrated in Fig. 5 (22, 30). The diagram in Fig. 5a describes a case of two distinct proteins, A and B, each identified by distinct⁷ peptides only, i.e. peptides corresponding to that one protein and no other proteins (peptides 1 and 2 are unique to protein A, and peptides 3 and 4 are unique to protein B). Fig. 5b shows a case of two differentiable proteins, which are identified by at least one distinct peptide (peptide 1 is unique to A, and peptide 4 is unique to protein B) but also by one or more shared peptides (peptides 2 and 3 are shared between the two proteins). A different scenario is shown in Fig. 5c where all peptides are shared between proteins A and B. These two proteins are indistinguishable given the sequences of the identified peptides, and either protein A, protein B, or both can be present in the sample. Fig. 5, d and e, each show a situation where all identified peptides corresponding to protein B are shared and can be accounted for by another protein (protein A in Fig. 5d) or a combination of several other proteins (proteins A and C in Fig. 5e) certain to be in the sample because they are identified by at least one distinct peptide. In general, no conclusion can be made regarding the presence of a subset (protein B in Fig. 5d) or a subsumable (protein B in Fig. 5e) protein in the sample. A special case is shown in Fig. 5f where all identified peptides are shared by a group of proteins. The presence of protein A in the sample is sufficient to explain all observed peptides (B and C are subset protein identifications). Although protein A is the most likely candidate, its presence in the sample is not required to explain the data; it is identified by shared peptides only. In the absence of protein A, a combination of proteins B and C would account for all four peptides. Such situations are often observed in the case of extended protein families, such as the tubulin example shown in Fig. 3. The examples discussed above are exhaustive, i.e. it should be possible to explain more complicated cases observed in real datasets by reducing them to a combination of several basic grouping scenarios.
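The grouping scenarios of Fig. 5 can be expressed as a small classifier over peptide sets. This is a sketch under stated assumptions rather than the procedure of any published tool: the function name, the order in which the rules are checked, and the label "group" (used here for the Fig. 5f situation, where a protein with shared peptides only is not covered by conclusively identified proteins) are choices made for this example. The peptide numbers for scenarios d to f are illustrative layouts consistent with the text.

```python
def classify(protein_peptides):
    """Label each protein given {protein: set of identified peptides}."""
    # Record which proteins each peptide maps to.
    parents = {}
    for prot, peps in protein_peptides.items():
        for pep in peps:
            parents.setdefault(pep, set()).add(prot)
    # Peptides that map to exactly one protein.
    unique = {prot: {p for p in peps if len(parents[p]) == 1}
              for prot, peps in protein_peptides.items()}

    labels = {}
    for prot, peps in protein_peptides.items():
        others = {q: s for q, s in protein_peptides.items() if q != prot}
        if unique[prot]:
            labels[prot] = ("distinct" if unique[prot] == peps
                            else "differentiable")
        elif any(peps < s for s in others.values()):   # proper subset
            labels[prot] = "subset"
        elif any(peps == s for s in others.values()):  # identical sets
            labels[prot] = "indistinguishable"
        else:
            # Peptides explained by proteins that have a distinct peptide.
            anchored = set().union(
                *(s for q, s in others.items() if unique[q]))
            labels[prot] = "subsumable" if peps <= anchored else "group"
    return labels


scenarios = {
    "a": {"A": {1, 2}, "B": {3, 4}},
    "b": {"A": {1, 2, 3}, "B": {2, 3, 4}},
    "c": {"A": {1, 2}, "B": {1, 2}},
    "d": {"A": {1, 2, 3}, "B": {2, 3}},
    "e": {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}},
    "f": {"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {3, 4}},
}
results = {name: classify(data) for name, data in scenarios.items()}
for name, labels in results.items():
    print(name, labels)
```

Checking the subset rule before the identical-set rule means that two indistinguishable proteins that are both contained in a larger protein are reported as subsets; a different ordering would be equally defensible.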



FIG. 5. Basic peptide grouping scenarios. a, distinct protein identifications. b, differentiable protein identifications. c, indistinguishable protein identifications. d, subset protein identification. e, subsumable protein identification. f, an example of a protein group where one protein can explain all observed peptides, but its identification is not conclusive.

 
The nomenclature described here, coupled with the Occam’s razor constraint (22), would provide a minimal list of proteins sufficient to explain all observed peptides. Such a minimal list would contain all distinct and differentiable proteins, e.g. proteins A and B in Fig. 5, a and b, and proteins A and C in Fig. 5e but no subsumable or subset proteins, e.g. only protein A would be included in the list in the cases shown in Fig. 5, d and f. In the case of indistinguishable protein identifications, Fig. 5c, it would be most accurate to collapse all such identifications into a single entry in the protein summary report as there is often no basis to eliminate any of them.

Presenting results of large scale shotgun experiments in terms of such minimal lists of protein identifications has several advantages. It significantly simplifies the interpretation of the data by allowing the user to focus on proteins that are conclusively determined to be present in the sample. It also allows calculation of a consistent measure for the number of proteins identified in the experiment as the smallest number of proteins that can explain all observed peptides (i.e. the number of entries in the minimal protein list).
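One simple way to approximate such a minimal list is a greedy set cover, sketched below. This is an assumption of the example, not the method of Ref. 22 or of any tool discussed here: a greedy cover is not guaranteed to be the smallest possible list, and the function name and the alphabetical tie-breaking rule are hypothetical choices. Indistinguishable proteins are collapsed into a single entry before the cover is computed.

```python
def minimal_protein_list(protein_peptides):
    """Greedy approximation of a minimal list explaining all peptides."""
    # Collapse proteins with identical peptide sets into one entry.
    groups = {}
    for prot, peps in protein_peptides.items():
        groups.setdefault(frozenset(peps), []).append(prot)
    entries = {"/".join(sorted(names)): set(peps)
               for peps, names in groups.items()}

    uncovered = set().union(*entries.values())
    chosen = []
    while uncovered:
        # Take the entry explaining the most still-unexplained peptides;
        # iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(entries),
                   key=lambda name: len(entries[name] & uncovered))
        chosen.append(best)
        uncovered -= entries[best]
    return chosen


# Peptide layouts are illustrative versions of the Fig. 5 scenarios.
list_c = minimal_protein_list({"A": {1, 2}, "B": {1, 2}})
list_d = minimal_protein_list({"A": {1, 2, 3}, "B": {2, 3}})
list_e = minimal_protein_list({"A": {1, 2}, "B": {2, 3}, "C": {3, 4}})
list_f = minimal_protein_list({"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {3, 4}})
print(list_c, list_d, list_e, list_f)
```

The length of the returned list corresponds to the protein count described above. As the article goes on to discuss, a strict minimal list hides the ambiguity of cases like Fig. 5f, where the selected protein is not conclusively identified.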

At the same time, presenting only the minimal list of proteins has limitations. For example, a researcher interested in a particular gene might want to observe all related protein isoforms annotated in the protein sequence database that are implicated by at least one peptide identified in the experiment. Moreover the strict implementation of the Occam’s razor approach can be misleading when applied to complex protein families. In the α-tubulin protein family example shown in Fig. 3, none of the identified peptides are unique to the tubulin α-1 protein. Thus, although this protein can explain all observed peptides, its identification is not conclusive. In fact, in the absence of the α-1 tubulin, all peptides can be accounted for by a combination of several other tubulins, e.g. α-3 and α-6. Because it is not possible to determine which particular member(s) of that family is present in the sample, in creating a minimal list it is more accurate and informative to present all members together as a group (22). Therefore, the most advantageous presentation would include the following: (a) a minimal list with indistinguishable proteins collapsed into a single entry (but showing all protein names) and with all members of protein groups listed and (b) means to observe the proteins implicated by at least one peptide that cannot be called conclusively identified. A simplified illustration of such a format of presentation is shown in Fig. 6.



FIG. 6. A simplified example of a protein summary list. Peptides are apportioned among all their corresponding proteins, and the minimal list of proteins is derived that can explain all observed peptides. Proteins that are impossible to differentiate on the basis of identified peptides are collapsed into a single entry (F and G) or presented as a group (H, I, and J). Shared peptides are marked with an asterisk. Proteins that cannot be conclusively identified are shown at the end of the list but do not contribute toward the protein count.

 

    COMPUTATIONAL TOOLS
A number of computational tools for assembling peptides into proteins in large scale shotgun proteomic experiments have been described (22, 28, 30, 37, 38). In general, the process of peptide assembly consists of the following steps. First, peptide assignments obtained by searching acquired MS/MS spectra against a protein sequence database using algorithms such as SEQUEST (14) or Mascot (16) are filtered using a user-specified set of criteria to remove false identifications. Second, accession numbers and annotations of protein sequence database entries corresponding to each peptide are retrieved from the sequence database. Third, peptides are grouped by their corresponding sequence database entries. Fourth, shared peptides are apportioned among all corresponding proteins, and a summary protein list is created. Ideally the apportionment of peptides to proteins should be done using a probability-based approach, i.e. taking into account the probabilities of peptide assignments (22). This has an advantage in that it allows calculation of statistical confidence measures for protein identifications and estimation of false identification error rates resulting from filtering the data (10, 22).
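The four steps above can be sketched in a few lines. Everything specific in this example is an assumption made for illustration: the function name, the 0.5 probability threshold, the input format, and in particular the apportionment rule, which splits each shared peptide equally among its parent entries and combines the weighted peptide probabilities as 1 − Π(1 − w·p). The published tools may weight shared peptides differently.

```python
from collections import defaultdict


def assemble(psms, peptide_to_proteins, min_prob=0.5):
    """Toy peptide assembly: returns {protein: naive probability score}."""
    # Step 1: drop low-probability assignments; keep the best per peptide.
    best = {}
    for peptide, prob in psms:
        if prob >= min_prob:
            best[peptide] = max(prob, best.get(peptide, 0.0))

    # Steps 2 and 3: look up the database entries for each peptide and
    # group the peptides by entry.
    by_protein = defaultdict(dict)
    for peptide, prob in best.items():
        for protein in peptide_to_proteins[peptide]:
            by_protein[protein][peptide] = prob

    # Step 4: apportion shared peptides (equal split, an assumption) and
    # combine the weighted peptide probabilities into a protein score.
    summary = {}
    for protein, peps in by_protein.items():
        miss = 1.0
        for peptide, prob in peps.items():
            weight = 1.0 / len(peptide_to_proteins[peptide])
            miss *= 1.0 - weight * prob
        summary[protein] = 1.0 - miss
    return summary


# Hypothetical peptide assignments with probabilities.
psms = [("AEMK", 0.9), ("GAGGLR", 0.95), ("HYFEDR", 0.8), ("TLLSK", 0.1)]
peptide_to_proteins = {
    "AEMK": ["B"], "GAGGLR": ["A", "B"], "HYFEDR": ["A", "B"], "TLLSK": ["C"],
}
summary = assemble(psms, peptide_to_proteins)
print(summary)
```

In this toy run the low-probability peptide is filtered out, so entry C never appears, and B scores higher than A because it also holds a peptide that is not shared. The equal split still leaves A with a substantial score, which is one reason a more careful weighting scheme is desirable.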

The format in which the results of shotgun proteomic experiments are presented to the user varies between the tools. In ProteinProphet (22), each separate entry in the protein summary file is assigned a probability that the corresponding protein is present in the sample. Indistinguishable proteins are collapsed into a single entry, and all members of protein groups, such as the α-tubulin family shown in Fig. 3, are presented together. All subset and subsumable protein entries are assigned zero probability, which is to be interpreted as the absence of conclusive evidence for the presence of those proteins in the sample. The subset and subsumable protein entries can be located and viewed using interactive web-based options. In the Experimental Peptide Identification Repository (EPIR) (38), the notion of protein groups introduced in Ref. 22 is extended, and all entries with shared peptides are organized into a single group. The protein that contains most of the peptides is selected as an anchor, and all group members that are identified by at least one distinct peptide are marked as conclusively identified. Additional visualization tools, e.g. a tool for aligning the sequences of all proteins within a protein group, are provided to assist in the interpretation of the data. Other software tools such as Isoform Resolver (28) and DBParser (30) create protein summary lists containing all protein sequence database entries identified by at least one peptide with proteins that share a set of peptides placed adjacent to each other. In Isoform Resolver, the protein summary lists are presented in a text format, and a peptide-centric numbering scheme is used to specify what proteins are identified conclusively. DBParser outputs the results in an interactive web-based format that allows the user to view both the redundant and the minimal list of proteins.

DTASelect (39) is another widely used tool for processing of shotgun proteomic data. However, it does not provide any statistical confidence measures for protein and peptide identifications, and its approach for assembling peptides into proteins in the presence of shared peptides has not been fully described. In addition, new tools are being developed at increasing speed, including commercial programs that combine the process of peptide identification and the subsequent assembly of peptides into proteins (40).⁸ This diversity of computational tools, a positive development reflecting the increased use of shotgun proteomics, nevertheless presents a significant challenge for developing any kind of standards for the analysis and journal publication of proteomic datasets (10, 29). It is thus essential that the computational tools are made transparent (published) and extensively tested and that the methods for assembling peptides into proteins and presenting the results all follow the same set of general guidelines such as those described in this article.


    PROTEIN SEQUENCE DATABASES
Computational analysis and biological interpretation of shotgun proteomic data requires selection of a reference protein sequence database. For some organisms, e.g. human, several different databases exist that vary in terms of completeness, degree of redundancy, and quality of sequence annotation (42). Table I and Fig. 7 summarize some of the existing protein sequence databases that are commonly used with mass spectrometry data. The choice of a particular database should be based on the goals of the experiment.


TABLE I Summary of the protein sequence databases that are commonly used in shotgun proteomic analysis

Database sizes and the number of sequences are given for the human subset of each database only. EBI, European Bioinformatics Institute; SIB, Swiss Institute of Bioinformatics.

 


FIG. 7. Protein sequence databases differ in terms of their completeness and the degree of sequence redundancy. A, the total number of tryptic peptides with no missed cleavages (Ntot) and the number of unique sequences among them (Nunique), in the range of molecular weights between 600 and 3000, in each of the human protein sequence databases listed in Table I. B, a measure of the database sequence redundancy (average number of database entries containing each unique tryptic peptide sequence), estimated by taking the ratio Ntot/Nunique, plotted as a function of peptide molecular weight (bin size of 50 mass units) for the same databases. SP, Swiss-Prot.

 
When peptides are assigned to MS/MS spectra using the database search approach, the universe of all potential peptide assignments is limited to the sequences present in the searched protein sequence database. The completeness of the sequence database thus can be a decisive factor in experiments where identification of sequence polymorphisms is crucial for the biological interpretation of the data. In those cases, a large database such as Entrez Protein (also known as the non-redundant NCBI database, NCBInr) (43) would have an advantage over smaller databases such as Uni-Prot/Swiss-Prot (44) or RefSeq (45). The Entrez Protein database, for example, contains twice as many unique tryptic peptide sequences as Uni-Prot/Swiss-Prot (Fig. 7A). At the same time, large sequence databases contain, in addition to true biologically significant sequence variants, numerous artificial redundancies arising e.g. from partial mRNAs or sequencing errors (see example in Fig. 4). Fig. 7B plots the average number of database entries containing each unique tryptic (with no missed cleavages) peptide sequence as a function of peptide molecular weight. For example, in the range of molecular weights around 1000, the majority of tryptic peptides in the Swiss-Prot database are distinct (Ntot/Nunique ~ 1), whereas in the Entrez Protein database each peptide is present on average in three different entries (Ntot/Nunique ~ 3). In the absence of good sequence annotation in large protein sequence databases such as the Entrez Protein database, it becomes necessary to perform time-consuming manual analysis and elimination of database redundancies. Furthermore searching such large databases makes it more difficult to separate the correct from random (incorrect) peptide assignments to MS/MS spectra.
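
The redundancy measure plotted in Fig. 7B can be illustrated with a minimal Python sketch. It digests each entry in silico (cleavage after Lys or Arg, no missed cleavages), counts in how many entries each peptide occurs, and reports Ntot/Nunique. The two-entry toy database and its entry names are invented for illustration. The sketch ignores the molecular weight window (600 to 3000) and the mass binning used in the figure, and it does not apply the common rule that suppresses cleavage before proline.

```python
from collections import Counter


def tryptic_peptides(protein):
    """Cleave after every K or R; no missed cleavages (simplified rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def redundancy(database):
    """Return (Ntot, Nunique, Ntot/Nunique) for a {name: sequence} database.

    Each peptide is counted once per entry, so the ratio is the average
    number of entries containing each unique peptide sequence.
    """
    entries_per_peptide = Counter()
    for sequence in database.values():
        entries_per_peptide.update(set(tryptic_peptides(sequence)))
    n_tot = sum(entries_per_peptide.values())
    n_unique = len(entries_per_peptide)
    return n_tot, n_unique, n_tot / n_unique


# Hypothetical toy database: a full-length entry and a partial copy of it.
toy_db = {
    "full_length": "MAAKPEPTIDERGG",
    "partial_mrna": "AAKPEPTIDERGG",
}
n_tot, n_unique, ratio = redundancy(toy_db)
print(n_tot, n_unique, ratio)
```

In this toy case the two entries share the peptides PEPTIDER and GG, so six peptide occurrences collapse to four unique sequences.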

When the quality of the sequence annotation and the ease of data interpretation are more important than the ability to identify sequence variants, it is more appropriate to use well curated databases such as Swiss-Prot or RefSeq. A good balance between the completeness and the level of redundancy is found in the International Protein Index (IPI) database (46), which is available for a number of organisms including human and mouse. The sequence- and identifier-based construction of this database significantly reduces the need for manual filtering while maintaining cross-references to all its source data, which include Ensembl (47), Uni-Prot (Swiss-Prot and its supplement TrEMBL) (44), and RefSeq (45). Minor sequence variants, however, are not represented in the IPI database.

Genomic databases can also be used for MS/MS database searching (48, 49), which can lead to the identification of novel alternative splice forms and sequence polymorphisms not present in the protein sequence databases. However, this type of computational analysis can be computer-intensive because of the large size of those databases. It is also complicated due to frameshifts, incorrectly predicted open reading frames, and poor quality of many EST sequences. This combined with the poor quality of many experimental MS/MS spectra can lead to high numbers of false identifications. A more efficient strategy for the identification of novel alternative splice forms or sequence variants is to perform computational analysis in an iterative fashion. In this approach, the analysis would start with searching MS/MS spectra against a well annotated database (e.g. RefSeq or IPI). The high quality spectra left unassigned in the initial search are then reanalyzed more extensively, first searching for post-translationally modified peptides and only then against large genomic databases.9
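
The control flow of such an iterative strategy can be sketched in Python as follows. This is only an outline under stated assumptions: each search stage is represented by a hypothetical function that returns a mapping from spectrum identifiers to peptides, spectrum quality is reduced to a single invented score, and the stage names, spectrum identifiers, and peptide labels are placeholders. No real search engine is called.

```python
def iterative_search(spectra, stages, is_high_quality):
    """Run search stages in order on {spectrum_id: quality_score}.

    After each stage, only spectra that are still unassigned AND of high
    quality are passed on to the next, more expensive stage.
    """
    assignments = {}
    remaining = dict(spectra)
    for stage_name, search in stages:
        hits = search(remaining)  # hypothetical: {spectrum_id: peptide}
        for spectrum_id, peptide in hits.items():
            assignments[spectrum_id] = (stage_name, peptide)
        remaining = {
            sid: quality
            for sid, quality in remaining.items()
            if sid not in hits and is_high_quality(quality)
        }
    return assignments, remaining


def lookup_stage(table):
    """Stand-in for a search engine: assigns spectra listed in a fixed table."""
    return lambda remaining: {sid: table[sid] for sid in remaining if sid in table}


# Invented spectra with quality scores between 0 and 1.
spectra = {"s1": 0.9, "s2": 0.8, "s3": 0.2, "s4": 0.95, "s5": 0.7}
stages = [
    ("annotated_db", lookup_stage({"s1": "peptide_A"})),
    ("modified_peptides", lookup_stage({"s2": "phosphopeptide_B"})),
    ("genomic_db", lookup_stage({"s3": "spurious_match", "s4": "novel_peptide_C"})),
]
assignments, unassigned = iterative_search(spectra, stages, lambda q: q >= 0.5)
print(assignments)
print(unassigned)
```

In this toy run the low quality spectrum s3 is dropped after the first stage and therefore never reaches the genomic search, where it would have produced a spurious match.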

An important caveat to keep in mind when interpreting shotgun proteomic data is that the protein sequence databases are constantly in flux especially with regard to minor sequence variants, alternative splice forms, and other less well characterized gene products. With each new database update, some protein sequences disappear, the annotation and accession numbers of the remaining sequences can change, and new sequences can be added. The instability of the current sequence databases is largely due to a substantial amount of work being carried out to improve their completeness and the quality of sequence annotation, a process that is likely to continue for a significant period of time. This has significant implications in that interpretation of the MS/MS-based proteomic data, e.g. assignment of peptides to entries in the protein sequence database and conclusions about the presence of a particular protein isoform in the sample, depends on the version of the protein sequence database used in the analysis. Frequent updating of the sequence databases by the database providers can complicate ongoing proteomic experiments. Researchers using these databases in the analysis of their data often have to reanalyze previously acquired and processed MS/MS spectra using a new version of the database or develop bioinformatic tools for automated mapping of peptide sequences, identified by searching MS/MS spectra against an older version of the database, to the latest version of that database. It is important to note that the data coming from MS-based proteomic experiments can itself be used to assist in the process of improving protein sequence databases provided a mechanism is developed for communicating the sequences of peptides identified by searching those databases back to the database developers and annotators (50, 51).
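
The automated mapping of previously identified peptides to a newer database release, mentioned above, can be sketched as a plain substring search. This is an assumption about how such a tool might work, not a description of any published program. The accession numbers and sequences are made up, and the sketch ignores complications such as Ile/Leu ambiguity.

```python
def remap_peptides(peptides, new_database):
    """Map each peptide to the accessions in new_database that contain it."""
    return {
        peptide: sorted(
            accession
            for accession, sequence in new_database.items()
            if peptide in sequence
        )
        for peptide in peptides
    }


# Peptides identified earlier against an older release (hypothetical data).
old_identifications = {"PEPTIDER": "OLD001", "GHTLSK": "OLD002"}

# Newer release: OLD001 was split into two entries, OLD002 was withdrawn.
new_database = {
    "NEW010": "MAAKPEPTIDERGG",
    "NEW011": "MKPEPTIDERAA",
}

remapped = remap_peptides(old_identifications, new_database)
print(remapped)
```

A peptide that maps to an empty list has lost all of its supporting entries in the new release, and a peptide that maps to several accessions has become a shared peptide. Either outcome changes the protein level interpretation.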


    IDENTIFICATION OF MATURE FORMS OF PROTEINS
 
The discussion so far has mostly focused on the problem of assigning peptides to proteins and distinguishing between different protein forms whose sequences are present in the protein sequence database. A closely related issue is the difficulty of using shotgun proteomic data to provide conclusive information regarding the mature form of the sample proteins. First, most existing protein sequence databases contain entries that are derived from full cDNAs encoding preprocessed forms. Thus, they do not typically contain the mature forms derived from various post-translational processing mechanisms, e.g. removal of the leading methionine, cleavage of the signal or transit peptide, etc. Second, even if all mature forms were annotated in the protein sequence database, distinguishing between different protein isoforms would be difficult. For example, a mere observation that none of the identified peptides are coming from the N-terminal region of the protein does not necessarily indicate the cleavage of the presequence. It can be explained by other factors, e.g. the absence of identifiable tryptic peptides in that region.

In some cases, post-translational processing events can be inferred using the knowledge regarding the specificity of the proteolytic enzyme used to digest proteins into peptides. For example, the enzyme trypsin cleaves after arginine and lysine residues. A peptide resulting from trypsin digestion should contain Lys or Arg at its C terminus (unless it is located at the C terminus of the protein), and in the sequence of its corresponding protein the residue immediately preceding the peptide should also be Lys or Arg (or the peptide is located at the N terminus). Thus, identification of a peptide whose sequence does not adhere to the enzymatic digestion constraint at one of its termini could indicate that the mature form of the protein is present in the sample. One such example is shown in Fig. 8A where identification of a "partially tryptic" peptide (not tryptic at its N terminus), assigned to the "Basigin precursor" database entry (Swiss-Prot accession number P35613), suggests that the mature form of that protein resulting from the proteolytic cleavage of the 22-residue-long signal peptide is present in the biological sample. In general, identifying peptides that are not tryptic (assuming no protein cleavage) and are located close to the N terminus of the protein can be a useful strategy for inferring signal peptide cleavage sites or other proteolytic cleavage events, thus confirming, refining, or adding to the annotations currently available in the protein databases such as Swiss-Prot. In some cases, it can also assist in discrimination between different protein isoforms resulting from alternative splicing. It should be noted, however, that some partially tryptic peptides can be observed due to in-source or in-solution fragmentation of the originally tryptic peptides. Thus, conclusions based on the observation of partially tryptic peptides require additional scrutiny.
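
The tryptic constraint described above can be checked with a few lines of Python. The sketch classifies a peptide as fully, partially, or non-tryptic given its parent sequence and reports how many residues precede it. Only the peptide AAGTVFTTVEDLGSK is taken from Fig. 8A; the 22-residue leader and the rest of the toy precursor are invented placeholders and are not the real basigin sequence.

```python
def classify_peptide(protein, peptide):
    """Return (label, number_of_residues_preceding_the_peptide)."""
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein sequence")
    end = start + len(peptide)
    # N terminus is consistent with trypsin if preceded by K/R or at protein start.
    n_ok = start == 0 or protein[start - 1] in "KR"
    # C terminus is consistent if the peptide ends in K/R or at protein end.
    c_ok = end == len(protein) or peptide[-1] in "KR"
    labels = {2: "fully tryptic", 1: "partially tryptic", 0: "non-tryptic"}
    return labels[n_ok + c_ok], start


# Hypothetical precursor: invented 22-residue leader + peptide from Fig. 8A
# + an invented downstream tryptic peptide.
TOY_PRECURSOR = "M" + "L" * 20 + "G" + "AAGTVFTTVEDLGSK" + "ILTVHEDR"

print(classify_peptide(TOY_PRECURSOR, "AAGTVFTTVEDLGSK"))
print(classify_peptide(TOY_PRECURSOR, "ILTVHEDR"))
```

A partially tryptic peptide whose non-tryptic end lies near the N terminus is a candidate for a processing site, subject to the caveats about in-source or in-solution fragmentation noted above.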



FIG. 8. Identification of mature forms of proteins in shotgun proteomics. A, identification of a partially tryptic peptide, AAGTVFTTVEDLGSK, indicates the removal of the 22-residue-long signal peptide. B, identification of two partially tryptic peptides, RPLVASVGLNVPASVCY and SHTDIKVPDFSEYR, located adjacent to each other in the protein sequence indicates the cleavage of the 78-amino acid presequence of the ubiquinol-cytochrome c reductase iron-sulfur subunit. The processed presequence remains as a subunit of a protein complex.

 
Another example is shown in Fig. 8B where several peptides were identified and assigned to a single protein sequence database entry "ubiquinol-cytochrome c reductase iron-sulfur subunit, mitochondrial precursor" (Swiss-Prot accession number P47985, Rieske protein). Two of the peptides, RPLVASVGLNVPASVCY and SHTDIKVPDFSEYR, are partially tryptic and located adjacent to each other in the protein sequence, which suggests the cleavage of the 78-amino acid presequence (annotated in Swiss-Prot as "transit" peptide). Interestingly the identification of several peptides assigned to the N-terminal region of the protein indicates that the presequence has not been degraded. This observation is consistent10 with the results of previous studies suggesting that the Rieske protein in the mammalian systems is processed in a single proteolytic step after it becomes associated with the cytochrome bc1 complex and that the processed presequence remains as a subunit of the complex (52).

The strategy for the detection of proteolytic cleavage events, described above, relies on the identification of the N- and C-terminal peptides. However, in shotgun analysis of complex protein mixtures, the protein coverage (the number of identified peptides per protein) is typically low especially in the case of low abundance proteins. Thus, such events would only be detected for a fraction of all proteins, typically those of high abundance. The efficiency of the method can be improved by using targeted protein identification strategies designed to increase the likelihood of identifying N- and C-terminal peptides. One such strategy is based on isolation of N-terminal peptides from in vivo N-terminus-blocked proteins using fractional diagonal chromatography (53). The method can be further improved by optimizing the computational MS/MS data interpretation strategies to specifically look for peptides indicative of the proteolytic cleavage (54).


    QUANTITATIVE PROTEOMICS
 
Mass spectrometry is increasingly used not only for the identification of proteins but also for their quantification (quantitative proteomics) (for recent reviews, see Refs. 1 and 55–57). The two problems are interdependent and in fact complementary, e.g. the quantitative information can be used to resolve some of the peptide grouping ambiguities.

Although methods are being developed for the determination of absolute protein abundance levels (58–60), most current quantitative proteomic experiments are based on the determination of relative protein expression levels between two or more different pools of proteins. In the most straightforward application, quantitative proteomics is used as the equivalent of the microarray gene expression profiling approach (61) except that the measurement is performed at the protein, rather than mRNA, level. The shotgun proteomic approach can be made quantitative by applying stable isotope labeling of proteins or peptides. This is illustrated in Fig. 9A using the most common case of a two-sample comparison. The compared samples can represent two different cell states (e.g. before and after a perturbation) or cells grown under different conditions. The proteins are labeled separately with either light (sample 1) or heavy (sample 2) stable isotopes. The labeling can be done in a number of ways, e.g. chemically (ICAT, iTRAQ, etc.) or metabolically (e.g. SILAC) (for reviews, see Refs. 1 and 55–57). Proteins from both samples are mixed and enzymatically digested into peptides. Labeled peptides are separated and subjected to sequencing and quantification using mass spectrometry. Peptides are identified from MS/MS spectra as described previously, and the quantitative information is extracted either from MS spectra (e.g. in ICAT- or SILAC-based quantitative methods) or directly from MS/MS spectra (iTRAQ) using software tools specifically developed for that purpose (62–65). Quantification is based on measuring relative ion intensity of heavy and light labeled peptide ions. Relative abundances of peptides between the two samples are then combined to compute the relative protein abundances. In addition to global protein profiling experiments, the same quantitative strategy can be used in a targeted way, e.g. for distinguishing members of macromolecular complexes or cell organelles from nonspecifically co-purifying proteins (66–69). It should also be mentioned that although the discussion here is centered on the quantitative proteomic approach based on isotopic labeling, it applies equally to semiquantitative methods based on simple peptide counts (70, 71) or on peptide ion current profiling (72, 73).11



FIG. 9. Quantitative shotgun proteomic analysis using stable isotopes. A, in one quantitative method (ICAT), proteins are labeled using light or heavy mass tags and then digested into peptides. Labeled peptides are captured and sequenced using tandem mass spectrometry. Peptides are identified from MS/MS spectra using database searching and used to infer which proteins are present in the sample. Relative abundances of peptides between the compared samples are extracted from MS data, and then the relative protein abundance ratios are computed based on the ratios of observed peptides. The relative abundance ratio of a distinct peptide is a direct measure of the abundance ratio of its protein (for peptide 1, protein A; for peptide 3, protein B; and for peptide 4, protein C), whereas it is a weighted average of the abundance ratios of all its corresponding proteins in the case of a shared peptide (peptides 2 and 5). B, connection between the relative quantification observed at peptide and protein levels. Distinct peptides 1 and 3 directly measure the relative protein abundance ratios of their corresponding proteins A and B, RA and RB. The relative abundance ratio of the shared peptide 2, r2, can be anywhere between the protein ratios RA and RB depending on the absolute abundances of A and B. Quantitative information can be used to resolve some cases of shared peptides. If peptides 4 and 5 have significantly different ratios r4 and r5, it can be explained by the presence of protein D in the sample.

 
The relative protein abundance ratios between the compared samples are computed based on the ratios of observed peptides. For a distinct peptide, its relative abundance ratio is a direct measure of the abundance ratio of its corresponding protein.12 In contrast, the relative abundance ratio in the case of a shared peptide is a weighted average of the abundance ratios of all its corresponding proteins with the weighting factors being determined by the absolute abundance of those proteins in the samples. This is illustrated in Fig. 9A where two differentiable proteins, A and B, are inferred to be present in the samples based on the identification of three peptides (proteins C and D are discussed later in this section). In this example, one of the peptides (peptide 2) is shared between the two proteins, and the other two peptides (peptides 1 and 3) are unique to protein A or B, respectively. The relative protein abundance ratios of these proteins, RA and RB, can be measured using the relative abundance ratios of their distinct peptides, r1 and r3, respectively (see Fig. 9B). The relative abundance ratio of the shared peptide 2, r2, can be anywhere between the protein ratios RA and RB depending on the absolute abundances of A and B in both samples that are being compared, NA, NB (sample 1) and NA', NB' (sample 2).
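
The weighted average can be made explicit with a short numerical sketch in Python. It assumes that ratios are expressed as sample 2 over sample 1 (so RA = NA'/NA and RB = NB'/NB) and that the shared peptide is produced and detected equally well from both proteins; under these assumptions r2 = (NA' + NB')/(NA + NB) = (NA RA + NB RB)/(NA + NB). The abundance values are arbitrary illustrative numbers, not measurements.

```python
def shared_peptide_ratio(n_a, n_b, n_a_prime, n_b_prime):
    """Ratio (sample 2 / sample 1) of a peptide shared by proteins A and B."""
    return (n_a_prime + n_b_prime) / (n_a + n_b)


def weighted_average_ratio(n_a, n_b, r_a, r_b):
    """Same quantity, written as an abundance-weighted average of RA and RB."""
    return (n_a * r_a + n_b * r_b) / (n_a + n_b)


# Illustrative abundances: A is abundant and unchanged, B is scarce and doubles.
N_A, N_B = 90.0, 10.0              # sample 1
N_A_PRIME, N_B_PRIME = 90.0, 20.0  # sample 2

R_A = N_A_PRIME / N_A  # what distinct peptide 1 would measure
R_B = N_B_PRIME / N_B  # what distinct peptide 3 would measure

r2_direct = shared_peptide_ratio(N_A, N_B, N_A_PRIME, N_B_PRIME)
r2_weighted = weighted_average_ratio(N_A, N_B, R_A, R_B)
print(R_A, R_B, r2_direct, r2_weighted)
```

Because the weights are the sample 1 abundances, the shared peptide ratio stays close to the ratio of the more abundant protein.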

An example of this kind is shown in Fig. 10. In that experiment, lipid rafts were isolated from both control and stimulated Jurkat human T cells, and the protein samples were quantitatively compared using the ICAT method (27). A number of peptides were identified that are shared between several members of the guanine nucleotide-binding protein (G protein) family, including α inhibiting activity polypeptides 1, 2, and 3. Isotopically labeled peptides for which quantitative information is available (Cys-containing ICAT-labeled peptides) are shown in Fig. 10. The identification of Gi α3 and α2 proteins was also supported by several additional unlabeled distinct peptides for which no quantitative information is available (sequences not shown). The quantification of the protein Gi α3 was based on one distinct ICAT-labeled peptide that was found to be present at higher abundance in the stimulated sample compared with the control sample (relative peptide abundance ratio close to 2:1). At the same time, quantification of Gi α2 was based on five distinct ICAT-labeled peptides showing no significant difference in their abundances between the two samples (average relative abundance ratio close to 1:1). Thus, although protein quantification based on a single distinct peptide should be interpreted with caution, it appears likely that these two members of the same gene family exhibited a different response to the external stimulation with Gi α3 being up-regulated and Gi α2 not changing. Interestingly the relative abundance ratios of the other two ICAT-labeled peptides that were shared between Gi α3 and Gi α2 were much closer to that of peptides unique to Gi α2 (the shared peptides are also present in another member of the gene family, Gi α1, but no distinct peptides were identified that would suggest the presence of that protein in the sample). This indicates that the absolute abundance level of the protein Gi α2 was greater than that of Gi α3 in agreement with a rough protein abundance measure such as the number of matched MS/MS spectra, 79 versus 27, determined for Gi α2 and Gi α3 proteins, respectively.
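
The reasoning in this example can be turned around: given the ratios of the distinct peptides and of a shared peptide, the weighted-average relation yields the fraction of the shared signal contributed by each protein in the reference sample. The Python sketch below uses the approximate protein ratios quoted above (1:1 and 2:1), but the shared peptide ratio of 1.1 is a hypothetical value chosen for illustration, since the text states only that the shared ratios were much closer to those of Gi α2. It also assumes a shared peptide contributed by exactly two proteins.

```python
def abundance_fraction(r_shared, r_a, r_b):
    """Fraction of protein A among A + B in the reference sample.

    Solves r_shared = w * r_a + (1 - w) * r_b for w.
    """
    if r_a == r_b:
        raise ValueError("equal protein ratios: shared peptide is uninformative")
    return (r_shared - r_b) / (r_a - r_b)


R_ALPHA2 = 1.0  # approximately 1:1, from the distinct peptides
R_ALPHA3 = 2.0  # approximately 2:1, from the single distinct peptide
R_SHARED = 1.1  # hypothetical value for a shared peptide

fraction_alpha2 = abundance_fraction(R_SHARED, R_ALPHA2, R_ALPHA3)

# Rough independent estimate from the matched MS/MS spectrum counts (79 vs 27).
spectral_fraction_alpha2 = 79 / (79 + 27)

print(round(fraction_alpha2, 2), round(spectral_fraction_alpha2, 2))
```

Both estimates point the same way, with Gi α2 the more abundant protein, although neither should be read as an accurate absolute measurement.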



FIG. 10. Identification and quantification of a group of peptides shared between several members of the guanine nucleotide-binding protein (G protein) family, α inhibiting activity polypeptides 1, 2, and 3 (Swiss-Prot accession numbers P04898, P04899, and P08754).

 
Quantitative information can therefore be used to resolve some cases of shared peptides or suggest the presence of multiple protein isoforms having a different biological function. This is again illustrated in Fig. 9 where protein C is identified by peptides 4 and 5 having relative peptide abundance ratios r4 and r5, respectively. Peptide 5 is also present in protein D. Because there are no distinct peptides in the dataset that correspond to protein D, it is not possible to conclude that this protein is present in the sample given the sequences of identified peptides alone (subset protein identification). At the same time, if protein D is in the sample, its presence would be reflected in the relative abundance ratio of the shared peptide 5, whereas the relative abundance ratio of the distinct peptide 4 would always be determined solely by the relative abundance of protein C. Thus, significantly different ratios r4 and r5 would indicate the presence of protein D in the sample. Note that the reverse is not necessarily true, i.e. observation of consistent peptide ratios does not rule out the presence of protein D because it can simply reflect a significantly lower abundance level of protein D compared with protein C.
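
A simple way to screen for such cases is to compare the ratio of each shared peptide with the ratios of the distinct peptides of the same protein. The Python sketch below does this on a log2 scale. The 1.5-fold tolerance is an arbitrary assumption, the peptide names and ratios are invented, and a real analysis would also need to account for the measurement error of each ratio.

```python
import math


def flag_inconsistent(ratios, distinct, max_fold=1.5):
    """Return shared peptides whose ratio deviates from the distinct-peptide
    reference (mean log2 ratio) by more than max_fold."""
    reference = sum(math.log2(ratios[p]) for p in distinct) / len(distinct)
    limit = math.log2(max_fold)
    return [
        peptide
        for peptide, ratio in ratios.items()
        if peptide not in distinct and abs(math.log2(ratio) - reference) > limit
    ]


distinct_to_c = {"peptide_4"}  # peptide 5 is shared with protein D

# Case 1: r5 differs strongly from r4, suggesting a contribution from protein D.
print(flag_inconsistent({"peptide_4": 1.0, "peptide_5": 2.4}, distinct_to_c))

# Case 2: consistent ratios; this does NOT rule out protein D, which may
# simply be much less abundant than protein C.
print(flag_inconsistent({"peptide_4": 1.0, "peptide_5": 1.1}, distinct_to_c))
```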

Furthermore observation of peptides with inconsistent relative abundance ratios when all peptides appear to be distinct (according to the protein sequence database used in the analysis) can point to the presence of a novel biologically significant protein form (e.g. novel splice variant, product of protein degradation, etc.). One such interesting example has been noticed recently in a quantitative proteomic study concerned with the identification of a human transcription factor using an ICAT proteomic approach (74). Among six identified peptides that were assigned to cellular nucleic acid-binding protein (CNBP), three peptides from the N terminus of the protein had an average relative abundance ratio (enriched sample versus control) of less than 3:1, whereas the other three peptides derived from the C-terminal portion of the protein had ratios of more than 7:1. Thus, it has been suggested that two different forms of CNBP (or CNBP and its homologue) are present in the sample. In other cases, inconsistencies in the relative peptide abundance ratio can be due to post-translational modification of the protein, e.g. if one of the peptides is phosphorylated and its abundance (in the unmodified form) is different in the compared samples.

A close connection between the problem of assembling peptides into proteins and determining protein abundance ratios suggests a new integrated approach for dealing with quantitative proteomic data. At present, these two tasks are performed separately with the protein ratios computed using peptide ratios and the apportionment of shared peptides among their corresponding proteins performed independently of the quantitative data. Instead the apportionment of shared peptides and creation of the protein summary lists can be made dependent on the quantitative information observed at the peptide and protein level. This should enhance the interpretation of the data by resolving some of the ambiguities discussed above. However, such an approach would require high quality quantitative proteomic data. At present, the accuracy of relative peptide abundance ratios extracted from mass spectra using automated software tools often requires manual validation. This is especially true in the case of peptide "outliers," i.e. peptides whose relative abundance ratios are significantly different from the ratios observed for other peptides assigned to the same protein, which are of utmost interest in the context of this discussion. The development of such integrated tools is an imminent task for shotgun proteomics.


    INTEGRATION OF PROTEOMIC AND TRANSCRIPTIONAL DATA
 
Quantitative MS/MS-based proteomic analysis and DNA microarray analysis are two complementary technologies that measure gene expression at the protein and RNA levels, respectively. Due to its technically more advanced stage, the microarray technology (61) allows monitoring of RNA expression levels for a number of genes that is significantly larger than the number of proteins that can be accurately identified and quantified in a typical proteomic experiment, and it can be effectively used for the analysis of alternative splicing and genome annotation (75, 76). However, due to post-transcriptional regulatory mechanisms such as protein translation, post-translational modifications, and degradation, the microarray measurements of mRNA expression patterns alone are not sufficient for understanding protein expression and function (77, 78). Thus, by combining transcriptional and proteomic analysis of the same samples, it becomes possible to achieve a better understanding of complex biological systems. A number of integrative proteomic and transcriptional analyses have been recently performed, including studies on model organisms and mammalian cells and tissues (79–82). A recent review on the subject of integrating microarray and proteomic data can be found in Ref. 83, and the discussion here will be limited to the issues related to the protein inference problem.

Integration of different data types requires a good understanding of the underlying technologies and their limitations. Although a detailed review of the microarray technology goes beyond the scope of this article, it is interesting to note that many of the difficulties discussed here in the context of quantitative MS-based proteomic experiments are also present in the analysis of gene expression using microarrays. Unlike quantitative shotgun proteomics, in the case of oligonucleotide arrays the sequences of DNA probes present on the array are known in advance (the sequences of peptide "probes" in shotgun proteomics are determined from the spectra). Still ambiguities remain in connecting DNA probes to the target mRNAs (84). For example, multiple probes can map to the same gene; the same probe can map to different products of the same gene or even to multiple genes. Multiple probes mapping to the same gene can produce significantly different expression ratios; outliers might indicate the presence of several alternative splice forms, but they could also be a result of inaccurate quantification (75). Furthermore cross-hybridization, i.e. binding of the labeled RNA to non-target homologous probe sequence, introduces additional errors (85).

Integration of proteomic and transcriptional data is hindered by lack of relevant annotations and the use of different accessioning schemes. The information available for each probe present on an Affymetrix chip, for example, includes an arbitrary identification number, the GenBank™ accession number of the target RNA sequence, and brief functional annotation. In the case of MS-based proteomics, experimental MS/MS spectra are assigned peptides, and then peptides are assembled into proteins using a variety of protein sequence databases. Each protein sequence database has its unique accessioning scheme, and the degree of sequence annotation does not always allow easy cross-reference between different protein sequence databases or between protein and genomic sequence databases.

Correlating mRNA and protein data can be facilitated by selecting a well annotated database, e.g. UniGene, as a common reference (86) (Fig. 11). The UniGene database is created by an automated partitioning of GenBank™ sequences into a non-redundant set of gene-oriented clusters with each cluster containing sequences representing a unique gene. A number of tools have been described recently that can link the probes from Affymetrix arrays to the UniGene cluster identifiers (87). In turn, MS-derived protein identification datasets can be related to the UniGene clusters using the known connection between the RefSeq protein sequence database and UniGene. A tool for direct mapping of Affymetrix probes to RefSeq sequences has also been described (84).
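
Linking through a common reference amounts to joining two lookup tables on the cluster identifier, as in the Python sketch below. All identifiers are invented placeholders; they do not follow the real Affymetrix, RefSeq, or UniGene naming schemes, and the two input mappings are assumed to have been obtained beforehand from the tools cited above.

```python
def link_by_cluster(probe_to_cluster, protein_to_cluster):
    """Group probes and proteins under their common reference cluster."""
    linked = {}
    for probe, cluster in probe_to_cluster.items():
        entry = linked.setdefault(cluster, {"probes": [], "proteins": []})
        entry["probes"].append(probe)
    for protein, cluster in protein_to_cluster.items():
        entry = linked.setdefault(cluster, {"probes": [], "proteins": []})
        entry["proteins"].append(protein)
    return linked


# Hypothetical mappings (array probe -> cluster, identified protein -> cluster).
probe_to_cluster = {
    "probe_1": "cluster_1",
    "probe_2": "cluster_1",
    "probe_3": "cluster_2",
}
protein_to_cluster = {"protein_A": "cluster_1", "protein_B": "cluster_3"}

linked = link_by_cluster(probe_to_cluster, protein_to_cluster)

# Only clusters observed at both the mRNA and the protein level can be compared.
matched = {c: v for c, v in linked.items() if v["probes"] and v["proteins"]}
print(matched)
```

In this toy case only cluster_1 has both kinds of evidence, and it is measured by two probes, which already shows that the correspondence need not be one-to-one.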



FIG. 11. Integration of proteomic and transcriptional data. mRNA and proteomic data can be linked using a common reference database such as UniGene.

 
Although UniGene and RefSeq can provide a common reference for connecting proteomic and transcriptional data, a one-to-one correspondence will not always be possible. For example, some DNA probes cannot be linked to any UniGene clusters because their target sequences have been removed from the latest version of GenBank™ or deemed to be redundant and excluded from the UniGene build process. Furthermore in many proteomic studies, proteins are identified by searching MS/MS spectra against more complete protein sequence databases than RefSeq, e.g. IPI or Entrez Protein. Connecting protein sequences that are not annotated in RefSeq to the UniGene clusters is not straightforward. One of the main difficulties again comes from alternative splice forms. Without the ability to resolve different alternative splice forms, both on the part of proteomic and transcriptional analyses, the association between the two data types is not unique. As a result, the integration and correlation between proteomic and transcriptional data in some cases can be performed only at the gene level with mRNA and protein expression ratios averaged over multiple products of the same gene. Despite these difficulties, integrated analysis of mRNA and protein data can provide very valuable insights into complex biological systems.
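
Gene-level averaging of this kind can be sketched in Python as follows. The choice of a geometric mean (an average on the log2 scale) is an assumption made here for illustration; the text does not prescribe a particular averaging scheme. All identifiers and ratios are invented, and items that cannot be mapped to a gene are simply dropped.

```python
import math
from collections import defaultdict


def gene_level_ratios(ratios, item_to_gene):
    """Average expression ratios per gene using the geometric mean."""
    log_ratios = defaultdict(list)
    for item, ratio in ratios.items():
        gene = item_to_gene.get(item)
        if gene is not None:  # unmapped probes or proteins are skipped
            log_ratios[gene].append(math.log2(ratio))
    return {gene: 2 ** (sum(v) / len(v)) for gene, v in log_ratios.items()}


# Hypothetical protein-level ratios for two isoforms of one gene.
protein_ratios = {"isoform_1": 2.0, "isoform_2": 8.0, "unmapped_protein": 3.0}
protein_to_gene = {"isoform_1": "gene_X", "isoform_2": "gene_X"}

# Hypothetical mRNA-level ratio from a probe that cannot tell the isoforms apart.
mrna_ratios = {"probe_1": 4.0}
probe_to_gene = {"probe_1": "gene_X"}

protein_by_gene = gene_level_ratios(protein_ratios, protein_to_gene)
mrna_by_gene = gene_level_ratios(mrna_ratios, probe_to_gene)
print(protein_by_gene, mrna_by_gene)
```

The gene-level values agree in this toy case even though the two isoforms respond very differently, which is exactly the information that is lost when the comparison cannot be made at the level of individual gene products.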


    INTEGRATION OF MULTIPLE SHOTGUN PROTEOMIC DATASETS AND GENE-CENTERED DATA INTERPRETATION
 
The discussion so far has been limited to the analysis and interpretation of the data generated in a "single" experiment where all MS/MS data are acquired on a particular biological sample of interest. However, due to technical limitations of current proteomic technologies, in any given large scale proteomic experiment only a subset of the entire proteome is identified. In repeated analysis of the same type, the cumulative number of identified peptides and proteins quickly reaches a saturation point. A more comprehensive characterization of the entire proteome can be achieved by combining the data from multiple diverse experiments (different tissues or cell types, enrichment schemes, etc.) (50, 51, 88, 89). Furthermore performing secondary, centralized analysis of the datasets previously analyzed and published by individual laboratories can uncover interesting global trends not apparent in the analysis of any single dataset alone (Fig. 12).



FIG. 12. Submission of mass spectrometry data to public repositories allows extraction of additional valuable information that otherwise would be missed in the analysis of a single experiment by an individual laboratory. More comprehensive characterization of the entire proteome can be achieved by combining the data from multiple diverse experiments (different tissues or cell types, enrichment schemes, etc.). In one such example, the PeptideAtlas project, MS/MS datasets from different laboratories are processed using the same high throughput pipeline. Identified peptides are mapped to the genome via the Ensembl gene index. Peptide sequences along with the chromosomal locations, sample annotation, and other information are stored in a relational database. The data can be visualized in the Ensembl genome browser, and the database itself can be mined to study global trends of protein expression. Peptide identification data, if communicated back to the database developers and annotators, can also be used to improve the quality of the protein sequence databases. Reanalysis of high quality MS/MS spectra that are left unassigned in a typical database search against a protein sequence database can lead to the identification of new open reading frames, novel splice forms, and sequence polymorphisms.

 
The task of combining and comparing multiple large scale datasets generated using different biological samples (e.g. different cell states or tissues) requires the development of new approaches and computational tools. Due to the peptide-centric nature of shotgun proteomics, diverse datasets (from the same organism) can be best combined at the peptide level by linking the sequences of the identified peptides to a common gene index. One such approach, based on the mapping of peptides observed in a large group of proteomic experiments to the Ensembl genome, has been described recently (51) and implemented in a public resource, PeptideAtlas (www.peptideatlas.org). In this approach, peptide identifications passing a certain probability threshold are matched to proteins in the Ensembl database. The chromosomal coordinates, or multiple sets of coordinates in the case of peptides matching to more than one gene, and Ensembl protein accession numbers are retrieved for all matched peptides. The results are stored in a relational database and can be