Originally published In Press as doi:10.1074/mcp.T600049-MCP200 on December 12, 2006.
Molecular & Cellular Proteomics 6:527-536, 2007.
© 2007 by The American Society for Biochemistry and Molecular Biology, Inc.


Technology

EBP, a Program for Protein Identification Using Multiple Tandem Mass Spectrometry Datasets*,S

Thomas S. Price‡, Margaret B. Lucitt‡, Weichen Wu‡, David J. Austin‡, Angel Pizarro‡, Anastasia K. Yocum§, Ian A. Blair§, Garret A. FitzGerald‡,|| and Tilo Grosser‡,**

From the ‡ Institute for Translational Medicine and Therapeutics and § Center for Cancer Pharmacology, University of Pennsylvania, Philadelphia, Pennsylvania 19104


    ABSTRACT
 
MS/MS combined with database search methods can identify the proteins present in complex mixtures. High throughput methods that infer probable peptide sequences from enzymatically digested protein samples create a challenge in how best to aggregate the evidence for candidate proteins. Typically the results of multiple technical and/or biological replicate experiments must be combined to maximize sensitivity. We present a statistical method for estimating probabilities of protein expression that integrates peptide sequence identifications from multiple search algorithms and replicate experimental runs. The method was applied to create a repository of 797 non-homologous zebrafish (Danio rerio) proteins, at an empirically validated false identification rate under 1%, as a resource for the development of targeted quantitative proteomics assays. We have implemented this statistical method as an analytic module that can be integrated with an existing suite of open-source proteomics software.


Proteomics methods aim to identify the proteins expressed in biological samples (1) just as transcriptomics methods detect and quantify the abundance of RNA molecules (2). Both techniques can point to particular tissue correlates, such as disease biomarkers, and find networks of co-regulated genes and proteins that reveal the mechanisms underlying biological processes (3). As thousands of simultaneous measurements are taken, however, balancing sensitivity of detection against rates of false positive error is a major challenge. With rapid advances in large scale proteomics technology, there is an urgent need for methods that maximize sensitivity while controlling false protein identifications by integrating the results of multiple experiments.

In "shotgun" proteomics studies, a protein sample is digested with a proteolytic enzyme, and the resulting peptide mixture is separated by LC prior to conducting collision-induced dissociation and MS/MS (1). This generates thousands of peptide ion fragmentation spectra containing diagnostic amino acid sequence information. Peptide sequences are inferred by matching the resulting product ion spectra to theoretical or empirical spectra in peptide sequence databases (4). No match is exact, so the peptide identifications are not certain. Database search algorithms either calculate a probability that the spectrum results from a random fragmentation (e.g. MASCOT, Ref. 5) or use heuristic scores (such as SEQUEST, Ref. 6), which can then be converted into probabilities of accurate peptide identification (7, 8). A variety of other statistical models have also been proposed to score peptide matches (9). These algorithms have to some extent complementary merits, and the sensitivity and specificity of peptide sequence identifications from MS/MS data can be improved by the use of more than one search algorithm (10–12). A recent comparison of five database search programs concluded with the recommendation to use consensus scoring from at least two different algorithms and noted the need for appropriate methods of combining scores from different algorithms (11). Database search methods using different assumptions and procedures can result in markedly different sets of protein identifications (e.g. Ref. 13).

Probable peptide sequences need to be assembled to proteins (14). A simple method for identifying proteins is to filter the peptide matches using empirically derived thresholds and then sort by parent protein. Variants of this approach do not attempt to estimate false identification rates, nor do they deal in a useful way with "degenerate" peptides whose sequence matches more than one protein in the database (15–17). More sophisticated methods use statistical models to calculate protein expression probabilities (12, 18–20) or ranking scores (21) based on the number and quality of peptide sequence matches and other criteria, such as protein length, the size of the database, and the degeneracy of the peptide sequence identifications. These models can be empirically validated by searching against databases containing random "distractors" such as reversed protein sequences (12, 21–23).

A major problem is that many protein identifications have low reproducibility (24): in general, only abundant proteins are reliably identified (25, 26). Sensitivity is improved by performing multiple technical replicates (27), perhaps using different instrumentation and technical procedures (10, 25). This creates a demand for analytical methods that can integrate data from multiple experiments, which will likely increase as common data standards emerge that facilitate the exchange of proteomic datasets from diverse sources.1,2

We present a statistical model that implements consensus scoring of peptide identifications from multiple search algorithms and combines information from independent replicate experiments in a single analysis while allowing sensitivity to be balanced against false identification rate. The model assumes that every peptide sequence that could theoretically result from enzymatic digestion of a protein in the database has a chance of being identified in the search results, whether correctly or incorrectly. The probabilities of correct identification are combined across multiple peptide searches using a function that returns the maximum probability from consensus identifications and penalizes non-consensus identifications. Both correct and incorrect peptide sequence identifications are assumed to occur at random in this "space" of peptides at rates that are governed by model parameters including protein length, estimated protein abundance, the size of the search database, and the number of peptide sequence identifications in the dataset. For each protein in the database, a likelihood ratio is calculated for the possibility that the peptide identifications whose sequence matches the protein are all incorrect. These likelihood ratios are used to estimate the expression probabilities from which updated parameter estimates are obtained. The procedure is iterated until convergence at the maximum likelihood estimates. Replicates are integrated by simultaneously estimating multiple sets of model parameters. Degenerate peptides whose sequence matches multiple proteins are treated using "Occam’s razor," a principle by which the smallest set of probable proteins is chosen that is sufficient to explain the peptide sequence identifications (19). An outline of the statistical model used and its implementation in the program Empirical Bayes Protein identifier (EBP)3 is described under "Experimental Procedures." 
The software is available upon request for download as an extension to the Windows release of the Institute of Systems Biology Trans-Proteomic Pipeline.

We applied the model to a set of three biological replicates of zebrafish proteins, searched by both SEQUEST and MASCOT algorithms. The zebrafish has emerged as an informative model organism for studies of vertebrate biology (41), genetics (28), and pharmacology (29, 30). Its genome has been recently sequenced, and an acceptably complete and non-redundant protein sequence database is available (35). The aim was to identify as many reliably expressed proteins as possible in larvae 5 days postfertilization, a developmental stage at which most organ systems are functional. This would allow the selection of candidate proteins for targeted quantitation (31) in mutagenesis (28) and chemical screens (32).


    EXPERIMENTAL PROCEDURES
 
Sample Collection
Zebrafish embryos were bred and raised from natural matings of wild-type tail long fin fish as described previously (33) and staged according to existing protocols (34). Embryos were collected at 120 h postfertilization, immediately flash frozen using liquid nitrogen, and stored at –80 °C. Frozen embryos were thawed and washed twice with PBS to remove any residual medium before lysis. Five tubes of 20–30 embryos were lysed in 7 M urea, 2 M thiourea, 4% CHAPS (GE Healthcare), 100 mM DTT (Bio-Rad), Phosphatase Inhibitor Mixture 11 (Sigma), and Complete protease inhibitor mixture tablet (Roche Applied Science) and disrupted using a Qiagen TissueLyser (2 × 3 min at 30 Hz). Samples were centrifuged at 12,000 × g for 30 min at 4 °C, and the supernatant was collected. Proteins were precipitated using methanol/chloroform and resuspended in 0.2% (w/v) RapiGest™ SF (Waters, Milford, MA), in 50 mM ammonium bicarbonate (Sigma). Samples containing 2.0 mg of protein were reduced with 5 mM DTT for 60 min at 60 °C and alkylated with 5 mM iodoacetamide (Bio-Rad) for 30 min at room temperature. Trypsin (Promega, Madison, WI) digestion using a 1:20 enzyme:protein (w/w) ratio at 37 °C for 16 h was carried out in 0.2% (w/v) RapiGest SF (Waters), in 50 mM ammonium bicarbonate. The pH was then lowered to 3 with 1% formic acid solution followed by incubation at 37 °C for 1 h and centrifugation. The precipitate was removed, and an equal volume of strong cation exchange (SCX) mobile phase A (10 mM ammonium formate, 25% acetonitrile, pH 3) was added. The solution was incubated at 4 °C for 4 h before SCX separation.

Two-dimensional Chromatography and Mass Spectrometry
Protein digestion, off-line fractionation, and LC-MS/MS were performed as described previously (25). SCX separation of the peptides was carried out on a PolySulfoethyl A column (100 × 4.6 mm, 5 µm, 300 Å; PolyLC, Columbia, MD) at a flow rate of 0.2 ml/min. The sample was loaded for 10 min with mobile phase A (10 mM ammonium formate, 25% acetonitrile, pH 3), and a gradient was run to 100% mobile phase B (500 mM ammonium formate, 25% acetonitrile, pH 6.8) to elute the peptides over 80 min. Forty fractions were collected for reversed phase separation on-line to a Thermo Finnigan LTQ ion trap mass spectrometer (Thermo Electron, San Jose, CA) using ESI. Each fraction was injected onto a Vydac C18 (Everest, 150 × 1 mm, 300 Å, 5 µm; Bodmann, Aston, PA) column with 0.1% formic acid, 0.01% trichloroacetic acid in water for 10 min before a gradient (30 µl/min) was run over 120 min from 3 to 70% mobile phase B (0.1% formic acid, 0.01% trichloroacetic acid in acetonitrile). The mass spectrometer was operated in a data-dependent MS/MS mode (m/z 300–2000) in which the top seven ions were subjected to fragmentation. Dynamic mass exclusion was enabled with a repeat count of 2 every 45 s for a list size of 250.

Data Processing
Raw mass spectra were converted to DTA peak lists using BioWorks Browser 3.2 (Thermo Finnigan, San Jose, CA) with the following parameter settings: peptide mass range, 300–5000 Da; threshold, 10; precursor mass, ±1.4 Da; group scan, 1; minimum group count, 1; minimum ion count, 15. Searches were conducted using SEQUEST and MASCOT against a database containing both forward and reverse sequences of proteins contained in the International Protein Index (IPI) Danio rerio protein sequence database version 3.07 (35). It was specified that peptides should have a maximum of two internal cleavage sites with methionine oxidation and cysteine carbamidomethylation as possible modifications. SEQUEST searches specified that peptides should possess at least one tryptic terminus and used a peptide mass tolerance of ±1.4 Da and a fragment ion tolerance of 0. MASCOT searches specified tryptic digestion and used a peptide mass tolerance of ±1.5 Da and a fragment ion tolerance of ±0.1 Da. The search results were converted into pepXML format (42). Peptide identification probabilities for both SEQUEST and MASCOT searches were calculated by executing PeptideProphet (7). SEQUEST results were processed using the "-Ol" tag, which uses ΔCn* values unchanged. Peptide sequence identifications with probabilities greater than 0.05 were exported for the protein probability analyses. EBP and ProteinProphet (19) analyses were run using SEQUEST and MASCOT results for each sample both separately and together. EBP analyses were run using the default settings except for the calculation of number of trypsin digests per protein, which specified peptides with at least one tryptic terminus for the SEQUEST analyses and two tryptic termini for the analyses of MASCOT and combined SEQUEST plus MASCOT data.
Combined EBP analyses were conducted in which the results for all three replicates were analyzed together with protein probabilities calculated for the hypothesis that expression was evident in at least one of the three samples.

Statistical Model
Assumptions
Peptide Probabilities—
Table I lists and describes the quantities used or estimated as part of the model. We begin with a set of MS/MS spectra to which possible peptide sequence identifications have been assigned using one or more search algorithms. Each peptide sequence identification has an associated probability ranging from 0 for random matches to 1 for peptide sequences that can be identified with absolute certainty. The first step in our procedure summarizes these probabilities. The combined probability $p_{is}$ for peptide identification $i$ of spectrum $s$ is computed as the proportion of algorithms (e.g. MASCOT and SEQUEST) that identify peptide sequence $i$ multiplied by the maximum probability identification. Repeated identifications of the same peptide sequence in multiple spectra of the same search are not treated as independent events but as repeated fragmentations of the same peptide resulting in spectra of varying quality. A conservative strategy is adopted by which only the highest probability identification is retained for each peptide. The resulting dataset comprises $n$ peptide identifications, each of which is either correct with peptide probability $p_i$ or is a random match with probability $1 - p_i$. We assume that the accuracy of each peptide identification is an independent random event. For ease of computation, the peptide probabilities may be bounded at some low threshold by ignoring, say, all $p_i \leq 0.05$.
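The combination rule described above can be illustrated with a minimal Python sketch. It is not taken from the EBP source code; the function names, the input layout (one dictionary per spectrum mapping a search algorithm to its top peptide and probability), and the example peptides are assumptions made purely for illustration.

```python
from collections import defaultdict


def combine_spectrum(ids_by_algorithm):
    """Combine the identifications of ONE spectrum across search algorithms.

    ids_by_algorithm: dict mapping algorithm name -> (peptide, probability).
    Returns dict peptide -> combined probability, computed as
    (proportion of algorithms naming the peptide) * (maximum probability).
    """
    n_algorithms = len(ids_by_algorithm)
    votes = defaultdict(list)
    for peptide, prob in ids_by_algorithm.values():
        votes[peptide].append(prob)
    return {pep: (len(ps) / n_algorithms) * max(ps) for pep, ps in votes.items()}


def best_per_peptide(spectra, threshold=0.05):
    """Keep only the highest combined probability seen for each peptide,
    then drop peptides with p <= threshold (hypothetical default of 0.05)."""
    best = {}
    for ids in spectra:
        for pep, p in combine_spectrum(ids).items():
            if p > best.get(pep, 0.0):
                best[pep] = p
    return {pep: p for pep, p in best.items() if p > threshold}


# Two spectra: one consensus identification, one disagreement (made-up data).
spectra = [
    {"SEQUEST": ("PEPTIDEK", 0.9), "MASCOT": ("PEPTIDEK", 0.8)},
    {"SEQUEST": ("ELVISK", 0.7), "MASCOT": ("LIVESK", 0.6)},
]
combined = best_per_peptide(spectra)
```

In this toy input the consensus peptide keeps its maximum probability, whereas each non-consensus peptide is halved because only one of the two algorithms reported it.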


 
TABLE I Notation of quantities used in algorithm

 
Protein Matches—
We seek to identify a set of "true" protein matches $T$ that includes every protein that is truly expressed in the sample and whose sequence matches at least one correctly identified peptide sequence. This requires first identifying a set of "true plus homologue" protein matches $H$ that includes every protein whose sequence matches at least one correctly identified peptide whether or not these proteins are actually expressed in the sample. That is, $H$ includes all truly expressed proteins plus proteins that are not expressed but whose sequence matches correct peptide hits in homologous proteins that are truly expressed. In symbolic terms, $T \subseteq H$, and both $T$ and $H$ are subsets of the set of all proteins in the search database, $\Omega$.

The search database $\Omega$ contains $N$ proteins, each of which may be expressed in our sample. We know that protein $j \in \Omega$ when subjected to enzymatic digestion theoretically results in a set of proteolytic peptides, $D_j$, which may include peptides with missed cleavages or non-enzymatic termini and those with specified modifications. A function of the size of this set, $y_j$ = exp [Formula], is a useful measure of protein length: assuming a log-normal null distribution of search scores (8), the probability that the highest scoring random match is to one of the $|D_j|$ theoretical digests from protein $j$ is proportional to exp [Formula]. Let $M_j$ be the set of $z_j$ peptide identifications whose sequence matches protein $j$, i.e. those peptide identifications that are among the theoretical digest products of protein $j$: $M_j = \{p_i : i \in D_j;\ i = 1, \ldots, n\}$.

Degeneracy—
Our treatment of degenerate peptide identifications, whose sequence matches more than one protein in the database, follows that of ProteinProphet (19). In some instances, a group of proteins exclusively share a set of peptide sequence matches. These proteins are compiled as a "degenerate protein group," and a single expression probability is estimated for the group of proteins because any or all of them may be present in the sample, and they cannot be distinguished. Whenever a peptide matches more than one protein but is not part of such a degenerate cluster (in other words, it is part of an overlapping but not identical subset of matches), it is assumed that the protein matches are the result of either one true match, the remainder being matches due to homology with the expressed protein, or random matches due to error in matching the mass spectrum to the peptide sequence. The choice of which protein match is the correct one is treated as a multinomial event with the class probabilities given by a set of "weights" $w_{ij}$, $\sum_{j:\, i \in D_j} w_{ij} = 1$, whose values depend on the relative expression probabilities of the proteins $j$ to which the peptide matches. The probability that peptide $i$ is a true match to protein $j$ is given by $w_{ij} p_i$, the probability of a homology match is given by $(1 - w_{ij}) p_i$, and the probability of a random match is $1 - p_i$. The initial estimates for the weights are chosen to be equal for all proteins that match peptide $i$.
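As a small worked illustration of the three-way split just described, the following Python sketch computes the true-match, homology-match, and random-match probabilities for one peptide and one protein, together with the equal initial weights. The function names and the protein labels are hypothetical and are not part of EBP.

```python
def match_probabilities(p_i, w_ij):
    """Split a peptide identification with probability p_i, matched to
    protein j with weight w_ij, into the three outcomes described in the text."""
    true_match = w_ij * p_i            # correct peptide, and protein j is its source
    homology_match = (1.0 - w_ij) * p_i  # correct peptide, but from a homologue
    random_match = 1.0 - p_i           # the peptide identification itself is wrong
    return true_match, homology_match, random_match


def initial_weights(proteins_matching_peptide):
    """Equal starting weights over all proteins that match one peptide."""
    k = len(proteins_matching_peptide)
    return {protein: 1.0 / k for protein in proteins_matching_peptide}


weights = initial_weights(["protA", "protB", "protC"])  # made-up protein labels
outcome = match_probabilities(p_i=0.9, w_ij=weights["protA"])
```

The three outcome probabilities always sum to 1, and the weights for a given peptide sum to 1 across the proteins it matches.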

Protein Abundance—
Highly abundant proteins are likely to accumulate many more peptide matches than proteins of low abundance. Protein abundance has been shown to be strongly correlated with the number of spectra whose peptide identifications match the protein sequence at least for MS/MS using non-data-dependent acquisition (13, 26). Let us call this quantity $v_j$, estimated by the total of weighted peptide identification probabilities for all spectra matching peptides in $j$. Proteins are assigned to ordinal abundance categories $a_j$ by binning these $v_j$ at preset thresholds. Most proteins have no peptide matches and are assigned to the lowest abundance category.
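The binning of the spectral evidence into ordinal abundance categories can be sketched in Python as follows. This is only an illustration: the function names and input layout are assumptions, and the single threshold of 10 is taken from the EBP default settings described later in this article.

```python
from bisect import bisect_right


def spectral_evidence(matches):
    """Estimate v_j as the total of weighted peptide probabilities.

    matches: list of (weight, probability) pairs, one per spectrum whose
    peptide identification matches protein j.
    """
    return sum(w * p for w, p in matches)


def abundance_category(v_j, thresholds=(10.0,)):
    """Return the ordinal category of v_j: 0 is the lowest bin, and a value
    equal to a threshold falls into the higher bin (v_j >= threshold)."""
    return bisect_right(thresholds, v_j)


v_low = spectral_evidence([(1.0, 0.9), (0.5, 0.8)])   # 1.3
v_high = spectral_evidence([(1.0, 0.95)] * 12)        # about 11.4
categories = (abundance_category(0.0), abundance_category(v_low),
              abundance_category(v_high))
```

A protein with no matching spectra has $v_j = 0$ and therefore lands in the lowest category, as the text notes is the case for most proteins.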

Parameters—
Protein expression probabilities are calculated conditionally on a set of parameters that summarize the data and govern the rate of accumulation of true and random peptide matches to proteins. For proteins in abundance class $a$, the model parameters are $\theta_a = (N_a, \tau_a, n_a, \gamma_a, \kappa_a, \lambda_a)$. $N_a$ is the number of proteins of abundance $a$, and $\tau_a$ is their total length so that the entire protein space $\Omega$ contains $N = \sum_a N_a$ proteins with a total length of $\sum_a \tau_a$. $n_a$ is the total number of peptide matches to proteins of abundance $a$. Note that because of degeneracy $\sum_a n_a \geq n$. $\gamma_a$ is the proportion of proteins of abundance $a$ that are also in $H$, and $\kappa_a$ is their total length. Thus, $H$ contains $\sum_a N_a \gamma_a$ proteins with a total length of $\sum_a \kappa_a$. Finally $\lambda_a$ is the number of these peptide identifications that are correct.

Probability of Membership of H
The condition $H_j = (j \in H)$ is true if at least one of the peptide identifications in $M_j$ is a correct match. We assume that correct peptide identifications are independently and randomly distributed among the proteins in $H$ at a rate that is proportional to the effective length of the protein so that the number of correct identifications matching a protein $j \in H$ is Poisson distributed with parameter $\lambda_{a_j} y_j / \kappa_{a_j}$. We also assume that the $n_{a_j} - \lambda_{a_j}$ incorrect peptide identifications are independently randomly distributed among the proteins so that the number of incorrect identifications matching a random protein $j \in \Omega$ is Poisson distributed with parameter $(n_{a_j} - \lambda_{a_j}) y_j / \tau_{a_j}$. Using Bayes’ theorem, the estimated odds of membership of $H$ given the data equal the prior odds $\hat{\gamma}_{a_j} / (1 - \hat{\gamma}_{a_j})$ multiplied by the likelihood ratio of membership of $H$ given the peptide matches $M_j$ and the expected parameter values $\hat{\theta}_{a_j}$. (The "hat" notation (^) indicates an estimated value.)

Estimation of the Protein Expression Probabilities—
The condition $T_j = (j \in T)$ requires that at least one of the peptide identifications matching $j$ is correct (i.e. $j \in H$) and that this is so because the digest product is truly expressed in the sample. The estimated weights $w_{ij}$ together with $M_j$ and $\hat{\theta}_{a_j}$ are used to calculate the probabilities of membership of $T$ conditional on membership of $H$, and hence the probabilities of membership of $T$, the true protein expression probabilities.

Estimation Using the "Expectation-Maximization" (EM) Algorithm
Maximum likelihood values for the parameters are calculated using the EM algorithm. The posterior probabilities of homology set membership $h_j = \Pr(H_j \mid \theta, M_j)$ are initialized using $h_j = \gamma_{a_j}[0]$ where $\gamma_{a_j}[0]$ is an initial parameter value. (The "posterior" probability means the probability conditional on the data, $M_j$.) At the "Expectation" step, we calculate the expected values of the parameters $\hat{\theta}$ given the estimated probabilities $h_j$. For the "Maximization" step, we compute the probabilities of membership of $H$ and $T$ given the matching peptide identifications $M_j$ and the expected parameter values. The peptide weights are then updated according to a function that assigns the highest weight to the protein or proteins with the greatest probability of expression. When iterated, this procedure converges to a maximum function that downweights all but the most likely proteins, thereby restricting the results set to a minimal set of proteins necessary to explain the peptide identifications. The updated weights are used to update $v_j$ and hence estimate the abundance categories $\hat{a}_j$. The effect of the Maximization step is to successively maximize the profile likelihoods $L(H \mid M, \hat{\theta}, \hat{a})$, $L(T \mid h, M, \hat{\theta}, \hat{a}, w)$, $L(w \mid T, M, \hat{a})$, and $L(a \mid T, w)$ under the simplifying assumption that all $H_j$ and $T_j$ are conditionally independent given $M$, $\hat{\theta}$, $\hat{a}$, and $w$. The effect of the Expectation step is to maximize the profile likelihood $L(\theta \mid h)$. The Expectation and Maximization steps are repeated until convergence at the maximum likelihood estimates of $\theta$, $H$, $T$, $a$, and $w$ given $M$.

Expectation Step—
The Expectation step proceeds as follows to calculate the expected values of the parameters $\hat{\theta}_a$ for each abundance category $a$. $N_a$ and $n_a$ are calculated, respectively, as the number of proteins of abundance $a$ and the number of peptide identifications matching these proteins. $\hat{\tau}_a$ is the total length of these proteins. $\hat{\gamma}_a$ is calculated as the expected value of an unweighted sum of Bernoulli variables $H_j$ divided by $N_a$. $\hat{\kappa}_a$ is calculated as the expected sum of the Bernoulli variables $H_j$ weighted by the protein "lengths." Finally $\hat{\lambda}_a$ is computed as the expected sum of Bernoulli variables $p_i$ totaled over all the peptide matches to each protein.

Maximization Step—
The first part of the Maximization step consists in the calculation of maximum likelihood values for the probabilities $h_j$ given the data and the expected values of the parameters. According to Bayes’ theorem, the posterior odds that $j \in H$ are given by the prior odds $\hat{\gamma}_a / (1 - \hat{\gamma}_a)$ multiplied by the Bayes’ factor. The Bayes’ factor equals the likelihood ratio of $h_j$ given the estimated parameters $\hat{\theta}_{a_j}$ and the peptide matches to protein $j$, $M_j$. (Henceforward the $a_j$ suffixes to the parameter estimates are omitted.) Algebraically the posterior odds that $j \in H$ are given by Equation 1.

Formula 1(Eq. 1)

(The symbol ¬ indicates logical negation, equivalent to the Boolean operator NOT.) The likelihood that $j \notin H$ is the probability that all $z_j$ peptide matches in $M_j$ have arisen by chance,

Formula 2(Eq. 2)

where $0 < p_i \leq 1$ and $\mathrm{Poisson}(n \mid \theta) = e^{-\theta} \theta^n / n!$. The likelihood that $j \in H$ equals the sum of the probabilities that there are $s$ correct peptide identifications and $z_j - s$ random matches in $M_j$, $0 \leq s \leq z_j$,

Formula 3(Eq. 3)

where $0 < p_i \leq 1$ and $c_H(s) = \prod_{M_j} (1 - p_i) \cdot \sum_{s\text{-set } S \subseteq M_j} \prod_{M_j / S} \mathrm{odds}\{p_i\}$, summing over all possible $S$, subsets of $M_j$ with $s$ members. This likelihood is undefined if $p_i = 0$ for any $i$, indicating that $H_j$ is true with probability 1. Note also that there remains the possibility that $H_j$ is true, if all the peptide matches are random ($s = 0$) or if there are no peptides matching it at all ($z_j = 0$). This corresponds to the small but non-zero probability that $j$ is a protein that is truly expressed in the sample but to which no identified peptides match. Knowing the quantities from Equations 2 and 3 we can now estimate the probabilities $h_j$ by applying Bayes’ theorem using Equation 1.
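Because the typeset equations are not reproduced here, the following Python sketch shows only one possible reading of how the quantities described in words for Equations 1 to 3 fit together. It assumes (this is an interpretation, not the authors' stated formula) that the likelihood for $j \notin H$ is the Poisson probability of $z_j$ random matches times $\prod (1 - p_i)$, and that the likelihood for $j \in H$ sums, over $s$, the Poisson probabilities of $s$ correct and $z_j - s$ random matches times the sum over $s$-member subsets of the product of the odds $p_i / (1 - p_i)$, with the same $\prod (1 - p_i)$ factor. All function and argument names are hypothetical.

```python
import math


def poisson_pmf(n, theta):
    """Poisson(n | theta) = exp(-theta) * theta**n / n!"""
    return math.exp(-theta) * theta ** n / math.factorial(n)


def elementary_symmetric(values):
    """e[s] = sum over all s-member subsets of the product of their values."""
    e = [1.0] + [0.0] * len(values)
    for k, v in enumerate(values, start=1):
        for s in range(k, 0, -1):  # descending so each value is used once
            e[s] += e[s - 1] * v
    return e


def posterior_h(probs, y_j, gamma, kappa, lam, tau, n):
    """Posterior probability that protein j is in H (assumed reading).

    probs: peptide probabilities p_i in M_j (each must be < 1 here).
    The common factor prod(1 - p_i) cancels in the likelihood ratio.
    Requires 0 < gamma < 1 and a positive random-match rate when probs
    is non-empty.
    """
    z = len(probs)
    rate_true = lam * y_j / kappa          # correct matches to a protein in H
    rate_random = (n - lam) * y_j / tau    # incorrect matches to any protein
    e = elementary_symmetric([p / (1.0 - p) for p in probs])
    lik_h = sum(poisson_pmf(s, rate_true) * poisson_pmf(z - s, rate_random) * e[s]
                for s in range(z + 1))
    lik_not_h = poisson_pmf(z, rate_random)
    post_odds = gamma / (1.0 - gamma) * lik_h / lik_not_h  # prior odds * ratio
    return post_odds / (1.0 + post_odds)


# Made-up parameter values, for illustration only.
params = dict(y_j=2.0, gamma=0.5, kappa=10.0, lam=5.0, tau=100.0, n=20)
h_no_matches = posterior_h([], **params)
h_one_good_match = posterior_h([0.99], **params)
```

Under this reading a protein with no matching peptides retains a small non-zero posterior probability, in line with the remark in the text, while a single high-probability match raises the posterior sharply.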

The next part of the Maximization step estimates the probabilities of true protein expression. The posterior probability of $T_j$ conditional on $H_j$ is the probability that correct peptide identifications truly match protein $j$ rather than a homologous sequence in a different protein,

Formula 4(Eq. 4)

where $0 < p_i \leq 1$ and $c_T(s) = \prod_{M_j} (1 - p_i) \cdot \sum_{s\text{-set } S \subseteq M_j} \left(1 - \prod_S [1 - w_{ij}]\right) \prod_S \mathrm{odds}\{p_i\}$, summing over all possible $S$, subsets of $M_j$ with $s$ members. Estimates of the true protein expression probabilities can be derived using Equation 5.


Formula 5(Eq. 5)

The estimated protein expression probabilities are used to update the peptide weights according to a function that assigns the highest weight to the protein or proteins with the greatest probability of expression. These updated weights are used to calculate updated estimates $\hat{v}_j$ and update the abundance classifications $\hat{a}_j$.

As an optional refinement to the model, the peptide probabilities $p_i$ can be adjusted for sequence homology. Assuming that degeneracy is independent of all the other parameters used in calculating the probabilities that the database search results are correct, the odds that peptide identification $i$ is correct given that its sequence matches $k_i$ proteins in the search database is given by Equation 6.

Formula 6(Eq. 6)

This process can be incorporated into the EM loop, updating the peptide probabilities to their posterior values at each iteration until convergence.

Extension to Multiple Replicate Experiments—
The model allows integrated analysis of data from $R$ independent replicate experiments. Separate sets of parameters, probabilities of membership of $H$, and probabilities of membership of $T$ conditional on membership of $H$ are estimated for each replicate. The overall protein expression probabilities are calculated using a combinatorial function of these probabilities: typically evidence of true expression is required from at least $x$ replicates where $1 \leq x \leq R$. The peptide weights are calculated as done previously using these overall expression probabilities.
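One simple combinatorial function of this kind is the probability that a protein is expressed in at least $x$ of the $R$ replicates. The Python sketch below computes it under the additional assumption, not stated in the text, that the per-replicate expression events are independent; the function name is hypothetical and the exact function used in EBP may differ.

```python
def prob_at_least(replicate_probs, x):
    """Probability that at least x of the replicate events occur,
    assuming the replicates are independent (Poisson-binomial tail)."""
    dist = [1.0]  # dist[k] = probability of exactly k successes so far
    for p in replicate_probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)   # this replicate shows no expression
            new[k + 1] += mass * p       # this replicate shows expression
        dist = new
    return sum(dist[x:])


# Made-up per-replicate expression probabilities for one protein.
probs = [0.9, 0.8, 0.7]
in_at_least_one = prob_at_least(probs, 1)   # 1 - 0.1 * 0.2 * 0.3
in_all_three = prob_at_least(probs, 3)      # 0.9 * 0.8 * 0.7
```

With $x = 1$ this reduces to one minus the product of the per-replicate non-expression probabilities, which corresponds to the "at least one of the three samples" hypothesis used for the combined zebrafish analyses.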

EBP Implementation of the Algorithm
A program was developed to implement the algorithm using a slight refinement of the statistical model motivated by empirical Bayesian concepts, giving rise to the name: the Empirical Bayes Protein identifier. This modification uses binomial theory to estimate maximum likelihood hyperparameters for $\lambda$, assuming an underlying Gamma distribution; Equations 2, 3, and 4 are then calculated by integrating over $\lambda$. This procedure helps stabilize estimates of the parameter set [Formula], avoiding a local maximum in the likelihood at [Formula] = 0. Computational adaptations were used to optimize the speed of the algorithm for proteins with many peptide matches.

The default settings exclude peptide identifications with probabilities less than 0.5 for the purposes of calculating protein identification. When analyzing the combined results of two search algorithms, this is equivalent to including only those spectra that are matched to the same peptide by both algorithms. It is advantageous to exclude low probability peptide matches as far as possible because the likelihood function is dominated by the number of peptide hits to each protein. The default settings also specify two abundance categories, with a high threshold of $v_j \geq 10$ defining a category of abundant proteins, for the analysis of complex proteomes. These default values were chosen to be prudent by removing error associated with low probability peptide identifications and conservative in defining high abundance proteins.

To facilitate data integration, public data repositories need an exchangeable format and common data standards. We plan to migrate the input and output formats of EBP to the emerging analysisXML format from the PSI Mass Spectrometry Working Group1,2 once this platform becomes stable. Currently EBP uses pepXML for input and an exchangeable output format, ebpXML, that closely resembles protXML. Both pepXML and protXML are open source data formats developed at the Institute of Systems Biology, Seattle, WA (42).

Analyses of a Test Protein Mixture
Mass spectra from the sample dataset, derived from electrospray LC-MS/MS of a mixture of 18 non-human proteins, were searched using SEQUEST and MASCOT against a sequence database containing human peptides plus the 18 non-human proteins and likely contaminants (36). The search results were preprocessed using PeptideProphet (7) to generate two sets of peptide sequence identifications and probabilities. Protein identifications were derived by applying the EBP and ProteinProphet (19) algorithms to the PeptideProphet outputs using the default settings. For the purposes of this study, an alteration was made to the PeptideProphet output to improve the comparability of the two algorithms. Specifically the lists of possible protein matches were merged for peptide sequences with indistinguishable mass (i.e. those with identical sequence except for leucine-isoleucine substitutions). This procedure is performed by default in the EBP software but is not performed by ProteinProphet. Proteins whose estimated expression probabilities exceeded cutoffs of p ≥ 0.9 and p ≥ 0.7 were reported. Sensitivity was calculated as the proportion of the 18 sample proteins identified. The empirical error rate was calculated as the proportion of proteins identified that were neither in the sample mixture nor known contaminants. The estimated error rate was calculated as the difference between the average expression probability and 1 for proteins passing the threshold.
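The three summary measures defined above (sensitivity, empirical error rate, and estimated error rate) can be written out as a short Python sketch. The function name, the input layout, and the toy protein labels are assumptions for illustration and do not come from the study data.

```python
def evaluate(identified, sample_proteins, contaminants, threshold=0.9):
    """Summarize protein identifications at a probability cutoff.

    identified: dict mapping protein -> estimated expression probability.
    Returns (sensitivity, empirical_error, estimated_error).
    """
    passing = {prot: p for prot, p in identified.items() if p >= threshold}
    # Proportion of the known sample proteins that were identified.
    sensitivity = len(set(passing) & set(sample_proteins)) / len(sample_proteins)
    # Identified proteins that are neither sample proteins nor known contaminants.
    false_hits = [prot for prot in passing
                  if prot not in sample_proteins and prot not in contaminants]
    empirical_error = len(false_hits) / len(passing) if passing else 0.0
    # One minus the average expression probability of the passing proteins.
    estimated_error = 1.0 - sum(passing.values()) / len(passing) if passing else 0.0
    return sensitivity, empirical_error, estimated_error


# Toy example with made-up labels and probabilities.
identified = {"A": 0.95, "B": 0.90, "X": 0.92, "C": 0.50}
sens, emp_err, est_err = evaluate(identified,
                                  sample_proteins={"A", "B", "C", "D"},
                                  contaminants={"K"})
```

Comparing the empirical and estimated error rates in this way indicates whether the model's probabilities are conservative (estimated error above empirical error) or optimistic at a given cutoff.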

Estimation and Empirical Validation of the False Positive Rate for Analyses of Complex Protein Samples
Electrospray ionization MS/MS spectra from three independent zebrafish protein samples, separated by two-dimensional LC, were searched against a concatenated database of forward and reverse zebrafish protein sequences using both SEQUEST and MASCOT. The results were preprocessed (8) to generate two sets of peptide sequence identifications and probabilities for each sample. The error rate was validated using the assumption that reversed sequence identifications are all false positives and that false positive identifications occur with equal probability to forward and reversed sequences. Hence, if the model identifies f forward sequence and r reversed sequence proteins at a given probability, the total number of false positives is empirically estimated as 2r, and the error rate of all identified proteins (i.e. both forward and reversed sequence proteins) is given by x = 2r/(r + f). The rate of false identifications to forward sequence proteins, an empirical estimate of the false positive error rate, is given by r/f = x/(2 – x). Accordingly the estimated probabilities for forward sequence identifications were adjusted by the function x ∈ (0, 1): x → 2x/(x + 1), the inverse of x ∈ (0, 1): x → x/(2 – x), to obtain the true expression probability estimates. This adjustment is calculated automatically in the software implementation of EBP.
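The relationship between the error rate over all identifications and the error rate among forward sequence identifications can be checked numerically. The following minimal Python sketch uses hypothetical counts (f = 95, r = 5) and illustrative function names; it is not taken from the EBP software.

```python
import math


def combined_error_rate(f: int, r: int) -> float:
    """Error rate over forward plus reversed identifications: x = 2r/(r + f)."""
    return 2 * r / (r + f)


def forward_error_rate(x: float) -> float:
    """Error rate among forward identifications only: x/(2 - x), equal to r/f."""
    return x / (2 - x)


def adjust_probability(p: float) -> float:
    """The adjustment 2p/(p + 1), the inverse of the map x -> x/(2 - x)."""
    return 2 * p / (p + 1)


# Hypothetical counts at some probability cutoff.
f, r = 95, 5
x = combined_error_rate(f, r)      # 2*5/100
fwd = forward_error_rate(x)        # should agree with r/f

# Consistency checks on the algebra.
agrees_with_ratio = math.isclose(fwd, r / f)
round_trip = math.isclose(adjust_probability(forward_error_rate(x)), x)
# Applying x/(2 - x) to the error 1 - p gives the same forward
# probability as applying 2p/(p + 1) to p directly.
p = 0.9
same_adjustment = math.isclose(1 - forward_error_rate(1 - p), adjust_probability(p))
```

All three checks evaluate to True, which is the sense in which the two maps are mutual inverses and the probability adjustment is equivalent to rescaling the error rate.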


    RESULTS
 
Proof of Principle—
Proof of principle that the EBP algorithm can correctly identify the constituents in a protein mixture was established by applying the EBP algorithm to a pre-existing dataset used to validate the ProteinProphet algorithm (19, 36). The dataset was derived from electrospray LC-MS/MS analysis of a mixture of 18 non-human proteins titrated in varying concentrations to simulate the range of concentrations found in complex samples (36). Both algorithms identified largely the same proteins. The error rates in proteins identified by the EBP algorithm were conservative both at the level of individual and integrated analysis of search results (Table II). In contrast, ProteinProphet identified a somewhat greater proportion of the sample proteins than EBP but at the expense of considerably elevated levels of false positive error (Table II, columns a–c). ProteinProphet only gave accurate error estimates when the MASCOT and SEQUEST search results were combined using the method applied in EBP (Table II, column d). Using this method of data integration, both ProteinProphet and EBP identified 10 of the 18 proteins in the sample with high probability (p ≥ 0.9) and no errors.



 
TABLE II Rates of sensitivity and estimated and empirical rates of false positive error for ProteinProphet and EBP analyses of MASCOT (column a) and SEQUEST (column b) search results for the 18-protein mixture dataset (36)

Combined analyses of SEQUEST and MASCOT search results were performed using either standard EBP/ProteinProphet methods (column c) or the EBP method for integrating multiple search datasets prior to ProteinProphet analysis (column d). Sens., sensitivity; Emp., empirical; Est., estimated.
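The three quantities reported in Table II follow the definitions given under "Experimental Procedures": sensitivity is the proportion of sample proteins identified, the empirical error rate is the proportion of identified proteins that are neither sample proteins nor known contaminants, and the estimated error rate is 1 minus the mean probability of the proteins passing the cutoff. A minimal Python sketch of these calculations follows; the protein labels, probabilities, and function name are hypothetical.

```python
import math


def evaluate(probabilities, sample_proteins, contaminants, cutoff):
    """Return (sensitivity, empirical error, estimated error) at a cutoff."""
    called = {prot for prot, p in probabilities.items() if p >= cutoff}
    sensitivity = len(called & sample_proteins) / len(sample_proteins)
    if not called:
        # No identifications: error rates are undefined, reported as None.
        return sensitivity, None, None
    # False identifications are neither sample proteins nor contaminants.
    false_hits = called - sample_proteins - contaminants
    empirical_error = len(false_hits) / len(called)
    mean_probability = sum(probabilities[prot] for prot in called) / len(called)
    estimated_error = 1 - mean_probability
    return sensitivity, empirical_error, estimated_error


# Hypothetical toy data: four sample proteins, one contaminant.
probs = {"s1": 0.99, "s2": 0.95, "s3": 0.60, "c1": 0.92, "x1": 0.94}
sample = {"s1", "s2", "s3", "s4"}
contaminants = {"c1"}
sens, emp, est = evaluate(probs, sample, contaminants, cutoff=0.9)
```

With these toy values, two of four sample proteins pass the cutoff (sensitivity 0.5), one of four identifications is false (empirical error 0.25), and the estimated error is approximately 0.05.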

 
Estimation and Empirical Validation of the False Positive Rate on Replicated Dataset—
Database searches of the three zebrafish samples followed by PeptideProphet (8) analysis resulted in two sets of peptide identifications for each sample that were thresholded at a minimum probability of 0.05. MASCOT identified 103,052 singly, doubly, and triply charged spectra with probabilities >0.05 from all three samples. SEQUEST identified 102,821 such spectra. Thus, a total of 205,872 possible peptide sequence identifications were submitted as input to EBP both as separate analyses per sample and algorithm and in a combined analysis in which proteins were called truly expressed when putatively identified in at least one of the three samples. The empirical error rates were well within the estimated error rates in analyses of both combined (Fig. 1, a and b) and individual samples (Supplemental Fig. 1). Quantile-quantile plots attesting to the appropriateness of the correction for protein length used in the combined sample SEQUEST and MASCOT analyses are shown in Supplemental Fig. 2. In contrast, analyses run without length correction (i.e. assuming that the rate of incorrect peptide identifications is independent of protein length) gave too many random matches to long proteins.



 
FIG. 1. False positive error rates in combined sample analyses. a, estimated and empirical error rates for all protein identifications from EBP analyses of peptide sequence identifications from SEQUEST and MASCOT searches separately and in combination. b, sensitivity and error rates for forward sequence protein identifications for the analysis of both SEQUEST and MASCOT search data. ROC, receiver operating characteristic curve (estimated (est.) sensitivity (sens.) plotted against estimated error).

 
Individual Versus Integrated Analysis—
In our analyses, SEQUEST identified more non-homologous proteins than MASCOT in each individual sample and in the combined analysis of all samples as exemplified by proteins whose estimated expression probabilities were ≥0.9 (Fig. 2, a–d). This discrepancy is likely to be caused at least in part by the different probability models applied to MASCOT and SEQUEST data during preprocessing (7). The number of false positives is estimated by the number of reversed sequence identifications, given in parentheses. No such false positives were identified in the overlap between SEQUEST and MASCOT results in any individual sample. In contrast, protein identifications from SEQUEST and MASCOT that were not confirmed in the other analysis were error-prone. More proteins were identified in each individual sample and in the combined analysis of all samples by integrating SEQUEST and MASCOT results than by calculating the overlap between separate SEQUEST and MASCOT analyses.
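The overlap comparison described here amounts to partitioning the two protein lists into three regions and counting reversed sequence identifications in each. A minimal Python sketch is given below; the protein labels and the "rev_" prefix used to mark reversed sequences are hypothetical conventions chosen for illustration.

```python
def venn_counts(sequest, mascot, decoy_prefix="rev_"):
    """Count (forward, reversed) proteins in each region of a two-set overlap."""
    regions = {
        "sequest_only": sequest - mascot,
        "overlap": sequest & mascot,
        "mascot_only": mascot - sequest,
    }
    summary = {}
    for name, proteins in regions.items():
        decoys = {p for p in proteins if p.startswith(decoy_prefix)}
        # Reversed sequence hits serve as the estimate of false positives.
        summary[name] = (len(proteins - decoys), len(decoys))
    return summary


# Hypothetical protein calls at a fixed probability cutoff.
sequest_calls = {"A", "B", "C", "rev_X"}
mascot_calls = {"B", "C", "D", "rev_Y"}
counts = venn_counts(sequest_calls, mascot_calls)
```

In this toy example the overlap contains two forward proteins and no reversed sequences, whereas each search-specific region contains one forward protein and one reversed sequence.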



 
FIG. 2. Overlap among non-homologous forward sequence proteins with expression probability ≥0.9. Numbers in parentheses indicate estimated numbers of false positives as indicated by reversed sequence identifications. For each diagram, homology is calculated with respect to all the relevant analyses so that the numbers of proteins may not correspond to the totals from the individual analyses. a, sample 1. b, sample 2. c, sample 3. d, combined samples. e, legend for a–d. f, overlap among non-homologous proteins from different samples, for both SEQUEST and MASCOT search data, with expression probability ≥0.9.

 
The biological replicate analyses identified partially overlapping sets of proteins (Fig. 2f). Combining the samples in a single analysis was slightly more sensitive than taking the overlap among the analyses of individual samples. For the analyses integrating SEQUEST and MASCOT data, all but 26 of the proteins identified in any of the three individual samples were also identified by the combined analysis including one estimated false positive. A total of 43 proteins were identified by the combined analysis but not by the analyses of the individual samples with no estimated false positives.

The omnibus analysis of MASCOT and SEQUEST search results for all three replicates (Fig. 1b) identified 797 forward sequence proteins with an estimated probability ≥0.82 at an empirical error rate of 0.8% compared with an estimated error rate of 1%. For the 874 proteins with an estimated probability ≥0.22, the empirical error rate was 2.5% compared with an estimated error rate of just under 5%. An interactive list of all proteins identified with probability ≥0.1 together with the supporting peptide identifications is available upon request.
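The cutoffs quoted above pair a probability threshold with an estimated error rate. One way such a pairing could be derived, assuming the estimated error rate is 1 minus the mean probability of the proteins passing the threshold (the definition used for the test mixture analysis), is sketched below in Python. The probabilities and the function name are hypothetical.

```python
def threshold_for_error(probabilities, target_error):
    """Find the longest ranked list whose estimated error stays within target.

    Returns (probability cutoff, number of proteins), or None if even the
    top-ranked protein exceeds the target.
    """
    ranked = sorted(probabilities, reverse=True)
    best = None
    running_sum = 0.0
    for n, p in enumerate(ranked, start=1):
        running_sum += p
        estimated_error = 1 - running_sum / n
        if estimated_error > target_error:
            # The mean can only fall as lower-ranked proteins are added.
            break
        best = (p, n)
    return best


# Hypothetical protein probabilities.
protein_probs = [0.99, 0.98, 0.97, 0.90, 0.50]
result = threshold_for_error(protein_probs, target_error=0.05)
```

For these toy values the first four proteins have a mean probability of 0.96, so the list is cut at a probability of 0.90 with four proteins; adding the fifth would raise the estimated error above the 5% target.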

Comparison with ProteinProphet—
The replicated zebrafish datasets were submitted to both EBP and ProteinProphet to compare the protein sets identified by the two algorithms. Whereas EBP estimated error rates conservatively for both SEQUEST and MASCOT search results, ProteinProphet estimated rates of error accurately only for the MASCOT datasets (Table III, column a). ProteinProphet analysis of the SEQUEST search results identified excessive numbers of false positives (Table III, column b). ProteinProphet also identified too many false positives in the combined analysis of SEQUEST and MASCOT results: the protein list with an estimated error rate of 1% had an empirical error rate of 6%, and the protein list with an estimated error rate of 5% had an empirical error rate greater than 15% (Table III, column c). ProteinProphet achieved greatest sensitivity and accurate rates of error when analyzing SEQUEST and MASCOT search sets using the data integration method implemented in EBP (Table III, column d). In contrast to ProteinProphet, EBP did not identify excessive numbers of false positives for the SEQUEST and MASCOT search sets either separately or in combination.



 
TABLE III Rates of sensitivity and empirical rates of false positive error for ProteinProphet and EBP analyses of MASCOT (column a) and SEQUEST (column b) search results for the combined D. rerio datasets

Combined analyses of SEQUEST and MASCOT search results were performed using either standard EBP/ProteinProphet methods (column c) or the EBP method for integrating multiple search datasets prior to ProteinProphet analysis (column d).

 

    DISCUSSION
 
Shotgun proteomics experiments, which identify proteolytic peptides, need statistical methods to infer the presence of the parent proteins. Probabilistic methods of protein identification allow sensitivity of detection to be balanced against the rate of false positive error, an improvement over traditional filtering approaches. We provide a statistical method that can integrate the results from replicate experiments and multiple searches, implemented as an analytical module that augments an existing suite of open source proteomics software (42). We used this technique to identify nearly 800 non-homologous zebrafish proteins at an empirically validated false identification rate under 1%, a useful resource for selecting protein targets for quantification in high throughput genetic (28) and pharmacological screens (32).

We performed a self-validating analysis on these complex replicate samples by searching against a database containing reversed sequences to obtain both empirical and model-based estimates of false positive error. This validation method suffers the limitation that many of the characteristics of the forward sequences are retained in the distractors, potentially resulting in an excessively conservative analysis (37). For example, spectra for which a correct match does not exist in the original search space, but that match with high probability to reversed sequences, may cause spurious false positives. Care should be taken to ensure a sufficiently complete search with regard to the specified proteins and possible peptide modifications, mutations, and nonspecific cleavages. One of the assumptions of the forward and reverse sequence database search validation method is that the search space includes all the peptides that are actually present in the sample. Specifying an inadequate search space may result in the search finding matches among the reversed sequences that are actually correct identifications rather than false positives due to random error. This violates the statistical model and may invalidate the results of the analysis.

Analysis of a standard protein mixture suggested that our model may achieve more accurate error estimates than the ProteinProphet algorithm (19) at the level of a single replicate. Moreover ProteinProphet is not designed to analyze multiple datasets and on the evidence of our analyses tends to accumulate false positive protein identifications as more data are added. Indeed in the asymptotic case every protein in the dataset would be identified with probability 1 (20). In contrast, EBP explicitly models biological replicates allowing flexible hypothesis testing. The EBP approach to integrating multiple search results is applicable independently of the statistical model and was successfully applied to ProteinProphet.

Consistent with other studies (10–12), consensus scoring from multiple searches improved sensitivity and accuracy. Sensitivity was increased further by combining the results of biological replicates in a single analysis. Future investigations may adapt this statistical approach to datasets acquired using a variety of technical methods, for example in application to large multisite projects (38, 39), and to support methods of protein quantitation (40).


   FOOTNOTES
 
Received, September 6, 2006, and in revised form, November 16, 2006.

Published, MCP Papers in Press, December 12, 2006, DOI 10.1074/mcp.T600049-MCP200

1 A. R. Jones, M. Miller, R. Aebersold, R. Apweiler, C. A. Ball, A. Brazma, J. DeGreef, N. Hardy, H. Hermjakob, S. J. Hubbard, P. Hussey, M. Igra, H. Jenkins, R. K. Julian, Jr., K. Laursen, S. G. Oliver, N. W. Paton, S.-A. Sansone, U. Sarkans, C. J. Stoeckert, Jr., C. F. Taylor, P. L. Whetzel, J. A. White, P. Spellman, and A. Pizarro, manuscript in review.

2 C. F. Taylor, N. W. Paton, K. S. Lilley, P.-A. Binz, R. K. Julian, Jr., A. R. Jones, W. Zhu, R. Apweiler, R. Aebersold, E. W. Deutsch, M. Macht, M. Mann, T. A. Neubert, S. D. Patterson, S. L. Seymour, A. Tsugita, I. Xenarios, and H. Hermjakob, manuscript in review.

3 The abbreviations used are: EBP, Empirical Bayes Protein identifier; SCX, strong cation exchange; EM, Expectation-Maximization.

* This work was supported in part by National Institutes of Health Grants R01CA95586, P50 HL70128, P50 HL81012, MO1RR00040, HL 54500, HL 62250, and HL 70128; American Heart Association National Scientist Development Grant 0430148N (to T. G.); and the Cardiovascular Institute of Philadelphia (to T. G.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.

The A. N. Richards Professor of Pharmacology.

|| The Elmer Bobst Professor of Pharmacology.

** To whom correspondence should be addressed: The Inst. for Translational Medicine and Therapeutics, University of Pennsylvania, 809 BRB II/III, 421 Curie Blvd., Philadelphia, PA 19104. Tel.: 215-573-7600; Fax: 215-573-9004; E-mail: tilo{at}spirit.gcrc.upenn.edu


    REFERENCES
 

  1. Aebersold, R., and Mann, M. (2003) Mass spectrometry-based proteomics. Nature 422, 198–207

  2. Lockhart, D. J., and Winzeler, E. A. (2000) Genomics, gene expression and DNA arrays. Nature 405, 827–836

  3. Hood, L., Heath, J. R., Phelps, M. E., and Lin, B. (2004) Systems biology and new technologies enable predictive and preventative medicine. Science 306, 640–643

  4. Sadygov, R. G., Cociorva, D., and Yates, J. R., III (2004) Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat. Methods 1, 195–202

  5. Perkins, D. N., Pappin, D. J., Creasy, D. M., and Cottrell, J. S. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567

  6. Eng, J. K., McCormack, A. L., and Yates, J. R., III (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989

  7. Keller, A., Nesvizhskii, A. I., Kolker, E., and Aebersold, R. (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392

  8. Lopez-Ferrer, D., Martinez-Bartolome, S., Villar, M., Campillos, M., Martin-Maroto, F., and Vazquez, J. (2004) Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST. Anal. Chem. 76, 6853–6860

  9. Hernandez, P., Muller, M., and Appel, R. D. (2006) Automated protein identification by tandem mass spectrometry: issues and strategies. Mass Spectrom. Rev. 25, 235–254

  10. Elias, J. E., Haas, W., Faherty, B. K., and Gygi, S. P. (2005) Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat. Methods 2, 667–675

  11. Kapp, E. A., Schutz, F., Connolly, L. M., Chakel, J. A., Meza, J. E., Miller, C. A., Fenyo, D., Eng, J. K., Adkins, J. N., Omenn, G. S., and Simpson, R. J. (2005) An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 5, 3475–3490

  12. Resing, K. A., Meyer-Arendt, K., Mendoza, A. M., Aveline-Wolf, L. D., Jonscher, K. R., Pierce, K. G., Old, W. M., Cheung, H. T., Russell, S., Wattawa, J. L., Goehle, G. R., Knight, R. D., and Ahn, N. G. (2004) Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal. Chem. 76, 3556–3568

  13. States, D. J., Omenn, G. S., Blackwell, T. W., Fermin, D., Eng, J., Speicher, D. W., and Hanash, S. M. (2006) Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat. Biotechnol. 24, 333–338

  14. Nesvizhskii, A. I., and Aebersold, R. (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4, 1419–1440

  15. Eddes, J. S., Kapp, E. A., Frecklington, D. F., Connolly, L. M., Layton, M. J., Moritz, R. L., and Simpson, R. J. (2002) CHOMPER: a bioinformatic tool for rapid validation of tandem mass spectrometry search results associated with high-throughput proteomic strategies. Proteomics 2, 1097–1103

  16. Tabb, D. L., Eng, J., and Yates, J. R., III (2001) Protein identification by SEQUEST, in Proteome Research: Mass Spectrometry (James, P., ed) pp. 125–132, Springer, New York

  17. Tabb, D. L., McDonald, W. H., and Yates, J. R., III (2002) DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 1, 21–26

  18. Eriksson, J., and Fenyö, D. (2004) Probity: a protein identification algorithm with accurate assignment of the statistical significance of the results. J. Proteome Res. 3, 32–36

  19. Nesvizhskii, A. I., Keller, A., Kolker, E., and Aebersold, R. (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75, 4646–4658

  20. Sadygov, R. G., Liu, H., and Yates, J. R., III (2004) Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal. Chem. 76, 1664–1671

  21. Moore, R. E., Young, M. K., and Lee, T. D. (2002) Qscore: an algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 13, 378–386

  22. Elias, J. E., Gibbons, F. D., King, O. D., Roth, F. P., and Gygi, S. P. (2004) Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22, 214–219

  23. Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J., and Gygi, S. P. (2003) Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50

  24. Chamrad, D., and Meyer, H. E. (2005) Valid data from large-scale proteomics studies. Nat. Methods 2, 647–648

  25. Yocum, A. K., Yu, K., Oe, T., and Blair, I. A. (2005) Effect of immunoaffinity depletion of human serum during proteomic investigations. J. Proteome Res. 4, 1722–1731

  26. Liu, H., Sadygov, R. G., and Yates, J. R., III (2004) A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal. Chem. 76, 4193–4201

  27. Deutsch, E. W., Eng, J. K., Zhang, H., King, N. L., Nesvizhskii, A. I., Lin, B., Lee, H., Yi, E. C., Ossola, R., and Aebersold, R. (2005) Human Plasma PeptideAtlas. Proteomics 5, 3497–3500

  28. Haffter, P., Granato, M., Brand, M., Mullins, M. C., Hammerschmidt, M., Kane, D. A., Odenthal, J., van Eeden, F. J., Jiang, Y. J., Heisenberg, C. P., Kelsh, R. N., Furutani-Seiki, M., Vogelsang, E., Beuchle, D., Schach, U., Fabian, C., and Nusslein-Volhard, C. (1996) The identification of genes with unique and essential functions in the development of the zebrafish, Danio rerio. Development 123, 1–36

  29. Grosser, T., Yusuff, S., Cheskis, E., Pack, M. A., and FitzGerald, G. A. (2002) Developmental expression of functional cyclooxygenases in zebrafish. Proc. Natl. Acad. Sci. U. S. A. 99, 8418–8423

  30. Pini, B., Grosser, T., Lawson, J. A., Price, T. S., Pack, M. A., and FitzGerald, G. A. (2005) Prostaglandin E synthases in zebrafish. Arterioscler. Thromb. Vasc. Biol. 25, 315–320

  31. Ishihama, Y., Sato, T., Tabata, T., Miyamoto, N., Sagane, K., Nagasu, T., and Oda, Y. (2005) Quantitative mouse brain proteomics using culture-derived isotope tags as internal standards. Nat. Biotechnol. 23, 617–621

  32. Zon, L. I., and Peterson, R. T. (2005) In vivo drug discovery in the zebrafish. Nat. Rev. Drug Discov. 4, 35–44

  33. Westerfield, M. (1995) The Zebrafish Book, 3rd Ed., pp. 75–107, University of Oregon, Eugene, OR

  34. Kimmel, C. B., Ballard, W. W., Kimmel, S. R., Ullmann, B., and Schilling, T. F. (1995) Stages of embryonic development of the zebrafish. Dev. Dyn. 203, 253–310

  35. Kersey, P. J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and Apweiler, R. (2004) The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985–1988

  36. Keller, A., Purvine, S., Nesvizhskii, A. I., Stolyar, S., Goodlett, D. R., and Kolker, E. (2002) Experimental protein mixture for validating tandem mass spectral analysis. Omics 6, 207–212

  37. MacCoss, M. J. (2005) Computational analysis of shotgun proteomics data. Curr. Opin. Chem. Biol. 9, 88–94

  38. Desiere, F., Deutsch, E. W., Nesvizhskii, A. I., Mallick, P., King, N. L., Eng, J. K., Aderem, A., Boyle, R., Brunner, E., Donohoe, S., Fausto, N., Hafen, E., Hood, L., Katze, M. G., Kennedy, K. A., Kregenow, F., Lee, H., Lin, B., Martin, D., Ranish, J. A., Rawlings, D. J., Samelson, L. E., Shiio, Y., Watts, J. D., Wollscheid, B., Wright, M. E., Yan, W., Yang, L., Yi, E. C., Zhang, H., and Aebersold, R. (2005) Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 6, R9

  39. Martens, L., Hermjakob, H., Jones, P., Adamski, M., Taylor, C., States, D., Gevaert, K., Vandekerckhove, J., and Apweiler, R. (2005) PRIDE: the proteomics identifications database. Proteomics 5, 3537–3545

  40. Ong, S.-E., and Mann, M. (2005) Mass spectrometry-based proteomics turns quantitative. Nat. Chem. Biol. 1, 252–262

  41. Sprague, J., Clements, D., Conlin, T., Edwards, P., Frazer, K., Schaper, K., Segerdell, E., Song, P., Sprunger, B., and Westerfield, M. (2003) The Zebrafish Information Network (ZFIN): the zebrafish model organism database. Nucleic Acids Res. 31, 241–243

  42. Pedrioli, P. (2004) A bioinformatic pipeline for the analysis of proteomics data, in the 51st ASMS Conference on Mass Spectrometry and Allied Topics, Montreal, Canada, June 7–12, 2004, American Society for Mass Spectrometry (ASMS), Santa Fe, NM

