





From the Institute for Translational Medicine and Therapeutics and Center for Cancer Pharmacology, University of Pennsylvania, Philadelphia, Pennsylvania 19104
| ABSTRACT |
|---|
In "shotgun" proteomics studies, a protein sample is digested with a proteolytic enzyme, and the resulting peptide mixture is separated by LC prior to conducting collision-induced dissociation and MS/MS (1). This generates thousands of peptide ion fragmentation spectra containing diagnostic amino acid sequence information. Peptide sequences are inferred by matching the resulting product ion spectra to theoretical or empirical spectra in peptide sequence databases (4). No match is exact, so the peptide identifications are not certain. Database search algorithms either calculate a probability that the spectrum results from a random fragmentation (e.g. MASCOT, Ref. 5) or use heuristic scores (such as SEQUEST, Ref. 6), which can then be converted into probabilities of accurate peptide identification (7, 8). A variety of other statistical models have also been proposed to score peptide matches (9). These algorithms have to some extent complementary merits, and the sensitivity and specificity of peptide sequence identifications from MS/MS data can be improved by the use of more than one search algorithm (10–12). A recent comparison of five database search programs concluded with the recommendation to use consensus scoring from at least two different algorithms and noted the need for appropriate methods of combining scores from different algorithms (11). Database search methods using different assumptions and procedures can result in markedly different sets of protein identifications (e.g. Ref. 13).
Probable peptide sequences need to be assembled to proteins (14). A simple method for identifying proteins is to filter the peptide matches using empirically derived thresholds and then sort by parent protein. Variants of this approach do not attempt to estimate false identification rates, nor do they deal in a useful way with "degenerate" peptides whose sequence matches more than one protein in the database (15–17). More sophisticated methods use statistical models to calculate protein expression probabilities (12, 18–20) or ranking scores (21) based on the number and quality of peptide sequence matches and other criteria, such as protein length, the size of the database, and the degeneracy of the peptide sequence identifications. These models can be empirically validated by searching against databases containing random "distractors" such as reversed protein sequences (12, 21–23).
A major problem is that many protein identifications have low reproducibility (24): in general, only abundant proteins are reliably identified (25, 26). Sensitivity is improved by performing multiple technical replicates (27), perhaps using different instrumentation and technical procedures (10, 25). This creates a demand for analytical methods that can integrate data from multiple experiments, which will likely increase as common data standards emerge that facilitate the exchange of proteomic datasets from diverse sources (Footnotes 1 and 2).
We present a statistical model that implements consensus scoring of peptide identifications from multiple search algorithms and combines information from independent replicate experiments in a single analysis while allowing sensitivity to be balanced against false identification rate. The model assumes that every peptide sequence that could theoretically result from enzymatic digestion of a protein in the database has a chance of being identified in the search results, whether correctly or incorrectly. The probabilities of correct identification are combined across multiple peptide searches using a function that returns the maximum probability from consensus identifications and penalizes non-consensus identifications. Both correct and incorrect peptide sequence identifications are assumed to occur at random in this "space" of peptides at rates that are governed by model parameters including protein length, estimated protein abundance, the size of the search database, and the number of peptide sequence identifications in the dataset. For each protein in the database, a likelihood ratio is calculated for the possibility that the peptide identifications whose sequence matches the protein are all incorrect. These likelihood ratios are used to estimate the expression probabilities from which updated parameter estimates are obtained. The procedure is iterated until convergence at the maximum likelihood estimates. Replicates are integrated by simultaneously estimating multiple sets of model parameters. Degenerate peptides whose sequence matches multiple proteins are treated using "Occam's razor," a principle by which the smallest set of probable proteins is chosen that is sufficient to explain the peptide sequence identifications (19). An outline of the statistical model used and its implementation in the program Empirical Bayes Protein identifier (EBP; see Footnote 3) is described under "Experimental Procedures."
The software is available upon request for download as an extension to the Windows release of the Institute of Systems Biology Trans-Proteomic Pipeline.
We applied the model to a set of three biological replicates of zebrafish proteins, searched by both SEQUEST and MASCOT algorithms. The zebrafish has emerged as an informative model organism for studies of vertebrate biology (41), genetics (28), and pharmacology (29, 30). Its genome has been recently sequenced, and an acceptably complete and non-redundant protein sequence database is available (35). The aim was to identify as many reliably expressed proteins as possible in larvae 5 days postfertilization, a developmental stage at which most organ systems are functional. This would allow the selection of candidate proteins for targeted quantitation (31) in mutagenesis (28) and chemical screens (32).
| EXPERIMENTAL PROCEDURES |
|---|
Two-dimensional Chromatography and Mass Spectrometry
Protein digestion, off-line fractionation, and LC-MS/MS were performed as described previously (25). SCX separation of the peptides was carried out on a PolySulfoethyl A column (100 × 4.6 mm, 5 µm, 300 Å; PolyLC, Columbia, MD) at a flow rate of 0.2 ml/min. The sample was loaded for 10 min with mobile phase A (10 mM ammonium formate, 25% acetonitrile, pH 3), and a gradient was run to 100% mobile phase B (500 mM ammonium formate, 25% acetonitrile, pH 6.8) to elute the peptides over 80 min. Forty fractions were collected for reversed phase separation on line to a Thermo Finnigan LTQ ion trap mass spectrometer (Thermo Electron, San Jose, CA) using ESI. Each fraction was injected onto a Vydac C18 (Everest, 150 × 1 mm, 300 Å, 5 µm; Bodmann, Aston, PA) column with 0.1% formic acid, 0.01% trichloroacetic acid in water for 10 min before a gradient (30 µl/min) was run over 120 min from 3 to 70% mobile phase B (0.1% formic acid, 0.01% trichloroacetic acid in acetonitrile). The mass spectrometer was operated in a data-dependent MS/MS mode (m/z 300–2000) in which the top seven ions were subjected to fragmentation. Dynamic mass exclusion was enabled with a repeat count of 2 every 45 s for a list size of 250.
Data Processing
Raw mass spectra were converted to DTA peak lists using BioWorks Browser 3.2 (Thermo Finnigan, San Jose, CA) with the following parameter settings: peptide mass range, 300–5000 Da; threshold, 10; precursor mass, ±1.4 Da; group scan, 1; minimum group count, 1; minimum ion count, 15. Searches were conducted using SEQUEST and MASCOT against a database containing both forward and reverse sequences of proteins contained in the International Protein Index (IPI) Danio rerio protein sequence database version 3.07 (35). It was specified that peptides should have a maximum of two internal cleavage sites with methionine oxidation and cysteine carbamidomethylation as possible modifications. SEQUEST searches specified that peptides should possess at least one tryptic terminus and used a peptide mass tolerance of ±1.4 Da and a fragment ion tolerance of 0. MASCOT searches specified tryptic digestion and used a peptide mass tolerance of ±1.5 Da and a fragment ion tolerance of ±0.1 Da. The search results were converted into pepXML format (42). Peptide identification probabilities for both SEQUEST and MASCOT searches were calculated by executing PeptideProphet (7). SEQUEST results were processed using the "-Ol" tag, which uses ΔCn* values unchanged. Peptide sequence identifications with probabilities greater than 0.05 were exported for the protein probability analyses. EBP and ProteinProphet (19) analyses were run using SEQUEST and MASCOT results for each sample both separately and together. EBP analyses were run using the default settings except for the calculation of number of trypsin digests per protein, which specified peptides with at least one tryptic terminus for the SEQUEST analyses and two tryptic termini for the analyses of MASCOT and combined SEQUEST plus MASCOT data. Combined EBP analyses were conducted in which the results for all three replicates were analyzed together with protein probabilities calculated for the hypothesis that expression was evident in at least one of the three samples.
Statistical Model
Assumptions
Peptide Probabilities
Table I lists and describes the quantities used or estimated as part of the model. We begin with a set of MS/MS spectra to which possible peptide sequence identifications have been assigned using one or more search algorithms. Each peptide sequence identification has an associated probability ranging from 0 for random matches to 1 for peptide sequences that can be identified with absolute certainty. The first step in our procedure summarizes these probabilities. The combined probability $p_{is}$ for peptide identification $i$ of spectrum $s$ is computed as the proportion of algorithms (e.g. MASCOT and SEQUEST) that identify peptide sequence $i$ multiplied by the maximum probability assigned to that identification. Repeated identifications of the same peptide sequence in multiple spectra of the same search are not treated as independent events but as repeated fragmentations of the same peptide resulting in spectra of varying quality. A conservative strategy is adopted by which only the highest probability identification is retained for each peptide. The resulting dataset comprises $n$ peptide identifications, each of which is either correct with peptide probability $p_i$ or is a random match with probability $1 - p_i$. We assume that the accuracy of each peptide identification is an independent random event. For ease of computation, the peptide probabilities may be bounded at some low threshold by ignoring, say, all $p_i \le 0.05$.
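The summarizing step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the EBP source code: the function name, the tuple layout of `search_results`, and the `min_prob` argument are assumptions introduced for the example.

```python
from collections import defaultdict

def combine_peptide_probabilities(search_results, n_algorithms, min_prob=0.05):
    """Hypothetical helper: consensus peptide probabilities p_i.

    search_results: iterable of (algorithm, spectrum_id, peptide, probability).
    """
    # For every (spectrum, peptide) pair, record the best probability
    # reported by each algorithm that made that identification.
    per_pair = defaultdict(dict)
    for algorithm, spectrum, peptide, prob in search_results:
        by_algo = per_pair[(spectrum, peptide)]
        by_algo[algorithm] = max(prob, by_algo.get(algorithm, 0.0))

    # p_is = (proportion of algorithms agreeing) * (maximum probability).
    # Only the highest p_is over all spectra is retained for each peptide.
    best = {}
    for (spectrum, peptide), by_algo in per_pair.items():
        p_is = len(by_algo) / n_algorithms * max(by_algo.values())
        best[peptide] = max(p_is, best.get(peptide, 0.0))

    # Ignore identifications at or below the low threshold.
    return {pep: p for pep, p in best.items() if p > min_prob}
```

With two algorithms, an identification reported by only one of them can never exceed a combined probability of 0.5, which is how non-consensus identifications are penalized.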
The search database $\Omega$ contains $N$ proteins, each of which may be expressed in our sample. We know that protein $j \in \Omega$ when subjected to enzymatic digestion theoretically results in a set of proteolytic peptides, $D_j$, which may include peptides with missed cleavages or non-enzymatic termini and those with specified modifications. A function of the size of this set, $y_j = \exp[\ldots]$, is a useful measure of protein length: assuming a log-normal null distribution of search scores (8), the probability that the highest scoring random match is to one of the $|D_j|$ theoretical digests from protein $j$ is proportional to $\exp[\ldots]$. Let $M_j$ be the set of $z_j$ peptide identifications whose sequence matches protein $j$, i.e. those peptide identifications that are among the theoretical digest products of protein $j$: $M_j = \{p_i : i \in D_j;\ i = 1, \ldots, n\}$.
Degeneracy
Our treatment of degenerate peptide identifications, whose sequence matches more than one protein in the database, follows that of ProteinProphet (19). In some instances, a group of proteins exclusively share a set of peptide sequence matches. These proteins are compiled as a "degenerate protein group," and a single expression probability is estimated for the group of proteins because any or all of them may be present in the sample, and they cannot be distinguished. Whenever a peptide matches more than one protein but is not part of such a degenerate cluster (in other words, it is part of an overlapping but not identical subset of matches), it is assumed that the protein matches are the result of either one true match, the remainder being matches due to homology with the expressed protein, or random matches due to error in matching the mass spectrum to the peptide sequence. The choice of which protein match is the correct one is treated as a multinomial event with the class probabilities given by a set of "weights" $w_{ij}$, with $\sum_{j:\, i \in D_j} w_{ij} = 1$, whose values depend on the relative expression probabilities of the proteins $j$ to which the peptide matches. The probability that peptide $i$ is a true match to protein $j$ is given by $w_{ij}\, p_i$, the probability of a homology match is given by $(1 - w_{ij})\, p_i$, and the probability of a random match is $1 - p_i$. The initial estimates for the weights are chosen to be equal for all proteins that match peptide $i$.
Protein Abundance
Highly abundant proteins are likely to accumulate many more peptide matches than proteins of low abundance. Protein abundance has been shown to be strongly correlated with the number of spectra whose peptide identifications match the protein sequence, at least for MS/MS using non-data-dependent acquisition (13, 26). Let us call this quantity $v_j$, estimated by the total of weighted peptide identification probabilities for all spectra matching peptides in $j$. Proteins are assigned to ordinal abundance categories $a_j$ by binning these $v_j$ at preset thresholds. Most proteins have no peptide matches and are assigned to the lowest abundance category.
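As a small illustration of the binning described above, the following Python sketch computes an abundance score from weighted spectrum-level probabilities and assigns an ordinal category. The function names and the input layout are assumptions made for the example; the single default threshold of 10 mirrors the default EBP setting reported later under "EBP Implementation of the Algorithm."

```python
from bisect import bisect_right

def abundance_score(spectrum_matches):
    """Hypothetical helper: v_j as the sum of weight * probability.

    spectrum_matches: iterable of (weight, probability) pairs, one per
    spectrum whose peptide identification matches protein j.
    """
    return sum(w * p for w, p in spectrum_matches)

def abundance_category(v, thresholds=(10.0,)):
    """Ordinal category: number of sorted thresholds that v reaches or exceeds."""
    return bisect_right(thresholds, v)
```

A protein with no matching spectra has a score of 0 and therefore falls in the lowest category, as stated in the text.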
Parameters
Protein expression probabilities are calculated conditionally on a set of parameters that summarize the data and govern the rate of accumulation of true and random peptide matches to proteins. For proteins in abundance class $a$, the model parameters are $\theta_a = (N_a, Y_a, n_a, \pi_a, \psi_a, \lambda_a)$. $N_a$ is the number of proteins of abundance $a$, and $Y_a$ is their total length so that the entire protein space $\Omega$ contains $N = \sum_a N_a$ proteins with a total length of $\sum_a Y_a$. $n_a$ is the total number of peptide matches to proteins of abundance $a$. Note that because of degeneracy $\sum_a n_a \ge n$. $\pi_a$ is the proportion of proteins of abundance $a$ that are also in $H$, and $\psi_a$ is their total length. Thus, $H$ contains $\sum_a N_a \pi_a$ proteins with a total length of $\sum_a \psi_a$. Finally $\lambda_a$ is the number of these peptide identifications that are correct.
Probability of Membership of H
The condition $H_j = \{j \in H\}$ is true if at least one of the peptide identifications in $M_j$ is a correct match. We assume that correct peptide identifications are independently and randomly distributed among the proteins in $H$ at a rate that is proportional to the effective length of the protein so that the number of correct identifications matching a protein $j \in H$ is Poisson distributed with parameter $\lambda_{a_j} \cdot y_j / \psi_{a_j}$, i.e. the number of correct identifications in the abundance class of $j$ scaled by the share of protein $j$ in the total length of the class members that belong to $H$. We also assume that the $n_{a_j} - \lambda_{a_j}$ incorrect peptide identifications are independently randomly distributed among the proteins so that the number of incorrect identifications matching a random protein $j \in \Omega$ is Poisson distributed with parameter $(n_{a_j} - \lambda_{a_j}) \cdot y_j / Y_{a_j}$, where $Y_{a_j}$ is the total length of all proteins in the class. Using Bayes' theorem, the estimated odds of membership of $H$ given the data equal the prior odds $\hat{\pi}_{a_j}/(1 - \hat{\pi}_{a_j})$ multiplied by the likelihood ratio of membership of $H$ given the peptide matches $M_j$ and the expected parameter values $\hat{\theta}_{a_j}$. (The "hat" notation (^) indicates an estimated value.)
Estimation of the Protein Expression Probabilities
The condition $T_j = \{j \in T\}$ requires that at least one of the peptide identifications matching $j$ is correct (i.e. $j \in H$) and that this is so because the digest product is truly expressed in the sample. The estimated weights $\hat{w}_{ij}$ together with $M_j$ and $\hat{\theta}_{a_j}$ are used to calculate the probabilities of membership of $T$ conditional on membership of $H$, and hence the probabilities of membership of $T$, which are the true protein expression probabilities.
Estimation Using the "Expectation-Maximization" (EM) Algorithm
Maximum likelihood values for the parameters are calculated using the EM algorithm. The posterior probabilities of homology set membership $h_j = \Pr(H_j \mid \theta, M_j)$ are initialized using $\hat{h}_j = \hat{\pi}_{a_j}[0]$ where $\hat{\pi}_{a_j}[0]$ is an initial parameter value. (The "posterior" probability means the probability conditional on the data, $M_j$.) At the "Expectation" step, we calculate the expected values of the parameters $\hat{\theta}$ given the estimated probabilities $\hat{h}_j$. For the "Maximization" step, we compute the probabilities of membership of $H$ and $T$ given the matching peptide identifications $M_j$ and the expected parameter values. The peptide weights are then updated according to a function that assigns the highest weight to the protein or proteins with the greatest probability of expression. When iterated, this procedure converges to a maximum function that downweights all but the most likely proteins, thereby restricting the results set to a minimal set of proteins necessary to explain the peptide identifications. The updated weights are used to update $\hat{v}_j$ and hence estimate the abundance categories $\hat{a}_j$. The effect of the Maximization step is to successively maximize the profile likelihoods $\mathcal{L}(H \mid M, \hat{\theta}, \hat{a})$, $\mathcal{L}(T \mid \hat{H}, M, \hat{\theta}, \hat{a}, \hat{w})$, $\mathcal{L}(w \mid \hat{T}, M, \hat{a})$, and $\mathcal{L}(a \mid \hat{w}, \hat{v})$ under the simplifying assumption that all $H_j$ and $T_j$ are conditionally independent given $M$, $\hat{\theta}$, $\hat{a}$, and $\hat{w}$. The effect of the Expectation step is to maximize the profile likelihood $\mathcal{L}(\theta \mid \hat{H})$. The Expectation and Maximization steps are repeated until convergence at the maximum likelihood estimates of $\theta$, $H$, $T$, $a$, and $w$ given $M$.
Expectation Step
The Expectation step proceeds as follows to calculate the expected values of the parameters $\hat{\theta}_a$ for each abundance category $a$. $\hat{N}_a$ and $\hat{n}_a$ are calculated, respectively, as the number of proteins of abundance $a$ and the number of peptide identifications matching these proteins. $\hat{Y}_a$ is the total length of these proteins. $\hat{\pi}_a$ is calculated as the expected value of an unweighted sum of Bernoulli variables $H_j$ divided by $\hat{N}_a$. $\hat{\psi}_a$ is calculated as the expected sum of the Bernoulli variables $H_j$ weighted by the protein "lengths." Finally $\hat{\lambda}_a$ is computed as the expected sum of Bernoulli variables $p_i$ totaled over all the peptide matches to each protein.
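The parameter updates of the Expectation step amount to per-category sums, which the following Python sketch makes explicit. It is a minimal illustration under stated assumptions, not the EBP implementation: the `Protein` record, its field names, and the dictionary keys (`N`, `Y`, `n`, `pi`, `psi`, `lam`) are hypothetical names chosen for the six quantities listed above.

```python
from dataclasses import dataclass, field

@dataclass
class Protein:
    abundance: int                 # abundance category a_j
    length: float                  # effective length y_j
    h: float                       # current estimate of Pr(H_j)
    peptide_probs: list = field(default_factory=list)  # p_i for i in M_j

def expectation_step(proteins):
    """Return expected parameter values for each abundance category."""
    params = {}
    for prot in proteins:
        th = params.setdefault(
            prot.abundance,
            {"N": 0, "Y": 0.0, "n": 0, "pi": 0.0, "psi": 0.0, "lam": 0.0},
        )
        th["N"] += 1                           # number of proteins
        th["Y"] += prot.length                 # total length
        th["n"] += len(prot.peptide_probs)     # peptide matches
        th["pi"] += prot.h                     # expected count in H (normalized below)
        th["psi"] += prot.h * prot.length      # expected total length of H members
        th["lam"] += sum(prot.peptide_probs)   # expected number of correct matches
    for th in params.values():
        th["pi"] /= th["N"]                    # proportion of the class in H
    return params
```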
Maximization Step
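The first part of the Maximization step consists in the calculation of maximum likelihood values for the probabilities $h_j$ given the data and the expected values of the parameters. According to Bayes' theorem, the posterior odds that $j \in H$ are given by the prior odds $\hat{\pi}_a/(1 - \hat{\pi}_a)$ multiplied by the Bayes factor. The Bayes factor equals the likelihood ratio of $h_j$ given the estimated parameters $\hat{\theta}_{a_j}$ and the peptide matches to protein $j$, $M_j$. (Henceforward the $a_j$ suffixes to the parameter estimates are omitted.) Algebraically the posterior odds that $j \in H$ are given by Equation 1.

$$\frac{\Pr(H_j \mid \hat{\theta}, M_j)}{\Pr(\neg H_j \mid \hat{\theta}, M_j)} = \frac{\hat{\pi}}{1 - \hat{\pi}} \cdot \frac{\mathcal{L}(H_j \mid \hat{\theta}, M_j)}{\mathcal{L}(\neg H_j \mid \hat{\theta}, M_j)} \tag{1}$$

(The symbol ¬ indicates logical negation, equivalent to the Boolean operator NOT.) The likelihood that $j \notin H$ is the probability that all $z_j$ peptide matches in $M_j$ have arisen by chance,

$$\mathcal{L}(\neg H_j \mid \hat{\theta}, M_j) = \mathrm{Poisson}\!\left(z_j \,\middle|\, \frac{(\hat{n} - \hat{\lambda})\, y_j}{\hat{Y}}\right) \prod_{i \in M_j} (1 - p_i) \tag{2}$$

where $0 < p_i \le 1$ and $\mathrm{Poisson}(n \mid \mu) = e^{-\mu} \mu^n / n!$. The likelihood that $j \in H$ equals the sum of the probabilities that there are $s$ correct peptide identifications and $z_j - s$ random matches in $M_j$, $0 \le s \le z_j$,

$$\mathcal{L}(H_j \mid \hat{\theta}, M_j) = \sum_{s=0}^{z_j} \mathrm{Poisson}\!\left(s \,\middle|\, \frac{\hat{\lambda}\, y_j}{\hat{\psi}}\right) \mathrm{Poisson}\!\left(z_j - s \,\middle|\, \frac{(\hat{n} - \hat{\lambda})\, y_j}{\hat{Y}}\right) c_H(s) \tag{3}$$

where $0 < p_i \le 1$ and $c_H(s) = \prod_{i \in M_j} (1 - p_i) \cdot \sum_{S \subseteq M_j,\, |S| = s}\ \prod_{i \in S} \mathrm{odds}\{p_i\}$, with $\mathrm{odds}\{p\} = p/(1 - p)$, summing over all possible $S$, subsets of $M_j$ with $s$ members. This likelihood is undefined if $p_i = 1$ for any $i$, indicating that $H_j$ is true with probability 1. Note also that there remains the possibility that $H_j$ is true, if all the peptide matches are random ($s = 0$) or if there are no peptides matching it at all ($z_j = 0$). This corresponds to the small but non-zero probability that $j$ is a protein that is truly expressed in the sample but to which no identified peptides match. Knowing the quantities from Equations 2 and 3 we can now estimate the probabilities $h_j$ by applying Bayes' theorem using Equation 1.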
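The calculation of the probability of membership of H can be sketched in Python as follows. This is a hedged illustration rather than the EBP code: it assumes the Poisson rates described under "Probability of Membership of H," and all function and argument names (`pi`, `psi`, `lam`, `n`, `Y` for the per-category parameters) are hypothetical. The sum over all subsets S with s members is an elementary symmetric polynomial of the peptide odds, so it can be accumulated in quadratic time instead of enumerating subsets.

```python
import math

def poisson_pmf(n, mu):
    """Poisson(n | mu) = exp(-mu) * mu**n / n!"""
    if mu == 0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))

def elementary_symmetric(values):
    """e[s] = sum over all s-member subsets S of prod_{i in S} values[i]."""
    e = [1.0] + [0.0] * len(values)
    for k, v in enumerate(values, start=1):
        for s in range(k, 0, -1):   # descending, so each value is used once
            e[s] += v * e[s - 1]
    return e

def posterior_h(peptide_probs, y, pi, psi, lam, n, Y):
    """Posterior probability that protein j belongs to H."""
    if any(p >= 1.0 for p in peptide_probs):
        return 1.0                  # a certain identification implies H_j
    z = len(peptide_probs)
    all_random = math.prod(1.0 - p for p in peptide_probs)
    e = elementary_symmetric([p / (1.0 - p) for p in peptide_probs])
    rate_true = lam * y / psi       # correct matches to a protein in H
    rate_rand = (n - lam) * y / Y   # random matches to any protein
    lik_not_h = poisson_pmf(z, rate_rand) * all_random
    lik_h = sum(
        poisson_pmf(s, rate_true) * poisson_pmf(z - s, rate_rand) * all_random * e[s]
        for s in range(z + 1)
    )
    # Bayes' theorem, written as a probability rather than as odds.
    numerator = pi * lik_h
    return numerator / (numerator + (1.0 - pi) * lik_not_h)
```

Note that a protein with no peptide matches still receives a non-zero posterior probability, in line with the remark above about expressed proteins to which no identified peptides match.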
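The next part of the Maximization step estimates the probabilities of true protein expression. The posterior probability of $T_j$ conditional on $H_j$ is the probability that correct peptide identifications truly match protein $j$ rather than a homologous sequence in a different protein,

$$\Pr(T_j \mid H_j, \hat{\theta}, M_j, \hat{w}) = \frac{\sum_{s=0}^{z_j} \mathrm{Poisson}\!\left(s \,\middle|\, \hat{\lambda} y_j / \hat{\psi}\right) \mathrm{Poisson}\!\left(z_j - s \,\middle|\, (\hat{n} - \hat{\lambda}) y_j / \hat{Y}\right) c_T(s)}{\sum_{s=0}^{z_j} \mathrm{Poisson}\!\left(s \,\middle|\, \hat{\lambda} y_j / \hat{\psi}\right) \mathrm{Poisson}\!\left(z_j - s \,\middle|\, (\hat{n} - \hat{\lambda}) y_j / \hat{Y}\right) c_H(s)} \tag{4}$$

where $0 < p_i \le 1$ and $c_T(s) = \prod_{i \in M_j} (1 - p_i) \cdot \sum_{S \subseteq M_j,\, |S| = s} \left(1 - \prod_{i \in S} [1 - \hat{w}_{ij}]\right) \prod_{i \in S} \mathrm{odds}\{p_i\}$, summing over all possible $S$, subsets of $M_j$ with $s$ members. Estimates of the true protein expression probabilities can be derived using Equation 5.

$$\hat{t}_j = \Pr(T_j \mid \hat{\theta}, M_j, \hat{w}) = \Pr(T_j \mid H_j, \hat{\theta}, M_j, \hat{w}) \cdot \hat{h}_j \tag{5}$$

The estimated protein expression probabilities $\hat{t}_j$ are used to update the peptide weights according to a function that assigns the highest weight to the protein or proteins with the greatest probability of expression. These updated weights are used to calculate updated estimates $\hat{v}_j$ and update the abundance classifications $\hat{a}_j$.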
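The conditional probability of true expression can be sketched along the same lines. The Python below is an assumption-laden illustration, not the EBP code: it takes the probability of T given H to be the ratio of two sums over s, one using the subset term for true matches and one using the subset term for membership of H, and all names are hypothetical. It relies on the identity that the sum over subsets of (1 − ∏[1 − w]) ∏ odds equals the elementary symmetric polynomial of the odds minus that of the odds multiplied by (1 − w). The common factor ∏(1 − p) cancels in the ratio, so all peptide probabilities must be below 1.

```python
import math

def _poisson_pmf(n, mu):
    if mu == 0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))

def _elementary_symmetric(values):
    e = [1.0] + [0.0] * len(values)
    for k, v in enumerate(values, start=1):
        for s in range(k, 0, -1):
            e[s] += v * e[s - 1]
    return e

def prob_t_given_h(peptide_probs, weights, y, psi, lam, n, Y):
    """Pr(T_j | H_j) for peptide probabilities p_i < 1 and weights w_ij."""
    odds = [p / (1.0 - p) for p in peptide_probs]
    e_all = _elementary_symmetric(odds)
    # Subsets in which every correct peptide is only a homology match.
    e_homology = _elementary_symmetric(
        [o * (1.0 - w) for o, w in zip(odds, weights)]
    )
    z = len(odds)
    rate_true = lam * y / psi
    rate_rand = (n - lam) * y / Y
    numerator = denominator = 0.0
    for s in range(z + 1):
        joint = _poisson_pmf(s, rate_true) * _poisson_pmf(z - s, rate_rand)
        numerator += joint * (e_all[s] - e_homology[s])
        denominator += joint * e_all[s]
    return numerator / denominator

def expression_probability(h, t_given_h):
    """Pr(T_j) = Pr(T_j | H_j) * Pr(H_j)."""
    return h * t_given_h
```

When every weight is 0 the protein can only be explained by homology and the conditional probability is 0; raising the weights raises the probability.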
As an optional refinement to the model, the peptide probabilities $p_i$ can be adjusted for sequence homology. Assuming that degeneracy is independent of all the other parameters used in calculating the probabilities that the database search results are correct, the odds that peptide identification $i$ is correct given that its sequence matches $k_i$ proteins in the search database are given by Equation 6.

$$\mathrm{odds}\{p_i \mid k_i\} = \frac{p_i}{1 - p_i} \cdot \frac{\Pr(k_i \mid i \text{ correct})}{\Pr(k_i \mid i \text{ incorrect})} \tag{6}$$

This process can be incorporated into the EM loop, updating the peptide probabilities to their posterior values at each iteration until convergence.
Extension to Multiple Replicate Experiments
The model allows integrated analysis of data from $R$ independent replicate experiments. Separate sets of parameters, probabilities of membership of $H$, and probabilities of membership of $T$ conditional on membership of $H$ are estimated for each replicate. The overall protein expression probabilities are calculated using a combinatorial function of these probabilities: typically evidence of true expression is required from at least $x$ replicates where $1 \le x \le R$. The peptide weights are calculated as done previously using these overall expression probabilities.
EBP Implementation of the Algorithm
A program was developed to implement the algorithm using a slight refinement of the statistical model motivated by empirical Bayesian concepts, giving rise to the name: the Empirical Bayes Protein identifier. This modification uses binomial theory to estimate maximum likelihood hyperparameters for $\lambda$, assuming an underlying Gamma distribution; Equations 2, 3, and 4 are then calculated by integrating over $\lambda$. This procedure helps stabilize estimates of the parameter set $\theta$, avoiding a local maximum in the likelihood at $\lambda = 0$. Computational adaptations were used to optimize the speed of the algorithm for proteins with many peptide matches.
The default settings exclude peptide identifications with probabilities less than 0.5 for the purposes of calculating protein identification. When analyzing the combined results of two search algorithms, this is equivalent to including only those spectra that are matched to the same peptide by both algorithms. It is advantageous to exclude low probability peptide matches as far as possible because the likelihood function is dominated by the number of peptide hits to each protein. The default settings also specify two abundance categories, with a high threshold of $v_j \ge 10$ defining a category of abundant proteins, for the analysis of complex proteomes. These default values were chosen to be prudent by removing error associated with low probability peptide identifications and conservative in defining high abundance proteins.
To facilitate data integration, public data repositories need an exchangeable format and common data standards. We plan to migrate the input and output formats of EBP to the emerging analysisXML format from the PSI Mass Spectrometry Working Group (Footnotes 1 and 2) once this platform becomes stable. Currently EBP uses pepXML for input and an exchangeable output format, ebpXML, that closely resembles protXML. Both pepXML and protXML are open source data formats developed at the Institute of Systems Biology, Seattle, WA (42).
Analyses of a Test Protein Mixture
Mass spectra from the sample dataset, derived from electrospray LC-MS/MS of a mixture of 18 non-human proteins, were searched using SEQUEST and MASCOT against a sequence database containing human peptides plus the 18 non-human proteins and likely contaminants (36). The search results were preprocessed using PeptideProphet (7) to generate two sets of peptide sequence identifications and probabilities. Protein identifications were derived by applying the EBP and ProteinProphet (19) algorithms to the PeptideProphet outputs using the default settings. For the purposes of this study, an alteration was made to the PeptideProphet output to improve the comparability of the two algorithms. Specifically the lists of possible protein matches were merged for peptide sequences with indistinguishable mass (i.e. those with identical sequence except for leucine-isoleucine substitutions). This procedure is performed by default in the EBP software but is not performed by ProteinProphet. Proteins whose estimated expression probabilities met cutoffs of $p \ge 0.9$ and $p \ge 0.7$ were reported. Sensitivity was calculated as the proportion of the 18 sample proteins identified. The empirical error rate was calculated as the proportion of proteins identified that were neither in the sample mixture nor known contaminants. The estimated error rate was calculated as the difference between the average expression probability and 1 for proteins passing the threshold.
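The three summary statistics defined above can be computed as in the following Python sketch. The function name and the data layout (a dictionary of protein probabilities plus two sets of protein names) are assumptions made for illustration.

```python
def evaluate_identifications(protein_probs, sample_proteins, contaminants, cutoff):
    """Return (sensitivity, empirical error rate, estimated error rate).

    protein_probs: dict mapping protein name to estimated expression probability.
    sample_proteins: set of proteins known to be in the test mixture.
    contaminants: set of known contaminant proteins.
    """
    passed = {name: p for name, p in protein_probs.items() if p >= cutoff}
    found = {name for name in passed if name in sample_proteins}
    sensitivity = len(found) / len(sample_proteins)
    if not passed:
        return sensitivity, 0.0, 0.0
    # Errors: identified proteins that are neither in the mixture nor contaminants.
    errors = [
        name for name in passed
        if name not in sample_proteins and name not in contaminants
    ]
    empirical_error = len(errors) / len(passed)
    # Estimated error: 1 minus the mean probability of the reported proteins.
    estimated_error = 1.0 - sum(passed.values()) / len(passed)
    return sensitivity, empirical_error, estimated_error
```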
Estimation and Empirical Validation of the False Positive Rate for Analyses of Complex Protein Samples
Electrospray ionization MS/MS spectra from three independent zebrafish protein samples, separated by two-dimensional LC, were searched against a concatenated database of forward and reverse zebrafish protein sequences using both SEQUEST and MASCOT. The results were preprocessed (8) to generate two sets of peptide sequence identifications and probabilities for each sample. The error rate was validated using the assumption that reversed sequence identifications are all false positives and that false positive identifications occur with equal probability to forward and reversed sequences. Hence, if the model identifies $f$ forward sequence and $r$ reversed sequence proteins at a given probability, the total number of false positives is empirically estimated as $2r$, and the error rate of all identified proteins (i.e. both forward and reversed sequence proteins) is given by $x = 2r/(r + f)$. The rate of false identifications to forward sequence proteins, an empirical estimate of the false positive error rate, is given by $r/f = x/(2 - x)$. Accordingly the estimated probabilities for forward sequence identifications were adjusted by the function $x \in (0, 1): x \mapsto 2x/(x + 1)$, the inverse of $x \in (0, 1): x \mapsto x/(2 - x)$, to obtain the true expression probability estimates. This adjustment is calculated automatically in the software implementation of EBP.
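The arithmetic of the reversed-sequence validation can be checked with a short Python sketch. The function names are hypothetical; the formulas are those given in the paragraph above.

```python
def decoy_error_rates(f, r):
    """Empirical error rates from f forward and r reversed identifications.

    Returns (x, forward_rate): x = 2r / (r + f) is the error rate among all
    identified proteins, and forward_rate = r / f is the error rate among
    forward sequence proteins only.
    """
    return 2 * r / (r + f), r / f

def forward_rate_from_total(x):
    """Map the total error rate x to the forward-only rate x / (2 - x)."""
    return x / (2 - x)

def total_rate_from_forward(x):
    """Inverse mapping, 2x / (x + 1)."""
    return 2 * x / (x + 1)
```

For example, with 95 forward and 5 reversed identifications the total error rate is 0.1 and the forward-only rate is 5/95, which matches `forward_rate_from_total(0.1)`.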
| RESULTS |
|---|
0.9) and no errors.
0.9 (Fig. 2, a–d). This discrepancy is likely to be caused at least in part by the different probability models applied to MASCOT and SEQUEST data during preprocessing (7). The number of false positives is estimated by the number of reversed sequence identifications, given in parentheses. No such false positives were identified in the overlap between SEQUEST and MASCOT results in any individual sample. In contrast, protein identifications from SEQUEST and MASCOT that were not confirmed in the other analysis were error-prone. More proteins were identified in each individual sample and in the combined analysis of all samples by integrating SEQUEST and MASCOT results than by calculating the overlap between separate SEQUEST and MASCOT analyses.
The omnibus analysis of MASCOT and SEQUEST search results for all three replicates (Fig. 1b) identified 797 forward sequence proteins with an estimated probability ≥ 0.82 at an empirical error rate of 0.8% compared with an estimated error rate of 1%. For the 874 proteins with an estimated probability ≥ 0.22, the empirical error rate was 2.5% compared with an estimated error rate of just under 5%. An interactive list of all proteins identified with probability ≥ 0.1 together with the supporting peptide identifications is available upon request.
Comparison with ProteinProphet
The replicated zebrafish datasets were submitted to both EBP and ProteinProphet to compare the protein sets identified by the two algorithms. Whereas EBP estimated error rates conservatively for both SEQUEST and MASCOT search results, ProteinProphet estimated rates of error accurately only for the MASCOT datasets (Table III, column a). ProteinProphet analysis of the SEQUEST search results identified excessive numbers of false positives (Table III, column b). ProteinProphet also identified too many false positives in the combined analysis of SEQUEST and MASCOT results: the protein list with an estimated error rate of 1% had an empirical error rate of 6%, and the protein list with an estimated error rate of 5% had an empirical error rate greater than 15% (Table III, column c). ProteinProphet achieved greatest sensitivity and accurate rates of error when analyzing SEQUEST and MASCOT search sets using the data integration method implemented in EBP (Table III, column d). In contrast to ProteinProphet, EBP did not identify excessive numbers of false positives for the SEQUEST and MASCOT search sets either separately or in combination.
| DISCUSSION |
|---|
We performed a self-validating analysis on these complex replicate samples by searching against a database containing reversed sequences to obtain both empirical and model-based estimates of false positive error. This validation method suffers the limitation that many of the characteristics of the forward sequences are retained in the distractors, potentially resulting in an excessively conservative analysis (37). For example, spectra for which a correct match does not exist in the original search space, but that match with high probability to reversed sequences, may cause spurious false positives. Care should be taken to ensure a sufficiently complete search with regard to the specified proteins and possible peptide modifications, mutations, and nonspecific cleavages. One of the assumptions of the forward and reverse sequence database search validation method is that the search space includes all the peptides that are actually present in the sample. Specifying an inadequate search space may result in the search finding matches among the reversed sequences that are actually correct identifications rather than false positives due to random error. This violates the statistical model and may invalidate the results of the analysis.
Analysis of a standard protein mixture suggested that our model may achieve more accurate error estimates than the ProteinProphet algorithm (19) at the level of a single replicate. Moreover ProteinProphet is not designed to analyze multiple datasets and on the evidence of our analyses tends to accumulate false positive protein identifications as more data are added. Indeed in the asymptotic case every protein in the dataset would be identified with probability 1 (20). In contrast, EBP explicitly models biological replicates allowing flexible hypothesis testing. The EBP approach to integrating multiple search results is applicable independently of the statistical model and was successfully applied to ProteinProphet.
Consistent with other studies (10–12), consensus scoring from multiple searches improved sensitivity and accuracy. Sensitivity was increased further by combining the results of biological replicates in a single analysis. Future investigations may adapt this statistical approach to datasets acquired using a variety of technical methods, for example in application to large multisite projects (38, 39), and to support methods of protein quantitation (40).
| FOOTNOTES |
|---|
Published, MCP Papers in Press, December 12, 2006, DOI 10.1074/mcp.T600049-MCP200
1 A. R. Jones, M. Miller, R. Aebersold, R. Apweiler, C. A. Ball, A. Brazma, J. DeGreef, N. Hardy, H. Hermjakob, S. J. Hubbard, P. Hussey, M. Igra, H. Jenkins, R. K. Julian, Jr., K. Laursen, S. G. Oliver, N. W. Paton, S.-A. Sansone, U. Sarkans, C. J. Stoeckert, Jr., C. F. Taylor, P. L. Whetzel, J. A. White, P. Spellman, and A. Pizarro, manuscript in review.
2 C. F. Taylor, N. W. Paton, K. S. Lilley, P.-A. Binz, R. K. Julian, Jr., A. R. Jones, W. Zhu, R. Apweiler, R. Aebersold, E. W. Deutsch, M. Macht, M. Mann, T. A. Neubert, S. D. Patterson, S. L. Seymour, A. Tsugita, I. Xenarios, and H. Hermjakob, manuscript in review.
3 The abbreviations used are: EBP, Empirical Bayes Protein identifier; SCX, strong cation exchange; EM, Expectation-Maximization.
* This work was supported in part by National Institutes of Health Grants R01CA95586, P50 HL70128, P50 HL81012, MO1RR00040, HL 54500, HL 62250, and HL 70128; American Heart Association National Scientist Development Grant 0430148N (to T. G.), and the Cardiovascular Institute of Philadelphia (to T. G.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
¶ The A. N. Richards Professor of Pharmacology.
|| The Elmer Bobst Professor of Pharmacology.
** To whom correspondence should be addressed: The Inst. for Translational Medicine and Therapeutics, University of Pennsylvania, 809 BRB II/III, 421 Curie Blvd., Philadelphia, PA 19104. Tel.: 215-573-7600; Fax: 215-573-9004; E-mail: tilo{at}spirit.gcrc.upenn.edu
| REFERENCES |
|---|