From the a Department of Cellular and Molecular Pharmacology and b The California Institute for Quantitative Biomedical Research, University of California, San Francisco, California 94158, e Department of Physiological Chemistry, Division of Biomedical Genetics, University Medical Center Utrecht, 3500 Utrecht, The Netherlands, g McKusick-Nathans Institute of Genetic Medicine, School of Medicine, The Johns Hopkins University, Baltimore, Maryland 21205, h Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada, and k Howard Hughes Medical Institute, San Francisco, California 94158
| ABSTRACT |
|---|
Although a similar approach was used for protein purification and identification, the resulting datasets were subjected to different analytical methods to define PPIs and protein complexes. Gavin et al. (6) exploited a "socio-affinity" scoring system that measures the log-ratio of the number of times two proteins are observed together relative to what would be expected from their frequency in the dataset. Importantly this approach takes advantage of not only direct bait-prey connections but also indirect prey-prey relationships where two proteins are each identified as preys in a purification in which a third protein is used as bait. Krogan et al. (7), on the other hand, used a synthesis of machine learning techniques including Bayesian networks and C4.5-based and boosted stump decision trees to define confidence scores for potential interactions based on direct bait-prey observations. The two groups also used different clustering algorithms to define protein complexes from their PPI datasets. For example, Krogan et al. (7) used a Markov clustering algorithm (8) for definition of protein complexes, whereas Gavin et al. (6) utilized a different clustering approach to define complexes, each consisting of groups of proteins termed "core," "module," or "attachment". Modules were intended to represent subcomplexes that are components of several distinct complexes, and attachments were factors less stably associated with stable core complexes. Although both of these individual datasets are of high quality, it is not obvious how discrepancies between them should be resolved, and each still contains a substantial number of false positive interactions that can compromise the utility of these data for guiding more focused studies.
In this study, we merged these two datasets into a single reliable collection of experimentally based PPIs by analyzing the primary affinity purification data using a novel purification enrichment (PE) scoring system. Using a well defined reference set of manually curated PPIs, we demonstrated that our consolidated dataset is of greater accuracy than the individual sets and is comparable to PPIs defined using more conventional small scale methodologies. Although algorithms designed to detect multiprotein complexes can be highly effective for extracting additional information from noisy and incomplete datasets, attempting to strictly define protein complexes may not be the optimal way to analyze such a high confidence dataset. In particular, any partitioning analysis must either group together distinct complexes that share one or more subunits or fail to correctly identify all of the components of such complexes. Additionally weak interactions between proteins or protein complexes may be lost. In this work, we subjected the entire high confidence PPI dataset to a relatively unbiased hierarchical clustering from which one can more easily identify shared components of distinct complexes as well as weak associations between complexes. We argue that this representation provides a convenient tool for biologists to gather information about a protein of interest rapidly. Finally this depiction potentially mimics the in vivo environment: a continuum of weak associations between stable protein complexes.
| EXPERIMENTAL PROCEDURES |
|---|
PE scores were motivated by the probabilistic framework of a (naïve) Bayes classifier. In a Bayes classifier, an estimate of the probability of one hypothesis (here that an interaction is real) relative to the probability of a second hypothesis (here that the interaction is not real), given a set of observations, is calculated to determine which hypothesis is more likely. Both of these probabilities are calculated using Bayes theorem, and a discriminant function f is calculated as the log-ratio of these probabilities. An interaction is classified as real if f > 0 and false if f < 0 (9). The function f is defined as
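$$f = \log\frac{P(\text{true\_PPI} \mid \text{observations})}{P(\text{false\_PPI} \mid \text{observations})} = \log\frac{P(\text{observations} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{observations} \mid \text{false\_PPI})\,P(\text{false\_PPI})} \tag{1}$$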
where P(true_PPI) and P(false_PPI) represent prior expectations for the fraction of all protein pairs that do and do not interact physically. The above equation can be rewritten as follows.
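$$f = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \sum_{n}\log\frac{P(\text{observation}_n \mid \text{true\_PPI})}{P(\text{observation}_n \mid \text{false\_PPI})} \tag{2}$$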
Although the accuracy of a Bayes classifier will rely on an appropriate value for P(true_PPI) and the correct value is not obvious, an incorrect choice of this value will not affect the ordering of scores for putative interactions. We therefore computed PE scores as a sum of the evidence supporting or disaffirming each potential interaction over all relevant purifications in the dataset. For a particular observation, this evidence was computed as an estimate of the corresponding term in the above sum.
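$$\text{evidence} = \log\frac{P(\text{observation} \mid \text{true\_PPI})}{P(\text{observation} \mid \text{false\_PPI})} \tag{3}$$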
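The claim that the choice of prior cannot change the ordering of scores can be checked numerically. The sketch below is illustrative only: the function names and likelihood values are hypothetical, and base-10 logarithms are an assumption, since the base is not stated in the surviving text.

```python
import math

def log_likelihood_ratio(p_obs_given_true, p_obs_given_false):
    """Evidence from one observation (base-10 log is an assumption)."""
    return math.log10(p_obs_given_true / p_obs_given_false)

def evidence_sum(observations):
    """Sum of per-observation evidence; no prior term is included."""
    return sum(log_likelihood_ratio(t, f) for t, f in observations)

def discriminant(prior_true, observations):
    """Naive Bayes discriminant: log prior odds plus the evidence sum."""
    prior_term = math.log10(prior_true / (1.0 - prior_true))
    return prior_term + evidence_sum(observations)

# Hypothetical (P(obs | true), P(obs | false)) values for two candidate pairs.
candidates = {
    "A-B": [(0.6, 0.01), (0.5, 0.9)],
    "C-D": [(0.6, 0.2)],
}

def ranking(score):
    """Candidate names sorted from highest to lowest score."""
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)

rank_by_evidence = ranking(evidence_sum)
rank_prior_low = ranking(lambda obs: discriminant(0.001, obs))
rank_prior_high = ranking(lambda obs: discriminant(0.01, obs))
```

Because the prior contributes the same additive constant to every pair, all three rankings are identical; only the position of the f = 0 decision boundary moves.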
A PE score of 0 then indicates that no evidence for or against the validity of a particular interaction was collected (and in theory the probability that such an interaction is true should be equal to the prior estimate of P(true_PPI)). In particular, we considered two types of observations in the construction of PE scores: bait-prey observations when one of the proteins of interest was used as a bait and prey-prey observations when the two proteins of interest both appeared as preys in the purification of a third protein. As a result, similar to socio-affinity scores (6), PE scores can be written as a sum of direct bait-prey components (S) and an indirect prey-prey component (M). Thus, for a potential interaction between proteins i and j,
$$PE_{ij} = S_{ij} + S_{ji} + M_{ij} \tag{4}$$
where $S_{ij}$ measures evidence from purifications where protein i was used as bait, $S_{ji}$ measures evidence from purifications where protein j was used as bait, and $M_{ij}$ measures indirect evidence due to co-occurrence of proteins i and j as preys in the same purifications. Below we give detailed equations used to compute the S and M components,
$$S_{ij} = \sum_{k} s_{ijk} \tag{5}$$
where each value of k indicates a distinct purification in which protein i was used as bait and $s_{ijk}$ represents the corresponding evidence computed using Equation 3. The probabilities P(observation | true_PPI) and P(observation | false_PPI) used to define $s_{ijk}$ were calculated based on estimates of two underlying probabilities: r, representing the probability that a true association will be preserved and detected in a purification experiment, and $p_{ijk}$, representing the probability that a bait-prey pair will be observed for nonspecific reasons. Using these quantities, we calculate
$$s_{ijk} = \log\frac{r + (1 - r)\,p_{ijk}}{p_{ijk}} \tag{6}$$
if protein j appeared as a prey in purification k using bait i and
$$s_{ijk} = \log(1 - r) \tag{7}$$
otherwise. Values for r and $p_{ijk}$ could in principle be estimated in a number of ways. Here we estimated r using the observed frequency of successful purification over a very high confidence set of interactions (the intersection of MIPS complexes and MIPS small scale experiments). For the Krogan et al. (7), Gavin et al. (6), and Ho et al. (4) data, this gave values of 0.51, 0.62, and 0.265, respectively. For $p_{ijk}$ we used an estimate of the probability that a given bait-prey pair would be observed for nonspecific reasons at least once in the dataset, calculated using the Poisson distribution as
$$p_{ijk} = 1 - \exp\left(-f_j\, n_{ik}^{\text{prey}}\, n_{i}^{\text{bait}}\right) \tag{8}$$
where $n_{ik}^{\text{prey}}$ is the number of preys identified in purification k with bait i, $n_{i}^{\text{bait}}$ is the number of times protein i was used as bait, and $f_j$ is an estimate of the nonspecific frequency of occurrence of prey j in the dataset. The relative values of the $f_j$ are estimates of relative rates at which different preys occur nonspecifically (and can be considered measures of relative promiscuity), and the sum of the $f_j$ can be considered to be the fraction of all prey identifications that are nonspecific. Although alternate strategies could be used, for simplicity we allowed the sum of the $f_j$ to be 1, and we computed $f_j$ as Bayesian posterior estimates based on the observed frequency of occurrence of preys in the dataset and the prior hypothesis that all preys occur nonspecifically with equal frequency,
$$f_j = \frac{n_{j}^{\text{prey\_obs}} + n_{\text{pseudo}}}{n_{\text{tot}}^{\text{prey\_obs}} + n_{\text{distinct\_preys}}\, n_{\text{pseudo}}} \tag{9}$$
where $n_{j}^{\text{prey\_obs}}$ is the total number of observations of protein j as a prey, $n_{\text{tot}}^{\text{prey\_obs}}$ is the total number of observations of all preys, $n_{\text{distinct\_preys}}$ is the number of distinct preys observed, and $n_{\text{pseudo}}$ is a number of pseudocounts added for each prey that determines the weight given to the prior hypothesis. Values of 20, 10, and 5 were used for $n_{\text{pseudo}}$ for the Krogan et al. (7), Gavin et al. (6), and Ho et al. (4) datasets, respectively. The value of $n_{\text{pseudo}}$ was the only parameter adjusted to optimize the PE scoring system. Adjustments were done using the MIPS complexes as a reference, and for this reason results of all comparisons made using a reference set based on the MIPS complexes were duplicated using an independent reference set generated from the SGD complexes.
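The bait-prey component can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names and data layout are hypothetical, base-10 logarithms are assumed, and the Poisson form `1 - exp(-f_j * n_prey * n_bait)` is inferred from the variable definitions given above.

```python
import math
from collections import Counter

def prey_frequencies(prey_observations, n_pseudo):
    """Posterior estimate f_j with n_pseudo pseudocounts per distinct prey.

    prey_observations: one entry per prey identification in the dataset.
    The returned frequencies sum to 1.
    """
    counts = Counter(prey_observations)
    denominator = sum(counts.values()) + len(counts) * n_pseudo
    return {prey: (n + n_pseudo) / denominator for prey, n in counts.items()}

def nonspecific_probability(f_j, n_preys_in_purification, n_bait_uses):
    """Poisson probability of at least one nonspecific bait-prey observation."""
    return 1.0 - math.exp(-f_j * n_preys_in_purification * n_bait_uses)

def bait_prey_evidence(observed, r, p):
    """Evidence from one purification: positive if the prey was seen, else log(1 - r)."""
    if observed:
        return math.log10((r + (1.0 - r) * p) / p)
    return math.log10(1.0 - r)

def s_component(purifications, prey, r, freqs):
    """Sum evidence over all purifications that used one bait.

    purifications: list of sets, each the preys found in one purification.
    """
    n_bait_uses = len(purifications)
    total = 0.0
    for preys in purifications:
        p = nonspecific_probability(freqs[prey], len(preys), n_bait_uses)
        total += bait_prey_evidence(prey in preys, r, p)
    return total
```

For example, with `freqs = prey_frequencies(["A", "A", "B", "C"], 1)` and two purifications `[{"A", "B"}, {"C"}]`, calling `s_component(..., "A", 0.5, freqs)` adds a positive term for the first purification and `log10(0.5)` for the second, in which prey A was not recovered.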
The M component was calculated as
$$M_{ij} = \sum_{k} m_{ijk} \tag{10}$$
where each value of k indicates one purification in which proteins i and j were simultaneously observed as preys. In this case, our approach differs slightly from the full Bayesian classifier approach, which would either sum over all purifications or sum over all purifications in which at least one of the two proteins was identified as a prey. We did not use a sum over all purifications because it would require an enormous number of calculations and because estimation of all of the relevant probabilities is itself a very difficult problem. We instead created an approximate implementation of Equation 3 for $m_{ijk}$ calculated only for observations where both preys were observed in the same purification. Significantly we did not include a negative term for the case in which only one of the two proteins was observed as a prey in a purification. This was because two proteins can interact yet also be components of alternate complexes. Our implementation was again based on estimates for two underlying probabilities. Here we used r to represent the probability that a true association between proteins i and j will be preserved and detected during a purification experiment and $p_{ijk}$ to represent the probability that proteins i and j will appear as preys in the same purification for nonspecific reasons.
$$m_{ijk} = \log\frac{r + (1 - r)\,p_{ijk}}{p_{ijk}} \tag{11}$$
We used the same estimate for r as calculated above, and for $p_{ijk}$ we used an estimate of probability that proteins i and j will occur nonspecifically as preys in the same purification at least once in the dataset. This value for $p_{ijk}$ is calculated using the Poisson distribution as
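$$p_{ijk} = 1 - \exp\left(-f_i\, f_j\, n_{\text{tot}}^{\text{prey-prey}}\right) \tag{12}$$
where $f_i$ and $f_j$ are computed as described above, and $n_{\text{tot}}^{\text{prey-prey}}$ is the total number of prey-prey pairs observed in the dataset.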
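The prey-prey component and the final score can be sketched the same way. As before, this is an illustration under stated assumptions rather than the published implementation: function names are hypothetical, base-10 logarithms are assumed, and the Poisson form `1 - exp(-f_i * f_j * n_total_pairs)` is inferred from the definitions in the text. Under that form the nonspecific probability does not depend on the purification, so the sum over co-purifications reduces to a count times a single term.

```python
import math

def prey_prey_probability(f_i, f_j, n_total_prey_pairs):
    """Poisson probability that preys i and j co-occur nonspecifically at least once."""
    return 1.0 - math.exp(-f_i * f_j * n_total_prey_pairs)

def m_component(n_co_purifications, r, p):
    """Indirect evidence: one positive term per purification with both preys.

    No negative term is added when only one of the two preys is observed.
    """
    per_observation = math.log10((r + (1.0 - r) * p) / p)
    return n_co_purifications * per_observation

def pe_score(s_ij, s_ji, m_ij):
    """PE score as the sum of both bait-prey components and the prey-prey component."""
    return s_ij + s_ji + m_ij
```

A pair that never co-purified as preys gets `m_component(0, r, p) == 0`, so its PE score is determined entirely by the two bait-prey terms.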
The Krogan et al. (7) and Gavin et al. (6) data were combined by computing a score for each putative interaction independently over each dataset and adding them as follows.
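[Equation 13: weighted sum of the PE scores computed from the Krogan et al. (7) and Gavin et al. (6) datasets; the weights are not preserved in this copy.]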
This weighted sum was used instead of a straight sum because empirically it was a more effective predictor of PPIs, and in practice this may be due to redundancy of the Krogan et al. (7) LCMS-MS and MALDI-TOF data.
Clustering of PPI Data
First, scaled PE scores were computed for use in hierarchical clustering to minimize variation in scores that does not correspond to variation in the reliability of the represented interactions. For example, PE scores of 10 and 20 may both correspond to extremely reliable interactions, but a score of 0 likely indicates a non-interaction. The scaled scores range from 0 to 1 and were intended to approximate confidence values (i.e. a scaled score of 0.8 would correspond to 80% likelihood of a true interaction). However, these values were not carefully trained and should not be taken as reliable confidence values. Equations used for calculating these values are detailed below. A vector of scaled PE scores was then created for each protein that had at least one scaled score of 0.2 or higher (corresponding to a PE score threshold of 1.85). A value of 1 was assigned for the diagonal elements (representing self-interaction) so that interacting proteins would tend to cluster together. These data were then hierarchically clustered using the uncentered correlation metric and the average linkage method with the Cluster 3.0 program (10). Results were visualized, and figure images were created using the Java Treeview program (50).
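The input to the clustering step can be sketched as follows. The original analysis used Cluster 3.0; this pure-Python fragment only illustrates how the per-protein vectors and the uncentered correlation (equivalent to cosine similarity) might be computed. The function names and the dictionary layout are hypothetical, and restricting the vector columns to the retained proteins is an assumption.

```python
import math

def build_vectors(scaled_scores, min_score=0.2):
    """Build one vector of scaled scores per retained protein.

    scaled_scores: dict mapping frozenset({a, b}) -> scaled PE score.
    Proteins with no score >= min_score are dropped; diagonal entries are 1.
    """
    kept = sorted({protein
                   for pair, score in scaled_scores.items() if score >= min_score
                   for protein in pair})
    vectors = {
        a: [1.0 if a == b else scaled_scores.get(frozenset((a, b)), 0.0) for b in kept]
        for a in kept
    }
    return kept, vectors

def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def uncentered_distance(x, y):
    """Distance used for average linkage: 1 minus the uncentered correlation."""
    return 1.0 - uncentered_correlation(x, y)
```

Setting the diagonal to 1 means two strongly interacting proteins share large entries in each other's columns, which pulls their vectors together under this metric.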
Scaled scores represent a monotonic mapping of PE scores onto the interval 0–1. They would represent confidence values given the approximations that 1) binary interactions in MIPS complexes represent an unbiased subset of the set of all true binary protein-protein interactions, 2) MIPS small scale experiments are 95% accurate, and 3) the set of MIPS complexes is independent of the results contained in MIPS small scale experiments. They were computed using the slope of a "coverage curve" of the cumulative number of interactions detected that were annotated in MIPS complexes versus the total number of interactions identified (see Supplemental Fig. 3). For each PE score, a corresponding slope in the coverage curve was computed by local linear regression. The resulting slopes were made monotonic (as a function of PE score) and smoothed using the pool adjacent violators algorithm (11) and LOESS regression (12). To convert these slopes to scaled scores, they were divided by the fraction of interactions included in the MIPS small scale experiments (excluding two-hybrid studies) that were also contained in the MIPS complexes (461 of 1081). The resulting values were multiplied by 0.95, and an upper bound of 0.99 was applied. Scaled scores below 0.05 were set to 0 for computational expediency.
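Two steps of this procedure are easy to sketch: the pool adjacent violators step that forces the slopes to be non-decreasing, and the final conversion of a slope to a scaled score. The local linear regression and LOESS smoothing are omitted, and the function names and default arguments are hypothetical choices that simply restate the constants given above.

```python
def pool_adjacent_violators(values):
    """Least-squares non-decreasing fit to a sequence (unweighted PAVA).

    values: slopes ordered by increasing PE score.
    """
    blocks = []  # each block holds [sum of values, number of values]
    for value in values:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def scaled_score(slope, overlap_fraction=461 / 1081, accuracy=0.95,
                 upper_bound=0.99, floor=0.05):
    """Convert a coverage-curve slope to a scaled score in [0, 0.99]."""
    value = min(slope / overlap_fraction * accuracy, upper_bound)
    return 0.0 if value < floor else value
```

For instance, `pool_adjacent_violators([1, 3, 2, 4])` pools the out-of-order pair into its mean and returns `[1.0, 2.5, 2.5, 4.0]`.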
Gene Ontology (GO) and GOslim Annotations
GO (13) and GOslim annotations were obtained from SGD (14) on March 7, 2006. Any feature annotated as ORF, pseudo_gene, or transposable_element_gene in SGD was used to calculate the total number of proteins in each GOslim category.
MIPS and SGD Complexes
MIPS complexes were obtained from the MIPS database on March 7, 2006 using the FunCat scheme version 2.0 (15). SGD complexes were extracted from the SGD database using the GO cellular component annotations. GO annotations containing the words "complex," "subunit," "ribosome," "proteasome," "nucleosome," "repairosome," "degradosome," "apoptosome," "replisome," "holoenzyme," or "snRNP" (small nuclear ribonucleoprotein) were used to assign proteins with the same GO annotation to a complex.
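The keyword rule for deriving SGD complexes can be sketched as a simple filter. The input layout (pairs of protein name and GO term name) and the case-insensitive substring match are assumptions made for illustration; the protein names in the usage note are hypothetical.

```python
from collections import defaultdict

COMPLEX_KEYWORDS = (
    "complex", "subunit", "ribosome", "proteasome", "nucleosome",
    "repairosome", "degradosome", "apoptosome", "replisome",
    "holoenzyme", "snRNP",
)

def complexes_from_go(annotations):
    """Group proteins that share a complex-like GO cellular component term.

    annotations: iterable of (protein, go_term_name) pairs.
    Returns a dict mapping each matching GO term to its set of proteins.
    """
    complexes = defaultdict(set)
    for protein, term in annotations:
        lowered = term.lower()
        if any(keyword.lower() in lowered for keyword in COMPLEX_KEYWORDS):
            complexes[term].add(protein)
    return dict(complexes)
```

Given `[("P1", "cytosolic large ribosomal subunit"), ("P2", "cytosolic large ribosomal subunit"), ("P3", "mitochondrion")]`, only the first term matches a keyword, so P1 and P2 are assigned to one complex and P3 is left out.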
MIPS Small Scale Experiments
A collection of 1081 putative protein-protein interactions identified in small scale experiments was obtained from the MIPS database on March 7, 2006 (15). Two-hybrid experiments were excluded from this set because they appeared to be of lower accuracy. The collection from MIPS was used rather than the larger collection contained in the BioGRID database (16) because the collection in MIPS appeared to be of greater accuracy by each of the metrics we considered.
True Positive and True Negative Calculation
True positives were calculated for PPIs within complexes (for MIPS and SGD). True negatives were taken to be connections between proteins in different complexes if the proteins have a different subcellular localization according to Huh et al. (17) and Kumar et al. (18) or show significant mRNA expression anticorrelation (calculated using a standard correlation coefficient, distance >1.108328 (corresponds to R < −0.108328 or a P < 0.001) over a set of 1000 microarray experiments (19)).
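The reference-set rules can be written as a small classifier. This sketch assumes that the correlation distance is 1 − R (so a distance above 1.108328 means R below −0.108328) and that two proteins have "different" localization when both have annotations and share none; the function names and set-based inputs are hypothetical.

```python
def is_anticorrelated(pearson_r, distance_cutoff=1.108328):
    """True if the correlation distance (1 - R) exceeds the cutoff."""
    return (1.0 - pearson_r) > distance_cutoff

def classify_pair(complexes_a, complexes_b, localizations_a, localizations_b, pearson_r):
    """Label a protein pair for the reference set.

    complexes_*: sets of complex names each protein belongs to.
    localizations_*: sets of annotated subcellular compartments.
    Returns "true_positive", "true_negative", or None if the pair is unused.
    """
    if not complexes_a or not complexes_b:
        return None  # both proteins must belong to annotated complexes
    if complexes_a & complexes_b:
        return "true_positive"
    different_location = (bool(localizations_a) and bool(localizations_b)
                          and not (localizations_a & localizations_b))
    if different_location or is_anticorrelated(pearson_r):
        return "true_negative"
    return None
```

Pairs in different complexes that share a compartment and are not anticorrelated are left unlabeled, which keeps ambiguous cases out of both sets.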
Receiver Operating Characteristic (ROC) Curve Calculations
ROC curves were calculated using PE (and in some cases socio-affinity) scores calculated for all pairs of proteins in the full reference set. Thus a sensitivity value of 1 indicates detection of all true positive examples in the reference set, and a 1 − specificity value of 1 indicates detection of all true negative examples in the reference set (that is, every true negative pair is scored as an interaction). For all ROC curves plotted on the same graph, an identical reference set was used to calculate the curves.
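A ROC curve of this kind can be traced by sorting the reference pairs by score and lowering the threshold one pair at a time. The following is a minimal sketch; the function name and input layout are hypothetical, and tied scores are not grouped, which a production implementation would want to handle.

```python
def roc_points(scored_pairs):
    """Return (1 - specificity, sensitivity) points for a scored reference set.

    scored_pairs: list of (score, is_true_positive) tuples, one per reference pair.
    """
    n_positive = sum(1 for _, is_positive in scored_pairs if is_positive)
    n_negative = len(scored_pairs) - n_positive
    if n_positive == 0 or n_negative == 0:
        raise ValueError("reference set needs both positive and negative pairs")
    points = [(0.0, 0.0)]
    true_positives = false_positives = 0
    # Lower the threshold step by step, from the highest score to the lowest.
    for _, is_positive in sorted(scored_pairs, key=lambda item: item[0], reverse=True):
        if is_positive:
            true_positives += 1
        else:
            false_positives += 1
        points.append((false_positives / n_negative, true_positives / n_positive))
    return points
```

With `[(3.0, True), (2.0, False), (1.0, True)]` the curve passes through (0, 0), (0, 0.5), (1, 0.5), and (1, 1).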
Supporting Website and Database
A searchable website, which contains all the PE scores and PPI clustering, has been created at interactome-cmp.ucsf.edu using Perl, PHP (hypertext preprocessor), and a PostgreSQL relational database.
Diploid Bimater Assay
To compare yol054wΔ/yol054wΔ cells to wild type, 1-cm² patches of each were made from independent single colonies, replica-plated to a lawn of tester cells, cultured for 6 h at 30 °C, and again replicated to medium selective for rare matings (20). The number of colonies on each patch was counted manually with the median number of colonies on each patch being used to calculate -fold change (mutant/wild type ratio). Selection was based on histidine prototrophy because experimental genotypes were MATa/MATα, his3Δ/his3Δ (control) or MATa/MATα, his3Δ/his3Δ, yol054wΔ/yol054wΔ (experiment), and the mating testers were MATa his1 or MATα his1.
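The -fold change described here is a ratio of medians. A minimal sketch, with a hypothetical function name and made-up colony counts in the usage note:

```python
from statistics import median

def fold_change(mutant_counts, wild_type_counts):
    """Mutant/wild type ratio of the median colony count per patch."""
    return median(mutant_counts) / median(wild_type_counts)
```

For example, patches with 2, 4, and 6 colonies for the mutant and 1, 2, and 3 for wild type give a ratio of 4 / 2 = 2.0.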
a-like Faker Assay
To compare MATα yol054wΔ his3 to MATα his3, 1-cm² patches from independent single colonies were replica-plated to medium selective for rare matings based on histidine prototrophy as above (21).
| RESULTS |
|---|
Motivated by this framework, we created a novel metric, which we term the PE score. For each putative interaction, this score is a sum of the evidence calculated for each relevant observation in a dataset (detailed equations are provided under "Experimental Procedures").
By several independent metrics including the ability to predict membership in previously annotated complexes, the PE scores appear to identify interactions of higher confidence than the socio-affinity scores of Gavin et al. (6) (Supplemental Fig. 1). PE scores also performed better than scores that only took advantage of the direct bait-prey data from purification experiments (Fig. 1A, "Krogan PPI" point and data not shown). The use of indirect prey-prey information was also a component of the socio-affinity score, and it is conceptually related to a computational approach taken to predict PPIs based on shared interaction partners (22). Although it is clear from those studies (and our own) that there is a wealth of information contained in inferences from indirect prey-prey associations, some care should be taken with interactions inferred solely in this way as it appears that incorrect linkages may occasionally be inferred between proteins sharing a large number of common interaction partners. For this reason, we preserved annotations indicating which interactions were and were not observed directly (see below). We also note that, given a set of purification results, a PE score can be computed for any pair of proteins including, but not limited to, pairs of proteins for which direct or indirect evidence for an interaction was observed. Pairs that never co-purified will either be assigned scores of 0 (if neither protein was used as a bait) or negative scores, indicating that evidence against the potential interactions was collected. Finally it is important to be aware that the negative interaction data may exhibit some bias with respect to tagging artifacts, protein abundance, and mass spectrometry issues; however, we found that including this information in the analysis increases the quality of the final dataset.
A High Confidence Consolidated Dataset
Subjecting the Gavin et al. (6) and Krogan et al. (7) datasets to the same PE log-likelihood scoring function allowed us to directly combine them into a single comprehensive set that encompasses all of the high throughput TAP purification experiments completed to date. We computed combined scores from both the Krogan et al. (7) and Gavin et al. (6) datasets (see "Experimental Procedures" for detailed equations), and not surprisingly, this consolidated dataset provided greater coverage and accuracy than either of the individual datasets (Fig. 1A and Supplemental Fig. 3). In particular, it was possible to capture 50% of the previously reported interactions within protein complexes, although the true coverage may be substantially higher because this reference set likely still contains false positives. We chose not to include the Ho et al. (4) data in our consolidated dataset because it was created using a different experimental method, and its inclusion resulted in negligible changes to the resulting ROC curves (data not shown).
Using the true positive and true negative sets of protein pairs described above not only allowed us to compare the processed results of this consolidated dataset to previous high throughput datasets, but it also provided an opportunity to compare our new results with those obtained in small scale experiments that are often taken as a standard for high accuracy (24, 25). Consistent with earlier analyses, we found that previous high throughput efforts did not reach the level of accuracy obtained in small scale studies (25). However, using the consolidated dataset, it was possible to define a large set of PPIs with the same calculated true positive to true negative rate as the collection of 1081 pairwise interactions obtained from small scale experiments (excluding two-hybrid studies) in the MIPS database (Fig. 1A and Supplemental Fig. 4). This true positive to true negative rate suggests a score threshold (of 3.19) that defines a set of 9074 high confidence interactions among 1622 distinct proteins. Consistent with an earlier analysis based on smaller protein-protein interaction networks (26), we found that this network, which is probably enriched for stable interactions relative to more transient ones, is not scale-free (i.e. although the network contains a substantial number of nodes with high degree, the node degree distribution is not described by a power law) (Supplemental Fig. 5).
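Applying the score threshold and tabulating node degrees can be sketched as below. The function names and the `frozenset`-keyed dictionary are hypothetical, and treating the threshold as inclusive is an assumption.

```python
from collections import Counter

def high_confidence_edges(pe_scores, threshold=3.19):
    """Keep protein pairs whose PE score reaches the threshold.

    pe_scores: dict mapping frozenset({a, b}) -> PE score.
    """
    return {pair for pair, score in pe_scores.items() if score >= threshold}

def degree_distribution(edges):
    """Map each node degree to the number of proteins having that degree."""
    degree = Counter()
    for pair in edges:
        for protein in pair:
            degree[protein] += 1
    return Counter(degree.values())
```

Plotting the resulting counts against degree on log-log axes is the usual way to inspect whether a power law describes the network; a straight line would be expected for a scale-free network.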
The suggestion that this subset of 9074 interactions from the consolidated dataset is of comparable confidence to that of a manually curated set of interactions identified in small scale experiments was tested by three additional independent measures: subcellular co-localization, GO annotation, and mRNA co-expression. First because proteins that interact physically tend to have the same subcellular localizations (17, 18, 25), we compared the published experimentally determined localizations of the putatively interacting protein pairs. Unlike pairs identified in previous high throughput studies, we found that pairs in this high confidence set were more likely to have matching localizations than pairs identified in small scale experiments (Fig. 1B). Next we found that three different classes of GO annotations (cellular component, biological process, and molecular function) were either equally or more likely to match for pairs of interacting proteins in our new set compared with pairs derived from small scale experiments (Fig. 1B). Finally it is known that genes encoding physically interacting proteins are more likely to have similar expression profiles (10, 23, 27, 28), and so we examined the distribution of Pearson correlation coefficients between expression patterns of interacting pairs over a set of 1000 previously published microarray experiments (19). Relative to the pairs identified in small scale experiments, our new high confidence set is significantly enriched for gene pairs with highly similar expression patterns (Fig. 1C and Supplemental Fig. 4). Although this enrichment may reflect better coverage of the ribosome and proteins involved in ribosome biogenesis, the new high confidence set also shows an almost identical lack of anticorrelated gene pairs when compared with the small scale set (Fig. 1C and Supplemental Fig. 4), providing further evidence that the consolidated set of PPIs has a very low false positive rate that compares favorably to that of the MIPS small scale dataset.
Comparison of the PPIs generated in this study with ones deposited into BioGRID (16) (which is a primary source for SGD (14)) from the original studies clearly demonstrates that we have defined a more reliable dataset (Fig. 1A). In particular, the 4456 PPIs unique to our set appear to be of confidence comparable to that of the small scale experiments, whereas those unique to either the Gavin et al. (2963; Ref. 6) or Krogan et al. (4512; Ref. 7) sets deposited in the databases appear to be of markedly lower confidence as judged by cellular localization and GO annotation (Fig. 2). It should be noted that using the socio-affinity scoring system described by Gavin et al. (6) provides a dataset that, although of lower coverage and accuracy than the new datasets we define here, is of higher confidence than the set deposited in the major databases (Supplemental Fig. 1). We also note that although in general they should be considered of lower confidence, the interactions unique to the Gavin et al. (6) or Krogan et al. (7) sets are still likely to contain a number of physiologically relevant associations. The high confidence set of interactions defined here, similar to other PPI datasets derived from high throughput studies (57), shows some apparent bias toward high abundance proteins and against proteins from certain cellular compartments (such as the cell wall and the plasma membrane) (Supplemental Fig. 6). These biases probably reflect experimental limitations but may also to some extent reflect real features of the distribution of protein complexes in yeast.
As a further demonstration that hierarchical clustering of the consolidated data is potentially more informative than the lists of complexes presented in the original studies, we used a different two-color scheme (yellow and red) to highlight interactions that were not present in either the inferred protein complexes from Gavin et al. (6) or Krogan et al. (7) (Fig. 4, A–D). These new interactions may have been identified due to the improved scoring system, the simultaneous consideration of both raw datasets, or a combination of these factors. Consistent with the trends observed in the ROC curves (Fig. 1A), a number of previously characterized PPIs were only seen with the new analyses. For example, six subunits of the transcriptional elongation complex Elongator have been characterized previously (42–44), but only in this new representation was the smallest subunit, Elp6, actually confirmed (Fig. 4A). Similarly in our new merged PPI dataset it is clear that Sec20 is a component of the Dsl1 complex, required for stability of the Q -SNARE complex at the endoplasmic reticulum (45) (Fig. 4B), and that Dad2 and Dad3 are components of the DASH microtubule ring complex (46) (Fig. 4C).
| DISCUSSION |
|---|
80% of interactions accessible to the TAP approach under the conditions used.
In terms of accuracy, we demonstrate here that high throughput identification of protein-protein interactions has reached a new landmark. For the first time, this consolidated dataset can match the reliability of small scale experiments. By simultaneously analyzing the two recent studies with one scoring system and creating a single merged dataset, we were able to generate a large set of PPIs ordered according to a score that indicates the strength of experimental evidence supporting their validity. In particular, we were able to identify a large subset of approximately 9000 of these interactions, which by several independent metrics appear to be of equal or greater accuracy than that attained in a collection of small scale experiments. More valuable than these high accuracy binary interactions, however, may be the portrait of the yeast physical interactome that emerges from them through hierarchical clustering. The weak but reproducible interactions that appear between well defined complexes or between the individual components within these complexes and other proteins can be used to generate a number of hypotheses for future research.
Although identification of stable protein complexes that survive TAP purification may be nearing saturation for S. cerevisiae, much work remains in characterizing PPIs. For example, because a precise estimate of the false positive rates for the PPI datasets presented here remains elusive, a systematic reanalysis of a subset of these putative interactions using small scale methods may be very valuable. Further identification of transient associations between well defined complexes, perhaps by further exploiting the yeast two-hybrid system, should also prove insightful. The dynamics of protein-protein interactions in response to changes in the environment have yet to be systematically explored. Obtaining low resolution structural analyses of the defined complexes using electron microscopy and determining which protein post-translational modifications are involved in mediating PPIs are also of immediate interest. Furthermore, efforts should be made to characterize protein-protein interactions more quantitatively, perhaps by using technologies amenable to detecting PPIs in vivo. Finally, considering that such significant biological information was extracted from yeast using this approach, a similar comprehensive strategy for defining the physical interactome in more complex organisms should be undertaken.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, January 2, 2007, DOI 10.1074/mcp.M600381-MCP200
1 The abbreviations used are: PPI, protein-protein interaction; TAP, tandem affinity purification; PE, purification enrichment; ROC, receiver operating characteristic; MIPS, Munich Information Center for Protein Sequences; SGD, Saccharomyces Genome Database; GO, Gene Ontology.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
c Both authors contributed equally to this work.
d Supported by a predoctoral fellowship from the Burroughs Wellcome Fund.
f Supported by Netherlands Genomics Initiative Fellowship 050-72-417.
i Supported by the Howard Hughes Medical Institute. To whom correspondence may be addressed: University of California, 1700 4th St., San Francisco, CA 94143-2540. Tel.: 415-476-2980; Fax: 415-514-2073; E-mail: weissman{at}cmp.ucsf.edu
j Supported by a Sandler Family Fellowship. To whom correspondence may be addressed: University of California, 1700 4th St., San Francisco, CA 94143-2540. Tel.: 415-476-2980; Fax: 415-514-2073; E-mail: krogan{at}cmp.ucsf.edu
| REFERENCES |
|---|