Protein Interactions

High throughput methods for detecting protein interactions require assessment of their accuracy. We present two forms of computational assessment. The first method is the expression profile reliability (EPR) index. The EPR index estimates the biologically relevant fraction of protein interactions detected in a high throughput screen. It does so by comparing the RNA expression profiles for the proteins whose interactions are found in the screen with expression profiles for known interacting and non-interacting pairs of proteins. The second form of assessment is the paralogous verification method (PVM). This method judges an interaction likely if the putatively interacting pair has paralogs that also interact. In contrast to the EPR index, which evaluates datasets of interactions, PVM scores individual interactions. On a test set, PVM identifies correctly 40% of true interactions with a false positive rate of ∼1%. EPR and PVM were applied to the Database of Interacting Proteins (DIP), a large and diverse collection of protein-protein interactions that contains over 8000 Saccharomyces cerevisiae pairwise protein interactions. Using these two methods, we estimate that ∼50% of them are reliable, and with the aid of PVM we identify confidently 3003 of them. Web servers for both the PVM and EPR methods are available on the DIP website (dip.doe-mbi.ucla.edu/Services.cgi).


Introduction
One thrust of post-genomic biology is the study of the networks of protein interactions that control the lives of cells and organisms. These networks have been reconstructed by detecting pairwise interactions of proteins. To store and manage this information in a systematic way, databases have been created (1,2). These databases provide centralized access to curated experimental data. They have also emerged as resources for the investigation of the large-scale properties of biological networks, in particular their functional and evolutionary aspects (3).
In this paper we explore the usefulness of the Database of Interacting Proteins (DIP) for assessing the reliability of measurements of protein interactions. Until two years ago, when high-throughput screens of protein interaction were developed, the information within interaction databases was collected from the small-scale screens in hundreds of individual research papers. The biological relevance of each interaction had often been thoroughly investigated, sometimes with a repertoire of experimental techniques and often with multiple controls (4,5). These independent, often repeated, observations coupled with controls and curation in the peer-review process enhanced the reliability of the published data. In the past two years, high-throughput, genome-wide detections of protein interactions by yeast two-hybrid (Y2H) and mass spectrometric analysis of protein complexes have tremendously increased the experimental coverage. The new methods can rapidly generate more information than was collected by traditional means in more than a decade (6-10). However, the large size of such datasets makes it impractical to verify individual interactions by the same methods used previously in small-scale experiments (11,12). The question then arises: do these new, high-throughput methods of detecting interactions provide information as reliable as the small-scale experiments? Verifying the interactions from these high-throughput methods is vital (11-15), because only then can the large and small-scale data be combined into one self-consistent interaction network useful for further studies.
To address these issues we have analyzed the complete set of 8063 protein-protein interactions identified in yeast, S. cerevisiae, that are described in DIP as of November 2001. We demonstrate that the subset of interactions obtained through the high-throughput Y2H screens differs in several respects from the subset based only on small-scale or multiple, redundant experiments. Most notably, analysis of the coexpression profiles of the interacting partners leads to the conclusion that, overall, only about 30% of the high-throughput dataset possesses the same characteristic mRNA expression features as the dataset based on the small-scale experiments. To further pinpoint the interactions within the dataset that are likely to be correct, interactions were analyzed between protein pairs that are paralogs of the tested proteins. This resulted in the identification of ~1400 interactions likely to be correct. A reliable, self-consistent set of interactions totaling ~3000 is extracted when these ~1400 are combined with the small-scale experiment datasets and with interactions verified by more than one experiment.

Interaction Datasets
The protein-protein interaction datasets analyzed in this work are listed in Table 1. They are all, except for the RND sets, subsets of the S. cerevisiae protein-protein interaction network (DIP-YEAST; 8063 distinct interactions) extracted from the DIP database on Nov 19, 2001. The INT set contains all the interactions determined by one or more small-scale experiments (a small-scale experiment being defined as one described in a published article listing no more than 100 distinct protein-protein interactions), whereas sets EC2 and EC3 contain interactions determined by at least 2 or 3 independent experiments, respectively. The GY2H set contains all the interactions reported in high-throughput protein-protein interaction screens (6-8, 16, 17) and GY2H' is a subset of GY2H that excludes interactions occurring only in the ITO1 set. ITO1, ITO2, ..., ITO8 are subsets of GY2H that contain all the interactions reported by Ito and coworkers as identified by at least 1, 2, ..., 8 interaction sequence tags (ISTs) in a genome-wide Y2H protein-protein interaction screen (7). These datasets (ITO1, etc.) contain fewer interactions than the numbers reported in the original paper because of some redundancy in the original dataset (interactions reported in both directions: P-P' and P'-P). Also, some of the ORFs could not be traced unambiguously to a unique SWISSPROT, PIR or Genbank entry.
The RND1-3 sets were generated by randomly selecting 100,000 protein-protein pairs from the yeast genome that are not present in DIP. They are dominated by non-interacting pairs: true interactions make up less than 0.15% of each set, assuming approximately 10 interacting partners per protein. This holds even though that assumption overestimates by a factor of two the average number of interacting partners per protein predicted for the S. cerevisiae genome in the recent literature (14,18).
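The construction of an RND-style set can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used to build RND1-3; the function name `sample_random_pairs`, the toy protein identifiers and the fixed random seed are assumptions made for this example.

```python
import random


def sample_random_pairs(proteins, known_interactions, n_pairs, seed=0):
    """Draw n_pairs distinct unordered protein pairs that are not known interactions."""
    rng = random.Random(seed)
    proteins = sorted(set(proteins))
    protein_set = set(proteins)
    # Store known interactions as unordered pairs so that A-B equals B-A.
    known = {frozenset(pair) for pair in known_interactions}
    total = len(proteins) * (len(proteins) - 1) // 2
    # Only known pairs of two distinct listed proteins reduce the sampling pool.
    excluded = sum(1 for k in known if len(k) == 2 and k <= protein_set)
    if n_pairs > total - excluded:
        raise ValueError("not enough candidate pairs to sample from")
    chosen = set()
    while len(chosen) < n_pairs:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            chosen.add(pair)
    return chosen


# Toy usage: four proteins, one known interaction, five remaining pairs.
toy_pairs = sample_random_pairs(["P1", "P2", "P3", "P4"], [("P1", "P2")], 5)
```

Rejection sampling is adequate here because, as noted above, known interactions are a tiny fraction of all possible pairs, so almost every draw is accepted.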

Functional Correlation
Proteins have been assigned to 44 "cellular role", 58 "functional" and 29 "compartment" categories in the Yeast Protein Database (YPD) (19,20). Cellular role is defined as the major biological process involving the protein, and function as the principal structural, regulatory or enzymatic function of the protein. The YPD categories are broad and a large percentage of proteins are associated with more than one cellular role, function or compartment (subcellular location).
The functional annotation, cellular role and compartment, where they exist, were collected for all the S. cerevisiae ORFs from the YPD database. We counted a correlation if the two interacting proteins shared one or more annotated functions, in a manner analogous to Schwikowski and coworkers (15). The background probability that two proteins share a common function was calculated using all possible pairs of proteins annotated in a given category.
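The counting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the original analysis code: the function names and the toy annotations are hypothetical, and pairs in which either protein lacks an annotation are skipped, which is one plausible reading of the text.

```python
from itertools import combinations


def shares_annotation(a, b, annotations):
    """True if proteins a and b have at least one annotated category in common."""
    return bool(annotations.get(a, set()) & annotations.get(b, set()))


def observed_agreement(pairs, annotations):
    """Fraction of interacting pairs (both annotated) that share a category."""
    annotated = [(a, b) for a, b in pairs
                 if annotations.get(a) and annotations.get(b)]
    if not annotated:
        return 0.0
    hits = sum(shares_annotation(a, b, annotations) for a, b in annotated)
    return hits / len(annotated)


def background_agreement(annotations):
    """Fraction of all possible pairs of annotated proteins that share a category."""
    proteins = [p for p, cats in annotations.items() if cats]
    all_pairs = list(combinations(proteins, 2))
    if not all_pairs:
        return 0.0
    hits = sum(shares_annotation(a, b, annotations) for a, b in all_pairs)
    return hits / len(all_pairs)


# Toy annotations (invented for illustration only).
toy_annotations = {
    "A": {"protein folding"},
    "B": {"protein folding", "protein translocation"},
    "C": {"protein translocation"},
    "D": {"small molecule transport"},
}
toy_interactions = [("A", "B"), ("B", "C"), ("A", "D")]
observed = observed_agreement(toy_interactions, toy_annotations)
background = background_agreement(toy_annotations)
```

Comparing `observed` with `background` mirrors the comparison of each dataset against the horizontal background line in Figure 2.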

Expression Profile Reliability Index
The expression profile reliability index, the EPR, was extracted from the interaction datasets by solving the equality-constrained linear least squares problem defined by equation (2) (see Results) using the LAPACK implementation of the GRQ factorization method (21) and a discrete representation of the $\rho(d^2)$ distributions (up to 30 bins, each 1.25 wide; only bins with at least 5 counts were included in the calculations). $\chi^2$ was calculated assuming a binomial distribution of the error for the individual bins in each of the histograms. The accuracy of the fitted parameter was estimated using a bootstrapping approach with 5,000 synthetic datasets as described in (22).
The Euclidean expression distance between proteins A and B, $d_{AB}$, was calculated according to the equation

$$d_{AB}^2 = \sum_{i} \left( x_{A,i} - x_{B,i} \right)^2$$

where $x_{N,i}$ is the log-ratio of the expression level of the N-th protein under the i-th condition, as customarily reported by the Patrick Brown group (23). The sum runs over a set of 12 distinct shock conditions, using the data provided in the supplementary materials by Gasch et al. (23).
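A small Python sketch of the squared Euclidean distance between two expression profiles is given below. It assumes the inputs are already log-ratios, one value per condition; the function name and the toy numbers are illustrative and not taken from the original data.

```python
def squared_expression_distance(profile_a, profile_b):
    """Sum of squared differences between two log-ratio expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same conditions")
    return sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))


# Toy log-ratio profiles over three conditions (the paper uses 12).
x_a = [1.0, 0.0, -1.0]
x_b = [0.0, 0.0, 1.0]
d2 = squared_expression_distance(x_a, x_b)  # 1 + 0 + 4 = 5.0
```

Histogramming such values over all pairs in a dataset gives the discrete distribution that enters the EPR fit.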

Paralogous verification method
The paralogous verification method (PVM) validates interacting pairs using the existence of paralogous interactions. Paralogs were collected by performing intraproteome comparisons using PSI-BLAST (24). Each predicted ORF product of S. cerevisiae served as a query sequence against the entire database of S. cerevisiae. The PSI-BLAST comparisons were performed using the BLOSUM62 substitution matrix and the seg filter to mask compositionally biased regions in the query sequence. To arrive at

Yeast interactions in DIP
The set of known protein-protein interactions in budding yeast (S. cerevisiae), as documented in DIP in November 2001, contains ~8000 distinct interactions between 4150 proteins (Table 1). Approximately 2000 of these interactions were detected by small-scale experiments described in more than 800 research articles. The remainder (~6000) is derived from four independent high-throughput Y2H screens (Figure 1). Comparison of the datasets shows that the overlap of detected interactions among the four studies, as well as between any of these datasets and the set derived from the small-scale interaction screens, is small. This observation, already made by others (12,14,15,25), is the motivation of the present work.
There are many possible reasons for the lack of overlap. These include the use of different yeast strains, differences in the quantitative measures of interaction and the use of non-physiological conditions in experiments. Additionally, high-throughput protein-protein interaction screens, such as those utilizing Y2H methods, increase the chance of identifying artifactual partners by exhaustively testing arbitrary protein-protein interactions. Such artifacts include partners that can physically interact but are never in close proximity to one another in the cell, because of distinct subcellular localization or expression at different times during the life cycle.
All these factors can lead to the observation of either false negatives (interactions that cannot be detected under the conditions used) or false positives (physical interactions without biological meaning). Here we concentrate on two problems: (1) identifying the fraction of false positives within the high-throughput datasets (using EPR) and (2) identifying true positives (using PVM). We do this by relating the global properties of these datasets to those of the reference set of biologically relevant interactions extracted from the DIP database. The underlying assumption of this approach is that, by virtue of its size and diversity, this reference dataset (INT) captures the most prominent features of biologically relevant protein-protein interactions and therefore can be used to judge the quality of other interaction datasets.

Functional Correlation
We began by asking what level of functional resemblance we can find between two interacting S. cerevisiae proteins in DIP. For this study, we divided the interacting pairs into four datasets: DIP-YEAST includes all pairs; EC3 and EC2 are datasets with at least 3 or 2 observations supporting the interaction, respectively; and INT is the set of interactions observed in at least one small-scale experiment. A full description of the subsets is given in Experimental Procedures.
Figure 2 shows the percentage agreement of function, cellular role, and compartment as defined by the Yeast Protein Database (YPD) (20,26) for the pairs. The horizontal black line gives the background percentage agreement. It shows that if we pick two proteins at random from the set with known functions, the members of ~18% of pairs agree in function. The difference between the observed agreement and this background is large in all cases.
These results can be compared to those of other investigators. We find that ~66% of the DIP-YEAST pairs share one or more annotated compartments, compared to the 78% found by Fields and coworkers (15) in a global analysis of 2,709 published interactions of S. cerevisiae proteins. Correlation of function was also tested in a different way by Vidal and coworkers (27), who examined whether interacting pairs were found within the same gene expression cluster. These gene expression clusters are generally believed to correspond to functional categories (27-30). As with the results here, correlation of the functional categories based on the gene expression clusters was higher than random but still relatively low.
Notice in Figure 2 that the INT set and the EC2 and EC3 sets show substantially higher correlation than the DIP-YEAST set. The relative lack of agreement of compartment within the DIP-YEAST data (63%) could be due, in part, to the large number of interactions between nuclear and cytoplasmic proteins (15); these are expected, as there are many reports of proteins shuttled between these compartments through the nuclear pore (31). The INT dataset may show higher correlation because of a better relationship between functional annotation and the protein interactions described in the small-scale studies. However, if we select random pairs of proteins from INT, as opposed to the entire set, a similar level of random correlation is observed. This points to a similar level of multiple annotation and possible crosstalk in both cases.
It should also be remembered that the annotations in these categories may have been transferred from homologous proteins without experimental confirmation and as such are subject to error. However, when the percentage correlations are calculated for the set of experimentally annotated proteins only, they are similar to the results described above.
Function (the principal structural, regulatory or enzymatic function) is the least conserved of the three properties. This is not surprising, as an interaction between two proteins does not demand that they share an identical function; rather it demands that they are linked in a functional network. Thus, the linkages observed between functional groups could well be biologically meaningful. For example, Schwikowski and coworkers (15) found that there are a large number of interactions between the categories of protein folding and protein translocation. Therefore, in the assessment of an individual interaction, identical assignments of function or cellular role should not always be expected; rather, consideration should be given to the relationships between the functions of the proteins.
The poorer conservation of function, compartment and cellular role within the DIP-YEAST dataset than within the INT, EC2 and EC3 datasets suggests that small-scale studies yield more reliable results than high-throughput studies. This calls for methodologies that determine the reliability of a dataset and the reliability of any given interaction. Here we introduce two computational methods, which use mRNA expression data and sequence analysis, respectively, to assess the reliability of the high-throughput datasets and to identify protein-protein interactions that are likely to be correct. An overview of the two methods is offered in Figure 3.

mRNA Expression profiles of interacting pairs: the EPR method
It has been demonstrated numerous times that functionally related genes tend to be expressed in a concerted fashion (27-29). Here, we utilize this observation to assess the quality of datasets of interacting proteins. Specifically, we define a distance measure $d^2$ between the expression levels of the mRNAs encoding the members of an interacting pair (see the equations in Experimental Procedures). Then we characterize a dataset of protein interactions by plotting the fraction of pairs having each value of $d^2$. This is the basis of the EPR (Expression Profile Reliability) index method illustrated on the left of Figure 3.
Figure 4A shows the normalized distribution of expression level distances ($d^2$) for several sets of protein interaction data. The curve RND1 gives the distribution for a randomly generated set of protein pairs. Notice that it is the broadest distribution shown, with the lowest peak. The curve INT is for the small-scale dataset, and is seen to have the highest peak and sharpest distribution. These differences are statistically significant (confidence level $p = 10^{-140}$), as inferred from the Kolmogorov-Smirnov test (22). We take the INT set to be a reference set of interacting proteins and the RND1 set to be representative of non-interacting proteins.
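On the basis of the $\rho(d^2)$ distribution curves, we define a parameter, $\alpha_{EPR}$, that characterizes the expected accuracy of a dataset of protein interactions. To do so we notice that the expression-distance profile of the GY2H (genome-wide yeast two-hybrid interactions) set appears to be intermediate between the reference interacting (INT) and non-interacting (RND1) sets. The simplest model explaining this behavior assumes that the Y2H experiments result in two types of protein-protein pairs: true positives (biologically relevant interactions), drawn randomly from the interacting population, and false positives, drawn randomly from the non-interacting population. The resulting overall distribution of expression distances obtained for an experimental set, $\rho_{exp}$, is then described by the equation

$$\rho_{exp}(d^2) = \alpha_{EPR}\,\rho_i(d^2) + (1 - \alpha_{EPR})\,\rho_n(d^2)$$

where $\rho_i$ and $\rho_n$ are the expression distance probability distributions for the interacting and non-interacting protein pairs, and the expression profile reliability index, $\alpha_{EPR}$, corresponds to the fraction of true positives in the experimental dataset. The $\rho_n$ distribution can be obtained as the distribution of expression distances for all protein pairs within a genome, because the full genome distribution is of vast size (~$9 \times 10^6$ pairs for S. cerevisiae) and must be dominated by the non-interacting pairs. The $\rho_i$ distribution can be approximated by the distribution of the expression distances for all the reliable interactions present in DIP-YEAST (for example INT). The latter assumption seems to be valid, as the set of interactions described in DIP-YEAST was in the majority of cases obtained in a manner that did not rely on the expression levels of the interacting partners. Therefore, it can be treated as a representative sample of the entire protein-protein interaction set, random with respect to the expression levels of the interacting proteins.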
A linear least-squares fit of the GY2H dataset to the model described by Equation 1 allows us to evaluate the $\alpha_{EPR}$ parameter. The $\alpha_{EPR}$ is calculated as 31 ± 3% (Table 2) for the GY2H data, suggesting that approximately 70% of the reported pairs in this set are, in fact, false positives. In order to verify that $\alpha_{EPR}$ indeed reflects the expected accuracy of the experimental results, subsets of the GY2H corresponding to varying stringency of selection were constructed as reported by Ito (7). Ito and coworkers created these sets by identifying those interactions with at least 1, 2, and up to 8 interaction sequence tags (ISTs), labeled here ITO1 to ITO8 respectively. As expected, the accuracy of the resulting subsets, as evaluated by $\alpha_{EPR}$, increases with increased selection stringency (Figure 4B). This indicates that the EPR index can be used to characterize the accuracy of experimental, large-scale protein-protein interaction datasets, and corresponds roughly to the fraction of pairs that is biologically meaningful. However, the error on $\alpha_{EPR}$ increases rapidly with decreasing dataset size, limiting the applicability of EPR in general to large (> 500 interactions) datasets.
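The fit can be illustrated with a simplified Python sketch. It assumes the two-component mixture model, in which the experimental histogram is a weighted sum of the interacting and non-interacting histograms with weights α and 1 − α. Substituting the constraint that the weights sum to one leaves a single free parameter with a closed-form least-squares solution. This is not the LAPACK GRQ routine referred to in Experimental Procedures, it ignores the per-bin error weighting, and the function name and toy histograms are invented for illustration.

```python
def fit_alpha(rho_exp, rho_i, rho_n):
    """Least-squares estimate of alpha in rho_exp = alpha*rho_i + (1 - alpha)*rho_n."""
    numerator = 0.0
    denominator = 0.0
    for e, i, n in zip(rho_exp, rho_i, rho_n):
        diff = i - n  # how far the two reference histograms differ in this bin
        numerator += (e - n) * diff
        denominator += diff * diff
    if denominator == 0.0:
        raise ValueError("reference distributions are identical; alpha is undetermined")
    return numerator / denominator


# Toy three-bin histograms; the experimental one is built as a 30/70 mixture.
toy_rho_i = [0.5, 0.3, 0.2]
toy_rho_n = [0.2, 0.3, 0.5]
toy_rho_exp = [0.3 * i + 0.7 * n for i, n in zip(toy_rho_i, toy_rho_n)]
alpha = fit_alpha(toy_rho_exp, toy_rho_i, toy_rho_n)
```

Bins in which the two reference histograms coincide contribute nothing to the estimate, which is one way to see why small datasets with noisy histograms give a poorly determined index.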
The error-prone high-throughput Y2H screens can be filtered by excluding the least reliable protein pairs, those that occur only in the ITO1 set. In fact, the overall reliability of the resulting GY2H' set increases to roughly 50%, as judged by the EPR index (Table 2). However, this improved reliability comes at the price of nearly halving its size.

Using Paralogous interactions to verify protein-protein interactions: the PVM method
The reliability of a given protein interaction can be evaluated by the presence of paralogous interactions. The basis for this is that if two proteins are paralogs, then the proteins that they are observed to interact with are often also paralogs. This observation is related to the notion of "interologs" proposed by Vidal and coworkers (9).
To validate a given interaction between a pair of proteins P1 and P2, all the paralogs of P1 and P2 are collected and the number of interactions observed in DIP between these two families, excluding the interaction P1 to P2, is counted (Figure 3). This count is the Paralogous Verification Method (PVM) score.
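The scoring rule can be sketched in Python as below. This is a minimal illustration, not the DIP server implementation: the function name `pvm_score`, the dictionary layout and the toy identifiers are hypothetical, and the sketch assumes that each family consists of the protein itself plus its paralogs.

```python
def pvm_score(p1, p2, paralogs, interactions):
    """Count known interactions between the families of p1 and p2, excluding p1-p2."""
    # Assumption: a family is the protein itself plus its listed paralogs.
    family1 = {p1} | set(paralogs.get(p1, ()))
    family2 = {p2} | set(paralogs.get(p2, ()))
    known = {frozenset(pair) for pair in interactions}
    target = frozenset((p1, p2))
    counted = set()
    for a in family1:
        for b in family2:
            pair = frozenset((a, b))
            if pair != target and pair in known:
                counted.add(pair)  # a set avoids counting A-B and B-A twice
    return len(counted)


# Toy data: P1 and P2 each have one paralog.
toy_paralogs = {"P1": {"P1a"}, "P2": {"P2a"}}
toy_known = [("P1", "P2"), ("P1a", "P2a"), ("P1", "P2a"), ("P3", "P4")]
score = pvm_score("P1", "P2", toy_paralogs, toy_known)
```

In this toy case two paralogous interactions (P1a-P2a and P1-P2a) support the P1-P2 link, so the score is greater than zero and the interaction would be predicted to be true.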
To ascertain the ability of this method (PVM) to identify true interactions and ignore false interactions, its behavior on datasets of interacting proteins must be compared to its behavior on datasets of non-interacting proteins. We generated the datasets of non-interacting proteins computationally because of the difficulty in crafting such a set from reports within the literature. The three random sets of protein pairs (RND1 to RND3) described in Experimental Procedures were used as the non-interacting sets; although these sets will not be entirely free of interactions, the percentage should be very small (see Experimental Procedures).
Three sets of protein interactions were used as true interaction sets: the INT, EC2 and EC3 sets. The EC2 and EC3 sets are smaller than the INT set (Table 1); they can be used by PVM, but are not suitable as reference datasets for EPR because the uncertainty in $\alpha_{EPR}$ is large for such small datasets (Table 2).
The efficacy of the PVM method can be illustrated by a selectivity-sensitivity curve (also known as a receiver-operator characteristic curve), shown in Figure 5. It shows that a score threshold that admits few (~1%) false positives is sensitive to ~40% of the true interactions. That is, the method shows high specificity but a lower sensitivity. This lack of sensitivity in part reflects the lack of paralogs of some proteins; interactions between such proteins cannot score above zero. If the INT, EC3 and EC2 sets are modified to consider only those pairs where at least one of each of the pairs has more than one paralog (Figure 5), an improvement in sensitivity of ~10% is observed. The low sensitivity is therefore not solely caused by the lack of paralogs, but is perhaps due to both the lack of experimental data and, in a number of cases, a lack of any paralogous interactions.
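One point on such a curve can be computed as in the following Python sketch. The function name is hypothetical and the score lists are toy values invented to mirror the ~40% sensitivity and ~1% false positive rate quoted above; they are not the actual PVM scores.

```python
def rates_at_threshold(true_scores, false_scores, threshold=0):
    """Return (sensitivity, false positive rate) when scores above threshold are accepted."""
    if not true_scores or not false_scores:
        raise ValueError("both score lists must be non-empty")
    true_positives = sum(1 for s in true_scores if s > threshold)
    false_positives = sum(1 for s in false_scores if s > threshold)
    sensitivity = true_positives / len(true_scores)
    false_positive_rate = false_positives / len(false_scores)
    return sensitivity, false_positive_rate


# Toy scores: 2 of 5 reference interactions and 1 of 100 random pairs score above zero.
toy_true_scores = [0, 2, 0, 1, 0]
toy_false_scores = [0] * 99 + [1]
sensitivity, fpr = rates_at_threshold(toy_true_scores, toy_false_scores)
```

Sweeping `threshold` over the observed score values traces out the rest of the curve.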
There is another possible source of error in PVM, arising from the erroneous identification of interactions in Y2H experiments. For example, if P1' and P2' are paralogs of P1 and P2 respectively (where P denotes a protein), it is possible that in vivo only the P1/P2 and P1'/P2' interactions take place. However, Y2H may detect interactions between P1/P2' and P2/P1' as well as the true interactions P1/P2 and P1'/P2'. The calculated error rate of PVM of ~1% suggests that this problem is small. However, as with any computational prediction technique, the results should be considered in the light of other data such as subcellular localization or function of the proteins.
The receiver-operator characteristic curve also demonstrates that the magnitude of the score is unimportant: a score greater than zero is enough to indicate a high probability that an interaction exists. Thus, if a given low-reliability interaction (such as one from Y2H (32)) has paralogs but a score of zero, it can be validated either directly or by testing for a paralogous interaction.
It is clear that PVM can only be used in cases where the proteins involved in the interaction have paralogs. In S. cerevisiae, 3130 of the 6356 proteins (~50%) have paralogs. This level of paralogs appears to be typical. Koonin and coworkers (33) found that 46% of the E. coli genome has paralogs, and ~2/3 of the proteins within the COG database (34) are found to have paralogs.

Uses of EPR and PVM
EPR can assess the overall quality of an interaction dataset, but cannot assess the quality of individual interactions. Figure 4A demonstrates that the similarity in the expression levels of interacting (INT) and non-interacting (RND1) sets, as judged by the changes in the mRNA levels, can vary over a large range of $d^2$, and that the two distributions overlap significantly. Therefore, it is generally not possible to use the similarity of the expression profiles as a predictor of protein-protein interactions without using other sources of information. However, the profiles do allow an estimation of the percentage of biologically relevant interactions within a set.
PVM, on the other hand, is able to assess the quality of individual protein-protein interactions. However, it can also estimate the total number of biologically relevant interactions within a dataset. This estimation is based on the observation that in the subsets of EC2, EC3 and INT with paralogs, ~50% of the interactions are identified by PVM (Table 1). Thus, PVM should identify ~50% of the biologically relevant interactions within any given dataset. The number of true interactions within a set can, therefore, be estimated as twice the number given by PVM. In the DIP-YEAST set only 1428 of the 6083 interactions that could score did. Thus the expected number of true interactions is around 2800 within the subset with paralogs. This suggests that ~2800 of ~6000 interactions are valid, giving an error rate for the overall DIP-YEAST of around 50%. This compares well with the EPR estimation of ~47% given in Table 2.
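The estimate in this paragraph can be written out as a short worked calculation. The symbols $N_{\mathrm{PVM}}$, $N_{\mathrm{true}}$, $N_{\mathrm{scorable}}$ and $s$ are labels introduced here for illustration; $s \approx 0.5$ is the fraction of reference interactions with paralogs that PVM identifies, as stated in the text.

```latex
\begin{align*}
N_{\mathrm{true}} &\approx \frac{N_{\mathrm{PVM}}}{s} = \frac{1428}{0.5} = 2856 \approx 2800, \\
\frac{N_{\mathrm{true}}}{N_{\mathrm{scorable}}} &\approx \frac{2856}{6083} \approx 0.47, \\
\text{error rate} &\approx 1 - 0.47 = 0.53 \approx 50\%.
\end{align*}
```

The calculation applies only to the interactions that can score at all, that is, those whose proteins have paralogs.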
The ability of PVM to identify roughly half the true interactions within a given dataset means that it can also be used to indicate the quality of a dataset, by means of the percentage of identified interactions. The different Ito subsets described in Results and in Experimental Procedures were examined separately using PVM, and it was found that as the number of independent observations of the interactions increased from 1 to 8, the percentage of the dataset identified as correct by PVM increased (Table 1), much as the EPR index improves (Figure 4B). The efficacy of PVM can also be demonstrated by examining the EPR of the subset of DIP-YEAST selected by PVM. This dataset behaves, within experimental error, like the INT set (Table 2).

DIP yeast interactions estimated to be correct
There are about 5600 interactions within the DIP-YEAST dataset identified solely in the genome-wide Y2H screens. These include roughly 3000 interactions that were reported by Ito and coworkers (7) on the basis of only a single IST. Although these interactions are expected to contain many false positives (35), the results in Tables 1 and 2 demonstrate that they still contain a significant proportion of true positives, and a method such as PVM is ideally suited to identify at least some of them.
A subset of the DIP-YEAST interactions believed to be correct can be identified [...] This set is denoted as the CORE and is available on the DIP website (http://dip.doe-mbi.ucla.edu). Four hundred and fifty four of the CORE interactions are [...] similar to the sets believed to be correct (INT, EC2 and EC3). The gross number of interactions predicted to be correct based on the EPR index of DIP-YEAST is ~4000. Thus, although PVM is able to identify putatively correct interactions with very high selectivity, it is unable, even with the inclusion of INT and EC2, to extract from DIP-YEAST all those interactions that are estimated to be correct by EPR.

Figure 1
[...] yeast proteins as detected in several studies. A Venn diagram illustrates the overlap between the datasets in DIP-YEAST. Each oval represents a high-throughput Y2H study and the overlaps between the Y2H studies are given at the intersections. The number in parentheses represents those interactions that have been determined by small-scale methods (see Experimental Procedures for more details). Thus, the numbers within parentheses represent the INT set. Notice the small overlap among the datasets.

Figure 3
The PVM procedure: if two proteins P1 and P2 are considered to interact, the paralogous families of P1 and P2 are collected. The number of interactions between these families within the DIP database is counted, excluding the P1 to P2 link. This count is the score of the interaction. In this case the link P1 to P2 scores 2. If this score is greater than zero, the interaction is predicted to be true.

Figure 4
A) The INT distribution represents interacting proteins and RND1 represents non-interacting proteins. Notice that all the curves are normalized to unit area.
B) The dependence of the expression profile reliability index ($\alpha_{EPR}$), calculated for the subsets of the genome-wide yeast two-hybrid data from (7), on the stringency of the selection procedure as reflected by the number of ISTs observed. Notice that $\alpha_{EPR}$ tends to increase with higher stringency of selection of interacting proteins. Notice also that the uncertainty of $\alpha_{EPR}$ grows with higher stringency, because there are fewer interactions.

Table 2
EPR index, $\alpha_{EPR}$, calculated for several subsets of DIP-YEAST (see Results) using the data reported by the Patrick Brown group (23).