Advances in proteomics technologies have enabled novel protein interactions to be detected at high speed, but they come at the expense of relatively low quality. Therefore, a crucial step in utilizing the high throughput protein interaction data is evaluating their confidence and then separating the subsets of reliable interactions from the background noise for further analyses. Using Bayesian network approaches, we combine multiple heterogeneous biological evidences, including model organism protein-protein interaction, interaction domain, functional annotation, gene expression, genome context, and network topology structure, to assign reliability to the human protein-protein interactions identified by high throughput experiments. This method shows high sensitivity and specificity to predict true interactions from the human high throughput protein-protein interaction data sets. This method has been developed into an on-line confidence scoring system specifically for the human high throughput protein-protein interactions. Users may submit their protein-protein interaction data on line, and the detailed information about the supporting evidence for query interactions together with the confidence scores will be returned. The Web interface of PRINCESS (protein interaction confidence evaluation system with multiple data sources) is available at the website of China Human Proteome Organisation.
Protein-protein interactions play important roles in defining most cellular functions (
). Traditionally protein interactions are studied individually by top-down, hypothesis-driven approaches with experiments designed to derive high quality detailed interaction information. Recently advances in proteomics technologies have enabled a large number of novel protein interactions to be detected at an unexpected speed by yeast two-hybrid screens (
) estimated that approximately half of the interactions obtained from high throughput experiments might be false positives. These false positives may connect the unrelated proteins, complicating and even misleading the elucidation of biological significance (
The abbreviations used are: HTPID, high throughput protein interaction data set; AUC, area under the ROC curve; DDI, domain-domain interaction; FP, false positive; GO, Gene Ontology; LR, likelihood ratio; ROC, receiver operating characteristic; SSBP, smallest shared biological process; STRING, search tool for the retrieval of interacting genes/proteins; TP, true positive.
is evaluating the reliability of the interactions and then separating the subset of credible interactions from background noise.
Several methods have been developed previously to predict the true protein interactions from the high throughput protein interaction data sets, such as data set intersection (
). Most of these methods are based on a single type of biological evidence. Although these “Single Evidence Models” have proved effective to some degree, none of them achieves high specificity and good sensitivity at the same time (
). To reduce the intrinsic false positives and false negatives from a single source, in recent years researchers have tended to integrate multiple data sources. Using experimental, topological, and Gene Ontology (
) established six criteria to evaluate their human yeast two-hybrid data. Each interaction is awarded one quality point for each fulfilled criterion, and interactions are then ranked by their number of quality points. Because this process resembles a voting procedure, we refer to this scoring model as the “Simple Voting Model.” This voting procedure has some limitations. Biological evidences have to be transformed into a binary format by setting a cutoff value, which inevitably involves a loss of information. Moreover, because different cutoffs can change the results of the voting procedure, it is often difficult to set a proper cutoff.
To eliminate these limitations in the Simple Voting Model, here we introduce a Bayesian approach to integrate multiple biological evidences for the confidence scoring system. The Bayesian approach is a probability-based derivation method, which is suitable for combining evidences from multiple heterogeneous biological features, especially robust on incomplete and uncertain data (
) integrated sequence homology, function similarity, and interacting domains to evaluate the reliability of the yeast high throughput protein interaction data and gained high sensitivity and specificity. However, their strategy did not perform well on the human protein interaction data (
) in several aspects. 1) We integrate three more genomic features: gene coexpression, genome context, and network topology. Although two of these features, gene coexpression and network topology, have been used as the individual evidence for confidence assessment (
) used only 1,479 yeast interactions in their analyses. 3) We stratify the same biological evidence into different confidence bins and then use likelihood ratios (LRs) to measure the reliability of these bins. These improvements are expected to increase the sensitivity and specificity of the Bayesian scoring system. Lastly, to facilitate use by the community, we developed this strategy into a Web service, namely PRINCESS (protein interaction confidence evaluation system with multiple data sources). PRINCESS is designed to address the requirements not only for confidence assessment of high throughput interactions but also for their biological annotation with multiple biological evidences.
The main strategy of PRINCESS is to use likelihood ratios to assess the reliability of individual biological evidences based on golden standard data sets and then to combine these individual likelihood ratios by a Bayesian model to assign confidence scores to the high throughput protein interactions (Fig. 1).
Golden Standard Data Sets
To estimate the LR of each evidence, we construct golden standard positive and negative data sets. Protein interaction data in the Human Protein Reference Database are all obtained by critical reading of the literature (
); therefore, we download the protein interaction data deposited in the Human Protein Reference Database (released September 13, 2005) as the golden standard positive. It is difficult to find an experimental negative data set. Here we construct a golden standard negative data set consisting of the interactions between 2,110 nuclear proteins and 1,021 plasma membrane proteins obtained from the Gene Ontology consortium (
); therefore if the orthologs of a pair of interacting human proteins can interact in another model organism, the interaction will be regarded as high confidence. Here we download the model organism protein interaction data sets from Database of Interacting Proteins (
). For these model organism protein interaction data, there are many variables that might correlate with their reliability. Therefore, we use a J48 pruned tree to stratify them into various confidence bins as implemented in the Weka software package (
). Therefore, if an interaction contains a pair of interaction domains, it will be more reliable. Considering that none of the current domain-domain interaction (DDI) databases have a satisfying coverage, we integrate three available strategies in our scoring system. We download the predicted DDI data from InterDom (
) to identify the possible domain interactions. The DDI score from the InterDom database and the domain enrichment ratio are used as the explanatory variables to classify the confidence bins. In Fig. 2, B, C, and D, an apparent correlation is observed between LR and these variables, suggesting that all three strategies are suitable for confidence assessment.
Functional Annotation Data—
Interacting proteins often participate in the same biological process (
) to measure the functional similarity of a pair of proteins. As shown in Fig. 2E, protein pairs with a smaller SSBP tend to have higher LRs, in accordance with the previous report that an interaction between proteins sharing a more specific GO term is more reliable (
Genes encoding interacting proteins always have a certain genome context, such as gene co-occurrence, gene neighborhoods, and gene fusion. These contexts have been used to predict the protein interactions (
). Using the confidence score in STRING, we classify genome context into different confidence bins (supplemental Fig. S2). From Fig. 2F, we found that genome context data are also suitable for confidence evaluation.
Genome-wide Gene Expression Data—
Interacting proteins tend to be coexpressed especially in the same protein complexes or biological processes (
. For each profile, gene pairs are grouped into 20 bins according to pairwise expression Pearson correlation coefficient values. Fig. 2G shows a significant correlation between the expression Pearson correlation coefficient value and the LR, suggesting that gene coexpression can also be used for confidence assessment.
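The binning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `pearson` and `correlation_bin` are hypothetical, and the use of 20 equal-width bins over the range [-1, 1] is an assumption, since the text states only that 20 bins are used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def correlation_bin(r, n_bins=20):
    """Map a correlation in [-1, 1] to a bin index 0..n_bins-1.

    Equal-width bins are an assumption made for this sketch.
    """
    index = int((r + 1.0) / 2.0 * n_bins)
    return min(max(index, 0), n_bins - 1)


# Two hypothetical expression profiles measured over three conditions
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(r, correlation_bin(r))
```

Each bin would then receive its own LR from the golden standard data sets, in the same way as the confidence bins of the other evidence types.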
Network Topological Structure Data—
Interactions in potential functional modules are of higher confidence than others because cellular functions are often carried out by stably or transiently associated groups of proteins (
). Therefore, network topological structure data are also integrated in our confidence scoring system. To identify the protein interactions with certain network topology, we combine the query interaction with the training protein interaction data to generate an integrated network for explanatory variables correlating with their confidences. For the interacting proteins 1 and 2, we define h1 and h2 as the number of their protein partners in the integrated network, respectively; h12 as the number of their shared protein partners; and Nfour as the number of the four-interaction loops in which an interaction participates. To measure the possibility of the interaction in three-interaction loops, we define hlog as Equation 1,
where P is the probability of the shared neighbor for proteins 1 and 2 (
), Cnk is the binomial coefficient for n choose k, and total is the number of all the proteins in the network. Fig. 2H shows that protein interactions in three- or four-interaction loops are of higher confidence, in agreement with the previous report (
), the posterior odds (Opost) of an interaction can be calculated as the product of the prior odds (Oprior), and the likelihood ratio LR(f) can be calculated by Equation 2,
where P(positive|f) is the probability that a pair of proteins interacts after considering the biological evidence f, whereas P(negative|f) is the probability that the pair does not interact. The prior odds are the ratio of the probability that a protein pair drawn from all protein pairs interacts to the probability that it does not, which can be estimated from the golden standard data sets (Equation 3).
The LR of biological evidence f is the ratio of the probability of observing f among the interacting protein pairs to the probability of observing f among the non-interacting protein pairs in the golden standard data sets. From Equations 2 and 3, the LR can be computed as
where T and F are the numbers of all the true and false interactions, respectively, and TPf and FPf are the numbers of true and false interactions with the biological evidence f, respectively. Bayesian rules permit us to integrate multiple heterogeneous data sources into a single probabilistic model. Because the biological data types integrated in PRINCESS are obtained by different approaches, we assume that they are conditionally independent. Therefore, we can obtain the composite LR (LRcomp) by simply multiplying the LRs from the individual sources, which constitutes a naïve Bayesian network (Equation 5).
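The relations described in this passage can be written compactly as below. This is a reconstruction from the in-text definitions, not a copy of the typeset equations; the numbering 2 to 5 is inferred from the references in the text, and the indexed symbols f_1, ..., f_n for the individual evidences are introduced here for illustration only.

```latex
\begin{align}
O_{\mathrm{post}} &= \frac{P(\mathrm{positive}\mid f)}{P(\mathrm{negative}\mid f)}
                   = O_{\mathrm{prior}} \times LR(f) \tag{2}\\
O_{\mathrm{prior}} &= \frac{P(\mathrm{positive})}{P(\mathrm{negative})} \tag{3}\\
LR(f) &= \frac{P(f\mid \mathrm{positive})}{P(f\mid \mathrm{negative})}
       = \frac{TP_f / T}{FP_f / F} \tag{4}\\
LR_{\mathrm{comp}} &= \prod_{i=1}^{n} LR(f_i) \tag{5}
\end{align}
```

The second equality in (2) is Bayes' rule in odds form, and the product in (5) holds only under the conditional independence assumption stated in the text.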
According to the Bayesian rules described above, during the assessment procedure PRINCESS first finds the supporting evidences for the query interaction and assigns it the corresponding LR values. If the biological evidences of the same data type give more than one LR, the maximum is retained. The naïve Bayesian network is then used to integrate these LRs from multiple types of data sources to generate LRcomp for confidence assessment (Fig. 1).
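The scoring procedure just described can be sketched in a few lines. The function names, the dictionary layout, and the handling of a bin with no false positives are assumptions made for illustration; the text does not say how such bins are treated.

```python
from math import prod


def likelihood_ratio(tp_f, fp_f, total_pos, total_neg):
    """LR(f) = (TP_f / T) / (FP_f / F), estimated from the golden standard sets."""
    if fp_f == 0:
        # Assumption: no smoothing is described in the text, so an evidence bin
        # without any false positive is reported as infinitely reliable here.
        return float("inf")
    return (tp_f / total_pos) / (fp_f / total_neg)


def composite_lr(lrs_by_data_type):
    """Combine LRs under the naive Bayes (conditional independence) assumption.

    lrs_by_data_type maps a data type name to the list of LRs found for the
    query interaction. Within one data type only the maximum LR is retained.
    """
    per_type_max = [max(lrs) for lrs in lrs_by_data_type.values() if lrs]
    # The empty product is 1, i.e. an interaction without supporting evidence
    return prod(per_type_max)


def posterior_odds(prior_odds, lr_comp):
    """Posterior odds as the product of the prior odds and the composite LR."""
    return prior_odds * lr_comp


# Hypothetical evidence found for one query interaction
evidence = {"interacting_domain": [2.0, 4.0], "go_coannotation": [3.0]}
print(composite_lr(evidence))  # 4.0 * 3.0
```

Because the prior odds are the same for every query, ranking interactions by the composite LR and ranking them by posterior odds give the same order.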
Receiver Operating Characteristic (ROC) Curve and Cross-validation
A ROC curve can show the efficacy of one test by presenting both sensitivity and specificity for different cutoff points (
). Sensitivity measures the ability of a test to identify true positives, and specificity measures its ability to reject negatives. These two features can be calculated as Sensitivity = TP/T and Specificity = 1 − (FP/F), where TP and FP are the numbers of identified true and false positives, respectively, and T and F are the total numbers of positives and negatives in a test. The ROC curves are plotted and smoothed by SPSS software with the sensitivity on the y axis and 1 − Specificity on the x axis (
To test the efficacy of the overall performance of various assessment models, the 5-fold cross-validation protocol is used. The golden standard positive and negative data sets are randomly divided into five approximately equal subsets. Four sets are used as training data sets to compute the likelihood ratios of the individual evidence. The remaining set is used as the test data set to count the number of predicted true positives (TP) and false positives (FP) where one protein pair is predicted to be positive if its likelihood ratio exceeds a particular cutoff, LRcutoff, and to be negative otherwise. This process is done in turn five times, and finally the number of TPs and FPs against different likelihood ratios across five test data sets are summed to calculate the TP/FP ratio and the sensitivity (TP/T) and specificity (1 − (FP/F)) for the ROC curve.
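The protocol above can be sketched as follows. The helper names and the fixed random seed are assumptions, and the step that trains the LRs on four folds is omitted; the sketch shows only how the folds are formed and how sensitivity and specificity follow from the LRs of the test pairs at a given LRcutoff.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k approximately equal random folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def sensitivity_specificity(pos_lrs, neg_lrs, lr_cutoff):
    """Sensitivity = TP/T and Specificity = 1 - FP/F at one LR cutoff.

    A pair is predicted positive if its LR exceeds the cutoff.
    """
    tp = sum(lr > lr_cutoff for lr in pos_lrs)
    fp = sum(lr > lr_cutoff for lr in neg_lrs)
    return tp / len(pos_lrs), 1 - fp / len(neg_lrs)


# Hypothetical LRs of golden standard positive and negative test pairs
print(sensitivity_specificity([0.5, 2.0, 3.0, 10.0], [0.1, 0.2, 1.5, 5.0], 1.0))
```

In the full procedure, TP and FP would be summed over the five test folds before the ratios are taken.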
Six Types of Biological Evidences Can Be Used to Assess the Confidence of Protein Interactions—
We use the golden standard positive and negative data sets to measure the reliability of each biological evidence. Fig. 2 shows their likelihood ratios LR(f) for each biological evidence f. In theory, LR(f) > 1 indicates that biological evidence f has the ability to identify the true protein interactions from the HTPID. As seen in Fig. 2, all six biological evidences have LRs greater than 1, suggesting that all of them can be used to assess the confidence of the protein interactions. From Fig. 2, we can also find that there are great differences between the reliability of these six data types. Interacting domain, function annotation, network topology structure, and model protein-protein interactions have higher reliability, whereas the reliability of the gene expression and genome context is relatively lower. There are also great differences between different data sets of the same data types. Therefore, it is reasonable to take differences of the data sets into account when combining them for confidence assessments.
The Combined Likelihood Ratio Can Be Used to Measure the Reliability of a Protein-Protein Interaction—
Because the prior odds are constant, the posterior odds are proportional to the LR. Therefore, the LR can theoretically measure the reliability of an interaction (Equation 2). To test this, during the 5-fold cross-validation against the golden standard data sets, we vary the LRcutoff and plot the ratio of true to false positives (TP/FP) as a function of the likelihood ratio cutoff in Fig. 3. TP/FP, which serves as a measure of the accuracy of a test, increases monotonically with the likelihood ratio cutoff, confirming that the combined likelihood ratio, like the individual likelihood ratios, can be used as an appropriate confidence score to measure the odds of a real interaction. This is the basis for the various assessment models, except the Simple Voting Model, described under “PRINCESS Has a Higher Sensitivity than Three Other Kinds of Assessment Models for Comparable Specificity.” A protein pair with an LR greater than 1 is considered to be supported by at least one biological evidence. Users can set a higher likelihood ratio threshold to retain only the higher confidence interactions.
PRINCESS Has a Higher Sensitivity than Three Other Kinds of Assessment Models for Comparable Specificity—
To compare the efficacy of PRINCESS with those other assessment methods in the previous literature (
), we first construct several assessment models simulating those methods. For the single evidence methods, we establish the Single Evidence Models, where the confidence of each protein interaction is assigned by the likelihood ratio of the confidence bin of the individual evidence. To simulate the method of Stelzl et al. (
), the Simple Voting Model is established where every protein interaction is assigned a confidence score by the number of its supported biological evidences (this is determined by whether the individual likelihood ratio of the feature is greater than LRcutoff). Especially to compare PRINCESS with the Bayesian model of Patil and Nakamura (
), we use only three biological evidences, “Interolog,” “Interacting Domain,” and “GO Coannotation” to construct the “Three Evidence Model.”
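The two comparison models can be sketched as below. The evidence keys and function names are hypothetical labels chosen for this illustration, and the input is assumed to hold one LR per evidence type for a single interaction.

```python
from math import prod

# Hypothetical keys for the three evidences used in the Three Evidence Model
THREE_EVIDENCES = ("interolog", "interacting_domain", "go_coannotation")


def simple_voting_score(lr_by_evidence, lr_cutoff=1.0):
    """Simple Voting Model: one vote per evidence whose LR exceeds the cutoff."""
    return sum(1 for lr in lr_by_evidence.values() if lr > lr_cutoff)


def three_evidence_lr(lr_by_evidence):
    """Three Evidence Model: naive Bayes product over three evidences only."""
    return prod(lr_by_evidence[name] for name in THREE_EVIDENCES
                if name in lr_by_evidence)


# Hypothetical LRs for one interaction
lrs = {"interolog": 5.0, "interacting_domain": 2.0,
       "go_coannotation": 0.5, "gene_coexpression": 1.5}
print(simple_voting_score(lrs), three_evidence_lr(lrs))
```

The voting score discards the magnitude of each LR, which is the loss of information that the Bayesian combination is meant to avoid.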
We also use the 5-fold cross-validation protocol to evaluate the performance of PRINCESS. The resulting ROC curves are illustrated in Fig. 4. Each point on the ROC curve of an assessment model denotes the sensitivity and specificity obtained from one test against a particular LRcutoff. The area under the ROC curve (AUC) is an indicator of the efficacy of the assessment system. An ideal test with perfect discrimination (100% sensitivity and 100% specificity) has an AUC of 1.0, whereas a non-informative prediction has an AUC of 0.5, which can be achieved by random guessing. The closer the AUC of a test is to 1.0, the higher the overall efficacy of the test. We find that our improved Bayesian model has an AUC of approximately 0.9, suggesting that it has a relatively high ability to identify the true interactions in the test data sets.
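As an illustration of how an AUC follows from LR scores, the sketch below builds the ROC points from every distinct score and integrates them with the trapezoidal rule. It is not the SPSS procedure used in the paper (no smoothing is applied), the function names are hypothetical, and using each distinct score as an inclusive threshold is a choice made here so that the curve ends at (1, 1).

```python
def roc_points(pos_lrs, neg_lrs):
    """Return (1 - specificity, sensitivity) points, one per distinct cutoff."""
    cutoffs = sorted(set(pos_lrs) | set(neg_lrs), reverse=True)
    points = [(0.0, 0.0)]
    for cutoff in cutoffs:
        # Inclusive threshold (>=), an assumption of this sketch
        tpr = sum(lr >= cutoff for lr in pos_lrs) / len(pos_lrs)
        fpr = sum(lr >= cutoff for lr in neg_lrs) / len(neg_lrs)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under a piecewise linear ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores with perfect separation of positives and negatives
print(auc(roc_points([3.0, 4.0], [1.0, 2.0])))
```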
Because the AUC is an indicator of the discriminatory power for the assessment system, here we also use it to compare the prediction efficacy of different assessment models. From Fig. 4, we notice that those Single Evidence Models have different AUC values in accord with the previous conclusion that there are great differences in their reliability and that the efficacy of these models is lower than that of multiple evidence models. The Simple Voting Model integrating multiple data sources also has relatively high efficacy. However, its efficacy is still lower than that of PRINCESS possibly because the Bayesian system considers the difference of biological evidences’ reliability. Especially, here we also compare the performance of PRINCESS with that of the Three Evidence Model (
). We find that the three extra features improve assessment efficacy significantly, although the evaluation ability of the individual gene coexpression and genome context evidence is relatively low (Fig. 4).
Improved Bayesian Model Can Predict True Interactions from Human High Throughput Data Sets—
The current version of PRINCESS is used mainly to predict true interactions from human high throughput protein interaction data sets. Authors of high throughput data sets usually assign a confidence level to interactions by experimental or bioinformatics approaches. Here we evaluate the reliability of two human HTPIDs by assigning LR values (
). Those high throughput interactions with LR > 2 are predicted as true interactions. We show the percentage of interactions predicted true across data sets of different confidence levels. As can be seen from Fig. 5, the higher the confidence assigned to a protein interaction data set in the literature, the higher the percentage of its interactions predicted as true, suggesting that PRINCESS is able to pick out the high confidence protein interactions from the HTPIDs.
PRINCESS Web Service—
The PRINCESS system is implemented under the UNIX environment with the PRINCESS data stored in the relational database for retrieval. Automated methods for searching and dynamically displaying the assessment result and the annotation are built with the combination of Perl (practical extraction and report language) and HTML (hypertext markup language). To improve the speed of analyses, parallel processing is used.
Using the PRINCESS Web service is a simple two-step process. In the first step, the user is asked to provide the properly formatted protein interaction data, which are pairs of gene or protein identifiers separated by comma. Now PRINCESS can accept two types of identifiers, the official gene symbols (
). The user may also select the desired biological features and appropriate parameters. The PRINCESS Web application accepts the query via a Web browser-based interface (Fig. 6A). In the second step, users are presented with a table of analytical results. LRs are given to indicate the confidence of each interaction. Interactions with LRs greater than 1 are those supported by at least one biological feature. The larger the LR value of a protein interaction, the more reliable the interaction is (Fig. 6B).
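A query in the format described above could be prepared and checked with a small helper such as the following. The function name is hypothetical, and the layout of one comma-separated pair per line is an assumption; the text specifies only that the two identifiers of a pair are separated by a comma.

```python
def parse_interaction_pairs(text):
    """Parse comma-separated identifier pairs, one pair per line (assumed layout)."""
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(
                f"line {line_no}: expected two identifiers separated by a comma"
            )
        pairs.append((parts[0], parts[1]))
    return pairs


# Example query using official gene symbols
print(parse_interaction_pairs("TP53,MDM2\nBRCA1, BARD1\n"))
```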
During the confidence assessment, each protein interaction will be annotated by multiple biological evidences. PRINCESS presents the detailed information via HTML pages (Fig. 6D). The hyperlink of each item in these pages will present abundant detailed information for the protein interactions. Because the figure views are more informative, PRINCESS presents the evaluated protein interaction and the linked interactions in figures, which can help users better understand the position of this protein interaction in the full human protein interaction network (Fig. 6C). In addition, PRINCESS can also illustrate some detailed information, such as the presentation of the three- or four-interaction loops and interacting domains, which are helpful for the user to understand the neighborhood and the structural basis for their query interactions (Fig. 6D). Here we use SVG (scalable vector graphics) language to program these figures because it allows additional functionalities such as zooming in without loss of resolution. Most importantly, SVG language can also link to the protein/interaction annotation molecule page by simply clicking its interactive elements.
Although biological discovery has benefited from large scale proteomics data, it is still a very big challenge to extract confident biological conclusions based on these data. One of the main reasons is the high proportion of “false positives” in these proteomics data (
). Analyses of these data often lack confidence information. In this study, we present a novel assessment system to evaluate the reliability of high throughput human protein interaction data. We first construct multiple biological evidences and use the LR to measure the reliability of each. Then we use a naïve Bayesian network to combine the individual evidences for confidence assessment. Cross-validation shows that this system has high sensitivity and good specificity. The system is also able to pick out true interactions from human HTPIDs.
Compared with previous assessment models, PRINCESS gives the best performance against the golden standard data sets. This advantage may result from the following points. 1) PRINCESS integrates more than one biological evidence; this can reduce the false positives and false negatives derived from a single evidence. 2) PRINCESS measures the reliability of each biological evidence by its LR, not by simple voting. 3) PRINCESS stratifies a data set into different confidence bins, improving its sensitivity in identifying true protein interactions. These improvements make PRINCESS more informative and more predictive.
Besides the confidence assessment, another advantage of PRINCESS over other similar tools is that it can be used to annotate the protein interactions from multiple aspects. During experiment design for exploring the function of one protein, for example, it is often important to find the biological evidence for the potential interactions in which the protein participates. PRINCESS can find these supporting evidences for the candidate interactions. To facilitate this strategy by the community, we have developed our strategy into a professional protein-protein interaction confidence assessment and annotation Web service that supports on-line query with multiple options, network visualization, and detailed information presentation.
In PRINCESS, we integrated multiple heterogeneous data sources. It is possible that some of the supporting evidences for an interaction are derived from the same “wet lab” experiment that produced the validated data set. During the 5-fold cross-validation, we therefore treat such interactions as if they lacked these evidences, so that they do not exaggerate the performance of PRINCESS.
In addition, we have applied PRINCESS not only to human but also to yeast, and we achieved equally good performance (because of the page limitation, the details of the yeast results are presented in supplemental Fig. S3). However, we have to admit that PRINCESS, as a bioinformatic tool that depends heavily on the current biological data sets, might make wrong decisions for true interactions supported by few or none of these six evidences, because some less studied genes/proteins may currently lack certain biological information, such as abundance, biological process, cellular localization, or interaction domain.
However, the Bayesian model used in PRINCESS permits us to integrate additional heterogeneous biological data efficiently. We are now planning to integrate genetic interaction, phylogenetic distance, and experimental data into PRINCESS. With more biological data sources (depending on the development of biological technology) and more types of evidences integrated, PRINCESS should achieve better performance, and its “false negatives” should gradually be reduced.
We thank Christian von Mering for kindly supplying the STRING data sets; Songfeng Wu, Lei Dou, Hao Guo, Jianqi Li, and Lin Hou for fruitful discussions; and Dongsheng Li for hardware and software supports. We also thank two anonymous reviewers for helpful comments.
Published, MCP Papers in Press, January 29, 2008, DOI 10.1074/mcp.M700287-MCP200
This work was supported by the Chinese National Key Program of Basic Research (Grants 2006CB910800 and 2006CB910700), the National High Technology Research and Development Program of China (Grant 2006AA02A312), and National Natural Science Foundation of China (Grant 30621063). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.