Proteome-wide Prediction of Signal Flow Direction in Protein Interaction Networks Based on Interacting Domains

Signal flow direction is one of the most important features of the protein-protein interactions in signaling networks. However, almost all the outcomes of current high-throughput techniques for mapping protein-protein interactions are assumed to be non-directional. Based on pairwise interacting domains, we defined a novel parameter, the protein interaction directional score, and used it to predict the direction of signal flow between proteins in proteome-wide signaling networks. Using 5-fold cross-validation, our approach achieved satisfactory performance, with accuracy 89.79%, coverage 48.08%, and error ratio 16.91%. As an application, we established an integrated human directional protein interaction network, including 2,237 proteins and 5,530 interactions, and inferred a large number of novel signaling pathways. The directional protein interaction network was strongly supported by the known signaling pathways in the literature (87.5% accuracy) and by further analyses of biological annotation, subcellular localization, and network topology. Thus, this study provides an effective method to define the upstream/downstream relations of interacting protein pairs and a powerful tool to unravel unknown signaling pathways.

The development of high-throughput technologies has produced large-scale protein interaction data for multiple species, and significant efforts have been made to analyze the data in order to establish protein networks and to understand their functions (1)(2)(3)(4)(5)(6). In protein interaction networks, physical interactions are usually assumed to be non-directional. In fact, regulatory and upstream/downstream relations are widespread between interacting proteins when they are involved in networks of signal transduction, transcriptional regulation, cell cycle, or metabolism.
Several groups have developed methods to infer signaling pathways based on protein interactions. Steffen et al. (7) presented an automated approach for modeling signal transduction networks in Saccharomyces cerevisiae by integrating protein interactions and gene expression data. Hautaniemi et al. (8) applied a decision tree approach to facilitate elucidation of signal-response cascade relations and generate experimentally testable predictions. Shlomi et al. (9) established a comprehensive framework, QPath, which efficiently searches the network for homologous pathways. These evolutionarily conserved pathways provided clues to infer the upstream/downstream relations of protein interactions in unknown signaling pathways. Nguyen et al. (10) proposed a method for predicting signaling domain-domain interactions using inductive logic programming and for discovering signal transduction networks in yeast. However, all the methods mentioned above focused on the automated generation of signaling pathways using PPI and gene expression data. They could recover only a part of the signaling pathways, with limited length and low accuracy, and only in the simple eukaryote yeast. Meanwhile, there was no efficient method to identify the direction of signal flow between pairwise interacting proteins.
Domains are the structural and functional elements of proteins. Most proteins interact with each other through their domains. Therefore, it is crucial and useful to understand PPIs based on domains (11,12). In this paper, we introduce a novel method to predict the direction of signal flow through protein pairs in signaling networks according to their constituent interacting domains. First, we defined a measure F to evaluate the direction of domain interactions and computed the F values of domain interactions using a training set of PPIs in multiple species. Then, we defined a novel parameter, the Protein Interaction Directional Score (PIDS), to measure the directions of protein interactions. Furthermore, we evaluated the method using a 5-fold cross-validation protocol. Finally, we applied it to infer novel signaling pathways from human proteome-wide interactions.

MATERIALS AND METHODS
High-confidence Domain Interaction Dataset-A total of 6,163 high-confidence domain interactions were downloaded from DOMINE (13), including 4,349 interactions inferred from PDB entries and 3,143 interactions predicted by eight different computational approaches, using Pfam domain definitions. In this paper, these domain interactions were examined to discover their directions.
Protein Interaction Dataset in Signaling Networks-All the signaling networks of human, mouse, rat, fly, and yeast were downloaded from KEGG (14). There are 2,803 protein interactions involved in activation, inhibition, phosphorylation, dephosphorylation, and ubiquitination and 649 protein complexes. Protein domain information is based on the Pfam-A domains (15).
Integrated Human Protein Interaction Dataset-We obtained 45,238 non-redundant human protein interactions from the HPRD (6), DIP (16), MINT (17), and BIND (18) databases and from previous resources (19,20). We kept only interactions that have corresponding Entrez Gene ID indexes and are not reported as part of a protein complex. These interactions were obtained by experiments and do not contain prediction results; together they compose the integrated human protein interaction dataset. The dataset is relatively credible and comprehensive and covers most of the human protein interactions detected so far (21).
The Function F for the Direction Prediction of Domain Interactions-The enrichment of domain pairs was assessed with the domain enrichment ratio D (22), which is calculated as the probability (Pr) of a given pair of domains in a set of known interacting proteins divided by the product of the probabilities of each domain of the pair independently,

$$D(d_{mn}) = \frac{\Pr(d_m, d_n)}{\Pr(d_m)\,\Pr(d_n)}$$

Based on D, we proposed a novel function F to indicate the direction of interacting domain pairs, which is defined by subtracting the backward domain enrichment ratio from the forward ratio,

$$F(d_{mn}) = D(d_{mn}) - D(d_{nm})$$

where d_mn denotes the domain pair in the forward order (d_m upstream of d_n) and d_nm the same pair in the backward order.
The Parameter PIDS for the Direction Prediction of Protein Interactions-Given two interacting proteins P_i and P_j, if the signal transfers from P_i to P_j in signaling networks, then P_ij > 0; otherwise P_ij < 0. On the basis of the F function in domain interactions, the parameter PIDS for the interaction from P_i to P_j is

$$\mathrm{PIDS}_{ij} = \frac{1}{N_{ij}} \sum_{d_{mn} \in P_{ij}} F(d_{mn})$$

where d_mn ∈ P_ij denotes that domains d_m and d_n belong to proteins P_i and P_j, respectively, and domain d_m can interact with domain d_n. N_ij is the number of domain interactions between P_i and P_j whose absolute F values are greater than zero. Given a threshold t for PIDS, if PIDS_ij > t and P_ij > 0, or PIDS_ij < −t and P_ij < 0, then the direction from P_i to P_j is correctly predicted.
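The definitions of D, F, and PIDS in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the frequency-based estimate of Pr (counts over the directed training interactions), and the reading of PIDS as the mean of the non-zero F values over domain pairs of the two proteins are all assumptions made for the example. The optional `known_ddi` filter stands in for the restriction to known interacting domain pairs.

```python
from collections import Counter
from itertools import product


def domain_direction_scores(directed_ppis, domains):
    """Return {(d_m, d_n): F} for ordered domain pairs seen in the training set.

    directed_ppis: iterable of (upstream_protein, downstream_protein)
    domains: dict mapping protein id -> set of domain ids
    Assumption: Pr is estimated as a simple frequency over training interactions.
    """
    pair_counts = Counter()    # (upstream domain, downstream domain) -> count
    domain_counts = Counter()  # domain -> number of interactions it occurs in
    n = 0
    for up, down in directed_ppis:
        n += 1
        up_doms, down_doms = domains.get(up, set()), domains.get(down, set())
        domain_counts.update(up_doms | down_doms)
        pair_counts.update(product(up_doms, down_doms))

    def enrichment(dm, dn):
        # D = Pr(d_m, d_n) / (Pr(d_m) * Pr(d_n)) for the ordered pair
        joint = pair_counts[(dm, dn)] / n
        return joint / ((domain_counts[dm] / n) * (domain_counts[dn] / n))

    scores = {}
    for dm, dn in list(pair_counts):
        f = enrichment(dm, dn) - enrichment(dn, dm)  # forward minus backward
        scores[(dm, dn)] = f
        scores[(dn, dm)] = -f
    return scores


def pids(p_i, p_j, domains, f_scores, known_ddi=None):
    """Mean of non-zero F over domain pairs of (p_i, p_j); None if there are none."""
    values = []
    for dm, dn in product(domains.get(p_i, set()), domains.get(p_j, set())):
        if known_ddi is not None and frozenset((dm, dn)) not in known_ddi:
            continue  # keep only domain pairs known to interact
        f = f_scores.get((dm, dn), 0.0)
        if f != 0:
            values.append(f)
    return sum(values) / len(values) if values else None


def predict_direction(score, t=2.0):
    """+1: p_i -> p_j, -1: p_j -> p_i, 0: no call at threshold t."""
    if score is None or abs(score) <= t:
        return 0
    return 1 if score > 0 else -1
```

With a toy training set in which a hypothetical "kinase" domain is always upstream of a hypothetical "substrate" domain, F is positive for the forward pair and negative for the reverse, and PIDS inherits the sign for any protein pair carrying those domains.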

Signaling Directions of Protein Interactions and Domain Interactions-In the signaling networks, the direction of a protein interaction is defined as the direction of signal flow between the two proteins. The interaction types we investigated include activation, inhibition, phosphorylation, dephosphorylation, and ubiquitination, which are all direction-related. In human, rat, mouse, fly, and yeast, 76.40% of proteins have one or more Pfam domains. Interaction between two proteins typically involves binding between specific domains. Thus, the identification of interacting domain pairs is an important step toward understanding protein interactions (23) and the direction of signal flow between them. We therefore assumed that upstream/downstream relations exist in domain interactions as they do in protein interactions. Fig. 1, A and B illustrates the method of inferring the directions of domain interactions from the directions of protein interactions.
A Novel Function to Measure Signaling Direction of Domain Interaction-We defined a function F to evaluate the direction of domain interactions and applied it to discover the directions in the high-confidence domain interaction dataset. Using 2,803 protein interactions with known directions as the positive training set, 364 domain interactions were found to be involved in the protein interactions of signaling networks. Of these, 286 domain pairs (78.57%) have positive or negative F values (supplemental Table S1). The distribution of their absolute F values is shown in Fig. 1C, with mean value 39.84 and standard error (S. E.) 7.39. These values provide clues about the upstream and downstream relation between a given protein pair in signaling networks. The domain interaction with the largest F is that between ubiquitin-activating enzyme and ubiquitin-conjugating enzyme, with an F value as high as 1,658.34.
A Novel Parameter to Predict the Signaling Direction of Protein Interactions Based on Domain Interactions-Based on the F values of domain interactions, we defined a parameter PIDS to measure the signal flow direction of the corresponding protein interaction. According to the domains of each protein pair, we computed the PIDS values in the protein interaction dataset of signaling networks (Fig. 1D). In principle, the protein interactions categorized as activation, inhibition, phosphorylation, dephosphorylation, and ubiquitination are directional, whereas protein complexes, assumed to be non-directional, are used as controls.
Evaluation of the Method using 5-fold Cross-Validation-Using 5-fold cross-validation, we evaluated the performance of this method, with protein pairs of known direction as the positive set and protein complexes as the negative test set. When a protein pair in the test set does not include any directional domain interaction, it is impossible to predict the signaling direction of that protein interaction with this method. Coverage is estimated as the number of test protein pairs for which a direction can be predicted, divided by the total number of test protein interactions. Accuracy is the percentage of predicted protein pairs whose directions are correctly predicted. Error ratio is defined as the percentage of protein pairs in the negative test set that are falsely predicted to have a direction.
By selecting different threshold values for PIDS, we compared the accuracy, coverage, and error ratio in the positive and negative test sets, as shown in Fig. 2A. With a threshold value of 2, accuracy, coverage, and error ratio are 89.79%, 48.08%, and 16.91%, respectively. When the threshold value is raised to 10, accuracy increases to 94.19%, coverage drops to 28.21%, and error ratio falls to 1.82%. Increasing the threshold thus gives higher accuracy and a lower error ratio at the cost of smaller predictive capacity. In practice, users can choose different PIDS thresholds to meet their own requirements.
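The three evaluation measures and their dependence on the threshold can be illustrated with a short sketch. The function below is hypothetical and rests on assumptions about the denominators, which the text does not fully specify: coverage is taken over all positive test pairs, accuracy over the pairs that receive a call, and error ratio over all negative (complex) pairs. Positive scores are assumed to be oriented so that the true direction corresponds to a positive PIDS, and `None` marks pairs with no directional domain interaction.

```python
def evaluate(positive_scores, negative_scores, t=2.0):
    """Return (accuracy, coverage, error_ratio) at PIDS threshold t.

    positive_scores: PIDS of test pairs oriented in their true direction
                     (correct call means PIDS > t); None if not scorable.
    negative_scores: PIDS of non-directional control pairs; None if not scorable.
    """
    called = [s for s in positive_scores if s is not None and abs(s) > t]
    correct = [s for s in called if s > t]
    coverage = len(called) / len(positive_scores)
    accuracy = len(correct) / len(called) if called else float("nan")
    false_calls = [s for s in negative_scores if s is not None and abs(s) > t]
    error_ratio = len(false_calls) / len(negative_scores)
    return accuracy, coverage, error_ratio


# Toy example: raising t trades coverage for accuracy and a lower error ratio.
pos = [5.0, 3.0, -4.0, 1.0, None]
neg = [0.5, 3.0, None, -1.0]
for threshold in (2.0, 4.5):
    print(threshold, evaluate(pos, neg, t=threshold))
```

Sweeping `t` over a grid and plotting the three values would give curves of the kind summarized in Fig. 2A.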

Performance of Cross-Species Prediction for Directional Protein-Protein Interactions-Furthermore, we compared the accuracy and coverage across different species to evaluate the performance of this method, as shown in Fig. 2 (B and C). Since the five species share most Pfam domains, it is feasible to predict the signal flow of PPIs in one species from the directional PPIs in other species: 84.06% of domains in human, 89.36% in mouse, 89.30% in rat, 82.07% in fly, and 64.76% in yeast proteins can be found in the other four species. Taking the protein interactions of one species as the test set and those of all the other species as the training set, we identified the set of directional domain-domain interactions from the training set and used these domain-domain interactions to predict the directional protein-protein interactions in the test set. We compared the prediction accuracy and coverage among the five species (human, mouse, rat, fly, and yeast). Overall, the method performed better in the more evolutionarily advanced species. With a threshold value of 2, we achieved accuracy 95.23% and coverage 49.54% in the human test set.
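The cross-species protocol described here amounts to a leave-one-species-out split. A minimal sketch, assuming the directed interactions are held in a dictionary keyed by species name (the data layout and the function name are illustrative, not taken from the paper):

```python
def leave_one_species_out(ppis_by_species):
    """Yield (held_out_species, training_pairs, test_pairs) for each species.

    ppis_by_species: dict mapping species name -> list of
                     (upstream_protein, downstream_protein) pairs.
    """
    for held_out in ppis_by_species:
        train = [
            pair
            for species, pairs in ppis_by_species.items()
            if species != held_out
            for pair in pairs
        ]
        yield held_out, train, list(ppis_by_species[held_out])
```

Each training list would be used to compute the domain-level F values, and the held-out species would then be scored with PIDS at the chosen threshold.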

Proteome-wide Prediction of Signal Flow Directions in Human Protein Interaction Network-As an application, we used this method to comprehensively predict signal flow directions of proteome-wide protein interactions in the integrated human protein interaction dataset (see under "Materials and Methods"). Of the 45,238 integrated human protein interactions, 5,530 are predicted to have a direction at a PIDS threshold of 2 (supplemental Table S2). Of these, 424 (7.67%) are reported in the known signaling pathway databases. The predicted directions of 87.5% (371/424) of these protein interactions agree with those from the KEGG (14), BioCarta, or NCI-Nature_curated databases, indicating again that the method is highly accurate. The remaining 5,106 directional protein interactions, predicted here for the first time, should be a valuable resource (Fig. 3A).
As a result, we established the first predicted human directional protein interaction network (DPIN), including 2,237 proteins and 5,530 interactions, with PIDS 19.97 ± 0.57 (S. E.) (Fig. 3B). We found that the distributions of PIDS in the human PPI training set and in the DPIN are similar (Fig. 3C). Furthermore, we characterized the DPIN in terms of biological function, subcellular localization, and topological properties as follows.
Classified by interaction detection method, 22,278 protein interactions in the integrated human protein interaction dataset were detected by only one method, such as coimmunoprecipitation, tandem affinity purification, or two-hybrid (Table IA). Comparing the PPIs detected in vivo with those detected in vitro, we found that 11.30% of in vivo PPIs and 15.14% of in vitro PPIs were directional. Among these methods, PPIs from the protein array method have the highest directional ratio (27.03%), whereas those from coimmunoprecipitation have the lowest (3.41%), implying that coimmunoprecipitation is relatively weak at detecting directional protein interactions in signaling networks.
Functional Annotation for the Directional Human Protein Interactions-We annotated protein functions using Gene Ontology (24). In the integrated human protein interaction dataset, 20.9% of proteins (2,120/10,146) are annotated as signal transduction, whereas 46.0% of proteins (1,029/2,237) in the DPIN are classified as signal transduction. This demonstrates a significant enrichment of signal transducers in the DPIN (p = 1.67 × 10^-212) and supports the predictive capacity of our method.
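The text does not state which statistical test produced the enrichment p value. One common choice for this kind of comparison is a one-sided hypergeometric test, sketched below with exact integer arithmetic from the standard library. The choice of test, the function name, and the assumption that the DPIN proteins are a subset of the 10,146 background proteins are all assumptions of this example, so the result need not reproduce the reported p value.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(population, annotated, sample, observed):
    """P(X >= observed) when drawing `sample` items without replacement from
    `population` items, of which `annotated` carry the annotation."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    # Exact rational, converted once so very small tails do not underflow early.
    return float(Fraction(tail, total))


# Small worked case: 10 proteins, 4 annotated, draw 3, at least 2 annotated.
print(hypergeom_upper_tail(10, 4, 3, 2))
```

With the counts given in the text, the call would be `hypergeom_upper_tail(10146, 2120, 2237, 1029)`.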
Subcellular Localization for Directional Human Protein Interactions-Using PA-SUB (25), we marked the subcellular localization of proteins in the human DPIN. Table IB indicates that the predicted directions of protein interactions run mostly from the exterior of the cell toward the nucleus: protein interactions directed from extracellular to cytoplasm are significantly more numerous than those from cytoplasm to extracellular, and those from cytoplasm to nucleus more numerous than those from nucleus to cytoplasm. The global pattern of signal flow predicted by our method therefore agrees with the general rule that signaling pathways proceed from the outside to the inside of cells.
In addition, we examined the protein interactions through which the signal flows in reverse, from the inside to the outside of cells, including those from plasma membrane to extracellular, cytoplasm to plasma membrane, cytoplasm to extracellular, and nucleus to cytoplasm. In total, we found 390 such protein interactions, with mean PIDS 16.35 ± 1.43 (S. E.) (supplemental Table S3). These protein interactions might play roles in feedback regulation of signal transduction.
Topology Property of the Human DPIN-Using MFinder 1.2 (26), we computed the topological motifs of the human DPIN and found that a large number of 3-node and 4-node motifs are significantly enriched in this network (Table IC). Intriguingly, most of the significantly abundant motifs are feed-forward loops rather than feedback loops. Feed-forward loops have been reported to be widely present in signaling networks but absent in transcription networks (27), and they can combine to form multi-layer perceptron motifs composed of three or more layers of signaling proteins (28). Such patterns can potentially carry out elaborate functions on multiple input signals and show graceful degradation of performance upon loss of components (29,30).
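To make the distinction between feed-forward and feedback loops concrete, the sketch below counts the two simplest 3-node patterns in a directed edge list. It is a plain occurrence count written for illustration, not the MFinder algorithm: it does not restrict to induced subgraphs and does not compare against randomized networks, so it says nothing about statistical enrichment. The function name is hypothetical.

```python
def count_three_node_loops(edges):
    """Return (feed_forward_loops, feedback_cycles) for a directed edge list.

    Feed-forward loop: a -> b, b -> c, and a -> c.
    Feedback cycle:    a -> b, b -> c, and c -> a.
    """
    succ = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            succ.setdefault(u, set()).add(v)
            succ.setdefault(v, set())
    feed_forward = 0
    cycles = 0
    for a in succ:
        for b in succ[a]:
            for c in succ[b]:
                if c == a:
                    continue
                if c in succ[a]:
                    feed_forward += 1
                if a in succ[c]:
                    cycles += 1  # each cycle is seen once per rotation
    return feed_forward, cycles // 3


print(count_three_node_loops(
    [("x", "y"), ("y", "z"), ("x", "z"), ("p", "q"), ("q", "r"), ("r", "p")]
))  # → (1, 1)
```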
Novel Signaling Pathways Inferred from the Human DPIN-From the human DPIN, we inferred a large number of novel signaling pathways. We defined as inputs the extracellular proteins that only deliver signals to, but do not accept signals from, other proteins, and as outputs the nuclear proteins that only accept signals from, but do not deliver signals to, any other protein. This gave 292 input proteins and 219 output proteins. We then searched all possible pathways from the input layer to the output layer and generated 973,628 pathways with mean PIDS 14.66 ± 0.01 (S. E.) and average path length 8.61; the pathway from TNFRSF21 to UBB has the highest mean PIDS, 122.63. The number of pathways is very large because signaling networks are characterized by multiple parallel pathways, and only some of these pathways have been verified biologically. We paid special attention to the shortest pathways between the inputs and outputs. As a result, we found 1,457 novel signaling pathways (supplemental Table S4) and present the top 10 predicted pathways (Fig. 4). We also compared the average PIDS and path length of the shortest pathways and found a significant negative correlation between them, with Spearman correlation coefficient −0.315 (p = 10^-6), suggesting that shorter pathways tend to have stronger signal flow directions.
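The shortest-pathway search between input and output proteins can be sketched as a breadth-first search over the directed network. This is an illustrative sketch, not the authors' procedure: the function names are hypothetical, and it returns a single shortest path per input/output pair, whereas several equally short paths may exist. A helper averages PIDS along a path, assuming the scores are stored per directed edge.

```python
from collections import deque


def shortest_pathways(edges, inputs, outputs):
    """Return {(source, target): path} with one shortest directed path per pair."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    outputs = set(outputs)
    found = {}
    for src in inputs:
        parent = {src: None}  # doubles as the visited set
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in succ.get(node, ()):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        for dst in outputs:
            if dst in parent and dst != src:
                path, cur = [], dst
                while cur is not None:  # walk back to the source
                    path.append(cur)
                    cur = parent[cur]
                found[(src, dst)] = path[::-1]
    return found


def mean_path_pids(path, edge_pids):
    """Average PIDS over consecutive edges of a path (edge_pids keyed by (u, v))."""
    steps = list(zip(path, path[1:]))
    return sum(edge_pids[step] for step in steps) / len(steps)
```

For example, with edges L→R, R→K, K→T, and L→K, input L, and output T, the search returns the two-step path L, K, T rather than the three-step path through R.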

DISCUSSION
The direction of protein interactions is a prerequisite for forming signaling networks. We proposed a method to infer the signaling directions of protein interactions based on their constituent domains. Compared with previous studies (7-10), our method focuses on predicting the direction between pairwise interacting proteins, which is easier to evaluate. In particular, the method can be applied to predict signal flow direction in proteome-wide protein interactions and to provide a global directional annotation of the protein interaction network. The method is useful not only for defining unknown directions of protein interactions but also for providing comprehensive insight into signaling networks.
The method was successfully applied to establish a novel human DPIN, which was strongly supported by the highly accurate prediction of known signaling pathways and by further analysis of biological annotation, subcellular localization, and topological properties. The predicted directional proteins are significantly enriched in signal transduction, and the global directions of protein interactions agree with the general rules of signaling networks. Based on the DPIN, we uncovered several interesting features of directional protein interaction networks: the direction of signal flow through protein interactions frequently runs from the outside to the inside of cells; feed-forward loops are more widespread than feedback loops; and shorter pathways tend to have stronger signal flow directions. These conclusions are drawn from an incomplete human dataset and may therefore be biased, but no complete protein interaction network is available so far. Our conclusions based on the current dataset can nevertheless suggest the topological properties of human protein-protein interaction networks. As more protein interactions and domain interactions are discovered, our method can be applied to find more signaling pathways and to further validate the features of signaling networks reported above.

FIG. 4. Top 10 pathways inferred from the human DPIN. Extracellular proteins are marked with diamonds, cytoplasmic proteins with ellipses, and nuclear proteins with triangles. Protein interactions with 10 > |PIDS| ≥ 2 are drawn in blue at single line width, those with |PIDS| ≥ 10 in red at double width, and known PPIs in signaling databases in black at double width.