
PRINCESS, a Protein Interaction Confidence Evaluation System with Multiple Data Sources

  • Dong Li ‡
  • Wanlin Liu ‡
  • Zhongyang Liu ‡
  • Jian Wang
  • Qijun Liu
  • Yunping Zhu (to whom correspondence may be addressed. Tel.: 86-10-80705999; Fax: 86-10-80705155)
  • Fuchu He (to whom correspondence may be addressed. Tel.: 86-10-68171208; Fax: 86-10-68214653)
    Affiliation (all authors)
    The State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, 100850 Beijing, China
    Author Footnotes
    ‡ These authors contributed equally to this work.
      Advances in proteomics technologies have enabled novel protein interactions to be detected at high speed, but at the expense of relatively low data quality. A crucial step in using high throughput protein interaction data is therefore to evaluate their confidence and to separate the subsets of reliable interactions from the background noise for further analyses. Using Bayesian network approaches, we combine multiple heterogeneous types of biological evidence, including model organism protein-protein interactions, interaction domains, functional annotation, gene expression, genome context, and network topology, to assign reliability scores to the human protein-protein interactions identified by high throughput experiments. This method shows high sensitivity and specificity in predicting true interactions from human high throughput protein-protein interaction data sets. It has been developed into an on-line confidence scoring system specifically for human high throughput protein-protein interactions. Users may submit their protein-protein interaction data on line, and the confidence scores are returned together with detailed information about the evidence supporting each query interaction. The Web interface of PRINCESS (protein interaction confidence evaluation system with multiple data sources) is available at the website of China Human Proteome Organisation.
      Protein-protein interactions play important roles in defining most cellular functions (
      • Hartwell L.H.
      • Hopfield J.J.
      • Leibler S.
      • Murray A.W.
      From molecular to modular cell biology.
      ,
      • Bray D.
      Molecular networks: the top-down view.
      ). Traditionally, protein interactions have been studied individually by top-down, hypothesis-driven approaches, with experiments designed to derive high quality, detailed interaction information. Recently, advances in proteomics technologies have enabled a large number of novel protein interactions to be detected at unprecedented speed by yeast two-hybrid screens (
      • Uetz P.
      • Giot L.
      • Cagney G.
      • Mansfield T.A.
      • Judson R.S.
      • Knight J.R.
      • Lockshon D.
      • Narayan V.
      • Srinivasan M.
      • Pochart P.
      • Qureshi-Emili A.
      • Li Y.
      • Godwin B.
      • Conover D.
      • Kalbfleisch T.
      • Vijayadamodar G.
      • Yang M.
      • Johnston M.
      • Fields S.
      • Rothberg J.M.
      A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.
      ,
      • Ito T.
      • Chiba T.
      • Ozawa R.
      • Yoshida M.
      • Hattori M.
      • Sakaki Y.
      A comprehensive two-hybrid analysis to explore the yeast protein interactome.
      ,
      • Giot L.
      • Bader J.S.
      • Brouwer C.
      • Chaudhuri A.
      • Kuang B.
      • Li Y.
      • Hao Y.L.
      • Ooi C.E.
      • Godwin B.
      • Vitols E.
      • Vijayadamodar G.
      • Pochart P.
      • Machineni H.
      • Welsh M.
      • Kong Y.
      • Zerhusen B.
      • Malcolm R.
      • Varrone Z.
      • Collis A.
      • Minto M.
      • Burgess S.
      • McDaniel L.
      • Stimpson E.
      • Spriggs F.
      • Williams J.
      • Neurath K.
      • Ioime N.
      • Agee M.
      • Voss E.
      • Furtak K.
      • Renzulli R.
      • Aanensen N.
      • Carrolla S.
      • Bickelhaupt E.
      • Lazovatsky Y.
      • DaSilva A.
      • Zhong J.
      • Stanyon C.A.
      • Finley Jr., R.L.
      • White K.P.
      • Braverman M.
      • Jarvie T.
      • Gold S.
      • Leach M.
      • Knight J.
      • Shimkets R.A.
      • McKenna M.P.
      • Chant J.
      • Rothberg J.M.
      A protein interaction map of Drosophila melanogaster.
      ,
      • Li S.
      • Armstrong C.M.
      • Bertin N.
      • Ge H.
      • Milstein S.
      • Boxem M.
      • Vidalain P.O.
      • Han J.D.
      • Chesneau A.
      • Hao T.
      • Goldberg D.S.
      • Li N.
      • Martinez M.
      • Rual J.F.
      • Lamesch P.
      • Xu L.
      • Tewari M.
      • Wong S.L.
      • Zhang L.V.
      • Berriz G.F.
      • Jacotot L.
      • Vaglio P.
      • Reboul J.
      • Hirozane-Kishikawa T.
      • Li Q.
      • Gabel H.W.
      • Elewa A.
      • Baumgartner B.
      • Rose D.J.
      • Yu H.
      • Bosak S.
      • Sequerra R.
      • Fraser A.
      • Mango S.E.
      • Saxton W.M.
      • Strome S.
      • Van Den Heuvel S.
      • Piano F.
      • Vandenhaute J.
      • Sardet C.
      • Gerstein M.
      • Doucette-Stamm L.
      • Gunsalus K.C.
      • Harper J.W.
      • Cusick M.E.
      • Roth F.P.
      • Hill D.E.
      • Vidal M.
      A map of the interactome network of the metazoan C. elegans.
      ,
      • Stelzl U.
      • Worm U.
      • Lalowski M.
      • Haenig C.
      • Brembeck F.H.
      • Goehler H.
      • Stroedicke M.
      • Zenkner M.
      • Schoenherr A.
      • Koeppen S.
      • Timm J.
      • Mintzlaff S.
      • Abraham C.
      • Bock N.
      • Kietzmann S.
      • Goedde A.
      • Toksoz E.
      • Droege A.
      • Krobitsch S.
      • Korn B.
      • Birchmeier W.
      • Lehrach H.
      • Wanker E.E.
      A human protein-protein interaction network: a resource for annotating the proteome.
      ,
      • Rual J.F.
      • Venkatesan K.
      • Hao T.
      • Hirozane-Kishikawa T.
      • Dricot A.
      • Li N.
      • Berriz G.F.
      • Gibbons F.D.
      • Dreze M.
      • Ayivi-Guedehoussou N.
      • Klitgord N.
      • Simon C.
      • Boxem M.
      • Milstein S.
      • Rosenberg J.
      • Goldberg D.S.
      • Zhang L.V.
      • Wong S.L.
      • Franklin G.
      • Li S.
      • Albala J.S.
      • Lim J.
      • Fraughton C.
      • Llamosas E.
      • Cevik S.
      • Bex C.
      • Lamesch P.
      • Sikorski R.S.
      • Vandenhaute J.
      • Zoghbi H.Y.
      • Smolyar A.
      • Bosak S.
      • Sequerra R.
      • Doucette-Stamm L.
      • Cusick M.E.
      • Hill D.E.
      • Roth F.P.
      • Vidal M.
      Towards a proteome-scale map of the human protein-protein interaction network.
      ) and tandem affinity purification (
      • Ho Y.
      • Gruhler A.
      • Heilbut A.
      • Bader G.D.
      • Moore L.
      • Adams S.L.
      • Millar A.
      • Taylor P.
      • Bennett K.
      • Boutilier K.
      • Yang L.
      • Wolting C.
      • Donaldson I.
      • Schandorff S.
      • Shewnarane J.
      • Vo M.
      • Taggart J.
      • Goudreault M.
      • Muskat B.
      • Alfarano C.
      • Dewar D.
      • Lin Z.
      • Michalickova K.
      • Willems A.R.
      • Sassi H.
      • Nielsen P.A.
      • Rasmussen K.J.
      • Andersen J.R.
      • Johansen L.E.
      • Hansen L.H.
      • Jespersen H.
      • Podtelejnikov A.
      • Nielsen E.
      • Crawford J.
      • Poulsen V.
      • Sorensen B.D.
      • Matthiesen J.
      • Hendrickson R.C.
      • Gleeson F.
      • Pawson T.
      • Moran M.F.
      • Durocher D.
      • Mann M.
      • Hogue C.W.
      • Figeys D.
      • Tyers M.
      Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry.
      ,
      • Gavin A.C.
      • Bosche M.
      • Krause R.
      • Grandi P.
      • Marzioch M.
      • Bauer A.
      • Schultz J.
      • Rick J.M.
      • Michon A.M.
      • Cruciat C.M.
      • Remor M.
      • Hofert C.
      • Schelder M.
      • Brajenovic M.
      • Ruffner H.
      • Merino A.
      • Klein K.
      • Hudak M.
      • Dickson D.
      • Rudi T.
      • Gnau V.
      • Bauch A.
      • Bastuck S.
      • Huhse B.
      • Leutwein C.
      • Heurtier M.A.
      • Copley R.R.
      • Edelmann A.
      • Querfurth E.
      • Rybin V.
      • Drewes G.
      • Raida M.
      • Bouwmeester T.
      • Bork P.
      • Seraphin B.
      • Kuster B.
      • Neubauer G.
      • Superti-Furga G.
      Functional organization of the yeast proteome by systematic analysis of protein complexes.
      ). Compared with the traditional approaches, high throughput approaches tend to produce error-prone data sets. For example, von Mering et al. (
      • von Mering C.
      • Krause R.
      • Snel B.
      • Cornell M.
      • Oliver S.G.
      • Fields S.
      • Bork P.
      Comparative assessment of large-scale data sets of protein-protein interactions.
      ) estimated that approximately half of the interactions obtained from high throughput experiments might be false positives. These false positives may connect unrelated proteins, complicating and even misleading the interpretation of biological significance (
      • Lin N.
      • Zhao H.
      Are scale-free networks robust to measurement errors?.
      ,
      • Han J.D.
      • Dupuy D.
      • Bertin N.
      • Cusick M.E.
      • Vidal M.
      Effect of sampling on topology predictions of protein-protein interaction networks.
      ). Therefore, a crucial step in analyzing a high throughput protein interaction data set (HTPID)
      The abbreviations used are: HTPID, high throughput protein interaction data set; AUC, area under the ROC curve; DDI, domain-domain interaction; FP, false positive; GO, Gene Ontology; LR, likelihood ratio; ROC, receiver operating characteristic; SSBP, smallest shared biological process; STRING, search tool for the retrieval of interacting genes/proteins; TP, true positive.
      is evaluating the reliability of the interactions and then separating the subset of credible interactions from background noise.
      Several methods have been developed previously to predict the true protein interactions from the high throughput protein interaction data sets, such as data set intersection (
      • von Mering C.
      • Krause R.
      • Snel B.
      • Cornell M.
      • Oliver S.G.
      • Fields S.
      • Bork P.
      Comparative assessment of large-scale data sets of protein-protein interactions.
      ,
      • Han J.D.
      • Bertin N.
      • Hao T.
      • Goldberg D.S.
      • Berriz G.F.
      • Zhang L.V.
      • Dupuy D.
      • Walhout A.J.
      • Cusick M.E.
      • Roth F.P.
      • Vidal M.
      Evidence for dynamically organized modularity in the yeast protein-protein interaction network.
      ), homologous interaction (
      • Deane C.M.
      • Salwinski L.
      • Xenarios I.
      • Eisenberg D.
      Protein interactions: two methods for assessment of the reliability of high throughput observations.
      ,
      • Matthews L.R.
      • Vaglio P.
      • Reboul J.
      • Ge H.
      • Davis B.P.
      • Garrels J.
      • Vincent S.
      • Vidal M.
      Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”.
      ), interacting domains (
      • Ng S.K.
      • Zhang Z.
      • Tan S.H.
      • Lin K.
      InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes.
      ), functional similarity (3, 18), gene coexpression (
      • Kemmeren P.
      • van Berkum N.L.
      • Vilo J.
      • Bijma T.
      • Donders R.
      • Brazma A.
      • Holstege F.C.
      Protein interaction verification and functional annotation by integrated analysis of genome-scale data.
      ,
      • Hahn A.
      • Rahnenführer J.
      • Talwar P.
      • Lengauer T.
      Confirmation of human protein interaction data by human expression data.
      ), and protein interaction network topology (
      • Goldberg D.S.
      • Roth F.P.
      Assessing experimentally derived interactions in small world.
      ,
      • Saito R.
      • Suzuki H.
      • Hayashizaki Y.
      Interaction generality, a measurement to assess the reliability of a protein-protein interaction.
      ,
      • Bader J.S.
      • Chaudhuri A.
      • Rothberg J.M.
      • Chant J.
      Gaining confidence in high-throughput protein interaction networks.
      ). Most of these methods are based on a single type of biological evidence. Although these “Single Evidence Models” have proved effective to some degree, none of them achieves both high specificity and good sensitivity (
      • Deane C.M.
      • Salwinski L.
      • Xenarios I.
      • Eisenberg D.
      Protein interactions: two methods for assessment of the reliability of high throughput observations.
      ,
      • Matthews L.R.
      • Vaglio P.
      • Reboul J.
      • Ge H.
      • Davis B.P.
      • Garrels J.
      • Vincent S.
      • Vidal M.
      Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”.
      ,
      • Ng S.K.
      • Zhang Z.
      • Tan S.H.
      • Lin K.
      InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes.
      ,
      • Lehner B.
      • Fraser A.G.
      A first-draft human protein-interaction map.
      ,
      • Kemmeren P.
      • van Berkum N.L.
      • Vilo J.
      • Bijma T.
      • Donders R.
      • Brazma A.
      • Holstege F.C.
      Protein interaction verification and functional annotation by integrated analysis of genome-scale data.
      ,
      • Hahn A.
      • Rahnenführer J.
      • Talwar P.
      • Lengauer T.
      Confirmation of human protein interaction data by human expression data.
      ,
      • Goldberg D.S.
      • Roth F.P.
      Assessing experimentally derived interactions in small world.
      ,
      • Saito R.
      • Suzuki H.
      • Hayashizaki Y.
      Interaction generality, a measurement to assess the reliability of a protein-protein interaction.
      ,
      • Bader J.S.
      • Chaudhuri A.
      • Rothberg J.M.
      • Chant J.
      Gaining confidence in high-throughput protein interaction networks.
      ). To reduce the intrinsic false positives and false negatives from a single source, in recent years researchers have tended to integrate multiple data sources. Using experimental, topological, and Gene Ontology (
      • Ashburner M.
      • Ball C.A.
      • Blake J.A.
      • Botstein D.
      • Butler H.
      • Cherry J.M.
      • Davis A.P.
      • Dolinski K.
      • Dwight S.S.
      • Eppig J.T.
      • Harris M.A.
      • Hill D.P.
      • Issel-Tarver L.
      • Kasarskis A.
      • Lewis S.
      • Matese J.C.
      • Richardson J.E.
      • Ringwald M.
      • Rubin G.M.
      • Sherlock G.
      Gene Ontology: tool for the unification of biology.
      ) information, Stelzl et al. (
      • Stelzl U.
      • Worm U.
      • Lalowski M.
      • Haenig C.
      • Brembeck F.H.
      • Goehler H.
      • Stroedicke M.
      • Zenkner M.
      • Schoenherr A.
      • Koeppen S.
      • Timm J.
      • Mintzlaff S.
      • Abraham C.
      • Bock N.
      • Kietzmann S.
      • Goedde A.
      • Toksoz E.
      • Droege A.
      • Krobitsch S.
      • Korn B.
      • Birchmeier W.
      • Lehrach H.
      • Wanker E.E.
      A human protein-protein interaction network: a resource for annotating the proteome.
      ) established six criteria to evaluate their human yeast two-hybrid data. Each interaction is awarded one quality point for every criterion it fulfills, and the interactions are then ranked by their number of quality points. Because this process resembles a voting procedure, we refer to this scoring model as the “Simple Voting Model.” This voting procedure has some limitations. Each type of biological evidence has to be transformed into a binary format by setting a cutoff value, which inevitably involves a loss of information. Moreover, because different cutoffs can change the results of the voting procedure, it is often very difficult to set a proper cutoff.
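      The loss of information caused by cutoffs can be illustrated with a minimal sketch of a voting score. The feature names and cutoff values below are hypothetical and are not the six criteria of Stelzl et al.; the sketch only shows that a value just above a cutoff and a value far above it earn the same single point.

```python
def vote_score(evidence, cutoffs):
    """Award one point per evidence type whose value meets its cutoff."""
    return sum(
        1
        for name, cutoff in cutoffs.items()
        if name in evidence and evidence[name] >= cutoff
    )


# Hypothetical cutoffs, chosen only for illustration.
CUTOFFS = {"coexpression": 0.5, "shared_go_terms": 2}

weak = {"coexpression": 0.51, "shared_go_terms": 2}
strong = {"coexpression": 0.95, "shared_go_terms": 9}
borderline = {"coexpression": 0.49, "shared_go_terms": 2}

for name, pair in [("weak", weak), ("strong", strong), ("borderline", borderline)]:
    print(name, vote_score(pair, CUTOFFS))
```

      In this toy case the weak and strong pairs receive the same score although their evidence differs greatly, and moving the coexpression cutoff slightly would change the score of the borderline pair.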
      To overcome these limitations of the Simple Voting Model, here we introduce a Bayesian approach to integrate multiple types of biological evidence into the confidence scoring system. The Bayesian approach is a probability-based inference method that is suitable for combining evidence from multiple heterogeneous biological features and is especially robust to incomplete and uncertain data (
      • Eddy S.R.
      What is Bayesian statistics?.
      ). The Bayesian approach has been used frequently to predict protein-protein interactions (
      • Jansen R.
      • Yu H.
      • Greenbaum D.
      • Kluger Y.
      • Krogan N.J.
      • Chung S.
      • Emili A.
      • Snyder M.
      • Greenblatt J.F.
      • Gerstein M.
      A Bayesian networks approach for predicting protein-protein interactions from genomic data.
      ,
      • Rhodes D.R.
      • Tomlins S.A.
      • Varambally S.
      • Mahavisno V.
      • Barrette T.
      • Kalyana-Sundaram S.
      • Ghosh D.
      • Pandey A.
      • Chinnaiyan A.M.
      Probabilistic model of the human protein-protein interaction network.
      ,
      • Xia K.
      • Dong D.
      • Han J.D.
      IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model.
      ) and has also been used to assess the confidence of HTPID. Using a Bayesian network, Patil and Nakamura (
      • Patil A.
      • Nakamura H.
      Filtering high-throughput protein-protein interaction data using a combination of genomic features.
      ) integrated sequence homology, functional similarity, and interacting domains to evaluate the reliability of the yeast high throughput protein interaction data and achieved high sensitivity and specificity. However, their strategy did not perform well on the human protein interaction data (
      • Patil A.
      • Nakamura H.
      Filtering high-throughput protein-protein interaction data using a combination of genomic features.
      ). To evaluate the human HTPID efficiently, in this study we improve the model of Patil and Nakamura (
      • Patil A.
      • Nakamura H.
      Filtering high-throughput protein-protein interaction data using a combination of genomic features.
      ) in several aspects. 1) We integrate three more genomic features: gene coexpression, genome context, and network topology. Although two of these features, gene coexpression and network topology, have been used as the individual evidence for confidence assessment (
      • Kemmeren P.
      • van Berkum N.L.
      • Vilo J.
      • Bijma T.
      • Donders R.
      • Brazma A.
      • Holstege F.C.
      Protein interaction verification and functional annotation by integrated analysis of genome-scale data.
      ,
      • Hahn A.
      • Rahnenführer J.
      • Talwar P.
      • Lengauer T.
      Confirmation of human protein interaction data by human expression data.
      ,
      • Goldberg D.S.
      • Roth F.P.
      Assessing experimentally derived interactions in small world.
      ,
      • Saito R.
      • Suzuki H.
      • Hayashizaki Y.
      Interaction generality, a measurement to assess the reliability of a protein-protein interaction.
      ,
      • Bader J.S.
      • Chaudhuri A.
      • Rothberg J.M.
      • Chant J.
      Gaining confidence in high-throughput protein interaction networks.
      ), neither has been integrated into a Bayesian model for this purpose. 2) We use 16,301 human protein interactions as the golden standard positive data set, whereas Patil and Nakamura (
      • Patil A.
      • Nakamura H.
      Filtering high-throughput protein-protein interaction data using a combination of genomic features.
      ) used only 1,479 yeast interactions in their analyses. 3) We stratify each type of biological evidence into different confidence bins and then use likelihood ratios (LRs) to measure the reliability of these bins. These improvements are expected to increase the sensitivity and specificity of the Bayesian scoring system. Lastly, to facilitate use by the community, we developed this strategy into a Web service, namely PRINCESS (protein interaction confidence evaluation system with multiple data sources). PRINCESS is designed to address the requirements not only for confidence assessment of high throughput interactions but also for their biological annotation with multiple types of biological evidence.

      EXPERIMENTAL PROCEDURES

      The main strategy of PRINCESS is to use likelihood ratios to assess the reliability of individual biological evidences based on golden standard data sets and then to combine these individual likelihood ratios by a Bayesian model to assign confidence scores to the high throughput protein interactions (Fig. 1).
      Fig. 1. A schematic diagram of PRINCESS. PRINCESS first uses LRs to assess the reliability of biological evidences, then assigns LRs to the query protein interaction by its supporting evidences, and finally combines these LRs in a naïve Bayesian network to generate the composite likelihood ratio (LRcomp) for confidence assessment.
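      The strategy above can be sketched in a few lines, under the usual naive Bayes assumption that the evidence types are conditionally independent. In that case the LR of a confidence bin is the fraction of golden standard positives falling in the bin divided by the fraction of golden standard negatives falling in it, the composite LR is the product of the per-evidence LRs, and posterior odds equal prior odds multiplied by the composite LR. The function names, bin labels, and the pseudocount smoothing are assumptions made for this illustration and are not taken from the PRINCESS implementation.

```python
from collections import Counter
from math import prod


def likelihood_ratios(pos_bins, neg_bins, pseudocount=1.0):
    """Return LR(bin) = P(bin | positive) / P(bin | negative) for every bin.

    pos_bins / neg_bins list the bin label of each golden standard pair.
    The pseudocount (an assumption of this sketch) avoids division by zero
    when a bin is empty in one of the two sets.
    """
    bins = set(pos_bins) | set(neg_bins)
    pos_counts, neg_counts = Counter(pos_bins), Counter(neg_bins)
    pos_total = len(pos_bins) + pseudocount * len(bins)
    neg_total = len(neg_bins) + pseudocount * len(bins)
    return {
        b: ((pos_counts[b] + pseudocount) / pos_total)
        / ((neg_counts[b] + pseudocount) / neg_total)
        for b in bins
    }


def composite_lr(evidence_bins, lr_tables):
    """Multiply the per-evidence LRs of one query interaction.

    Evidence types without an LR table are skipped, and bins absent from a
    table contribute a neutral LR of 1.
    """
    return prod(
        lr_tables[name].get(bin_label, 1.0)
        for name, bin_label in evidence_bins.items()
        if name in lr_tables
    )


# Toy example with hypothetical bin labels.
interolog_lr = likelihood_ratios(["hit"] * 30 + ["none"] * 70,
                                 ["hit"] * 2 + ["none"] * 98)
query = {"interolog": "hit"}
print(composite_lr(query, {"interolog": interolog_lr}))
```

      Multiplying LRs keeps the graded information in each bin, in contrast to a binary vote, but the product is only valid to the extent that the evidence types are independent given the interaction status.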

       Golden Standard Data Sets

      To estimate the LR of each type of evidence, golden standard positive and negative data sets are constructed. Protein interaction data in the Human Protein Reference Database are all obtained by critical reading of the literature (
      • Peri S.
      • Navarro J.D.
      • Kristiansen T.Z.
      • Amanchy R.
      • Surendranath V.
      • Muthusamy B.
      • Gandhi T.K.
      • Chandrika K.N.
      • Deshpande N.
      • Suresh S.
      • Rashmi B.P.
      • Shanker K.
      • Padma N.
      • Niranjan V.
      • Harsha H.C.
      • Talreja N.
      • Vrushabendra B.M.
      • Ramya M.A.
      • Yatish A.J.
      • Joy M.
      • Shivashankar H.N.
      • Kavitha M.P.
      • Menezes M.
      • Choudhury D.R.
      • Ghosh N.
      • Saravana R.
      • Chandran S.
      • Mohan S.
      • Jonnalagadda C.K.
      • Prasad C.K.
      • Kumar-Sinha C.
      • Deshpande K.S.
      • Pandey A.
      Human protein reference database as a discovery resource for proteomics.
      ); therefore, we download the protein interaction data deposited in the Human Protein Reference Database (released September 13, 2005) as the golden standard positive. It is difficult to find an experimental negative data set. Here we construct a golden standard negative data set consisting of the interactions between 2,110 nuclear proteins and 1,021 plasma membrane proteins obtained from the Gene Ontology consortium (
      • Ashburner M.
      • Ball C.A.
      • Blake J.A.
      • Botstein D.
      • Butler H.
      • Cherry J.M.
      • Davis A.P.
      • Dolinski K.
      • Dwight S.S.
      • Eppig J.T.
      • Harris M.A.
      • Hill D.P.
      • Issel-Tarver L.
      • Kasarskis A.
      • Lewis S.
      • Matese J.C.
      • Richardson J.E.
      • Ringwald M.
      • Rubin G.M.
      • Sherlock G.
      Gene Ontology: tool for the unification of biology.
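      ). Because the two proteins in each of these pairs are annotated to different cellular compartments, the protein pairs in the golden standard negative have a lower probability of interacting than pairs in a random network.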
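      A minimal sketch of how such a negative set could be enumerated is given below. The protein identifiers are placeholders, and dropping proteins annotated to both compartments is an assumption of this sketch rather than a step stated in the text. If every cross-compartment pair were kept, 2,110 nuclear and 1,021 plasma membrane proteins would give at most 2,110 × 1,021 = 2,154,310 pairs.

```python
from itertools import product


def negative_pairs(nuclear, membrane):
    """Pair every nuclear-only protein with every membrane-only protein.

    Proteins annotated to both compartments are excluded (an assumption of
    this sketch). Each pair is stored as a sorted tuple so that (a, b) and
    (b, a) are treated as the same pair.
    """
    nuclear, membrane = set(nuclear), set(membrane)
    shared = nuclear & membrane
    return {
        tuple(sorted(pair))
        for pair in product(nuclear - shared, membrane - shared)
    }


# Placeholder identifiers, for illustration only.
pairs = negative_pairs(["N1", "N2", "X"], ["M1", "X"])
print(sorted(pairs))
```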

       Construction of Multiple Types of Biological Evidences

      In the current version of PRINCESS, we integrate six types of data sources. They are listed below.

       Model Organism Protein-Protein Interaction Data—

      Protein-protein interactions are frequently conserved across multiple organisms (
      • Pagel P.
      • Mewes H.W.
      • Frishman D.
      Conservation of protein-protein interactions—lessons from ascomycota.
      ,
      • Kelley B.P.
      • Sharan R.
      • Karp R.M.
      • Sittler T.
      • Root D.E.
      • Stockwell B.R.
      • Ideker T.
      Conserved pathways within bacteria and yeast as revealed by global protein network alignment.
      ); therefore, if the orthologs of a pair of interacting human proteins also interact in a model organism, the interaction is regarded as high confidence. Here we download the model organism protein interaction data sets from the Database of Interacting Proteins (
      • Salwinski L.
      • Miller C.S.
      • Smith A.J.
      • Pettit F.K.
      • Bowie J.U.
      • Eisenberg D.
      The Database of Interacting Proteins: 2004 update.
      ) and then map the proteins from these model organisms to their human orthologs using the InParanoid database (
      • O'Brien K.P.
      • Remm M.
      • Sonnhammer E.L.
      Inparanoid: a comprehensive database of eukaryotic orthologs.
      ). For these model organism protein interaction data, there are many variables that might correlate with their reliability. Therefore, we stratify them into various confidence bins using a J48 pruned decision tree as implemented in the Weka software package (
      • Witten I.H.
      • Frank E.
      Data Mining: Practical Machine Learning Techniques with Java Implementations.
      ) (supplemental Fig. S1). A J48 pruned tree is used here because it generates simple classification rules, which are well suited to the construction of the assessment system (
      • Frank E.
      • Hall M.A.
      • Holmes G.
      • Kirkby R.
      • Pfahringer B.
      • Witten I.H.
      • Trigg L.
      Weka.
      ). We evaluate the reliability of each confidence bin by its LR (Fig. 2A). The results suggest that orthologous interactions (“interologs”) are generally strong indicators of interaction reliability.
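      The interolog mapping step can be sketched as follows. The identifiers and the dictionary-based ortholog table are hypothetical stand-ins for the Database of Interacting Proteins and InParanoid data, and discarding self-pairs is an assumption of this sketch.

```python
def map_interologs(model_interactions, ortholog_map):
    """Transfer model organism interactions to human protein pairs.

    model_interactions: iterable of (protein_a, protein_b) pairs.
    ortholog_map: model organism protein -> list of human orthologs.
    Returns a set of unordered human pairs (frozensets).
    """
    human_pairs = set()
    for a, b in model_interactions:
        for human_a in ortholog_map.get(a, ()):
            for human_b in ortholog_map.get(b, ()):
                if human_a != human_b:  # skip self-pairs (assumption)
                    human_pairs.add(frozenset((human_a, human_b)))
    return human_pairs


# Hypothetical identifiers, for illustration only.
yeast_ppi = [("yA", "yB"), ("yC", "yD")]
orthologs = {"yA": ["H1"], "yB": ["H2", "H3"], "yC": ["H4"]}
print(map_interologs(yeast_ppi, orthologs))
```

      A query human interaction that appears in the returned set would then be assigned to a confidence bin, and that bin's LR would enter the composite score.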
      Fig. 2. Diverse types of biological evidences contributing to the reliability evaluation. A, model organism protein-protein interaction. The yeast (Sce), worm (Cel), and fly (Dme) protein interaction data sets are downloaded from the Database of Interacting Proteins (
      • Salwinski L.
      • Miller C.S.
      • Smith A.J.
      • Pettit F.K.
      • Bowie J.U.
      • Eisenberg D.
      The Database of Interacting Proteins: 2004 update.
      ), and these model organism proteins are mapped to human orthologs using the InParanoid database (
      • O'Brien K.P.
      • Remm M.
      • Sonnhammer E.L.
      Inparanoid: a comprehensive database of eukaryotic orthologs.
      ). A J48 pruned decision tree algorithm is used to classify these model organism interaction data sets into various confidence bins (see supplemental Fig. S1 for details of bins). B–D, interaction domain. Protein pairs are binned by three different measures, which are “domain enrichment ratio” (DER) (B), “InterDom score” (C), and “3did hit” (D). 3did hit means an interaction has a 3did interaction domain pair. E, GO coannotation. SSBP number (
      • Rhodes D.R.
      • Tomlins S.A.
      • Varambally S.
      • Mahavisno V.
      • Barrette T.
      • Kalyana-Sundaram S.
      • Ghosh D.
      • Pandey A.
      • Chinnaiyan A.M.
      Probabilistic model of the human protein-protein interaction network.
      ) is used to measure the functional similarity of each pair of proteins. Protein pairs are grouped into confidence bins according to the SSBP number. F, genome context. Three types of genome contexts are evaluated: gene co-occurrence, gene fusion, and gene neighborhood. These genome contexts together with their confidence scores are downloaded from the STRING database (
      • Valencia A.
      • Pazos F.
      Computational methods for the prediction of protein interaction.
      ) with the authors’ permission. With the J48 pruned tree, protein pairs are binned into several confidence bins according to their prediction score in STRING (see supplemental Fig. S2 for details of bins). G, gene coexpression. Two gene expression data sets are downloaded from Refs.
      • Su A.I.
      • Welsh J.B.
      • Sapinoso L.M.
      • Kern S.G.
      • Dimitrov P.
      • Lapp H.
      • Schultz P.G.
      • Powell S.M.
      • Moskaluk C.A.
      • Frierson Jr., H.F.
      • Hampton G.M.
      Molecular classification of human carcinomas by use of gene expression signatures.
      and
      • Segal N.H.
      • Pavlidis P.
      • Noble W.S.
      • Antonescu C.R.
      • Viale A.
      • Wesley U.V.
      • Busam K.
      • Gallardo H.
      • DeSantis D.
      • Brennan M.F.
      • Cordon-Cardo C.
      • Wolchok J.D.
      • Houghton A.N.
      Classification of clear-cell sarcoma as a subtype of melanoma by genomic profiling.
      . Protein interactions are binned according to the pairwise expression Pearson correlation coefficients. H, network topology.

       Interaction Domain Data—

      Many of the protein interactions are mediated by interaction domains (
      • Pawson T.
      • Nash P.
      Assembly of cell regulatory systems through protein interaction domains.
      ). Therefore, if the two proteins in an interaction contain a pair of interaction domains, the interaction is more likely to be reliable. Because none of the current domain-domain interaction (DDI) databases has satisfactory coverage, we integrate three available strategies in our scoring system. We download the predicted DDI data from InterDom (
      • Ng S.K.
      • Zhang Z.
      • Tan S.H.
      • Lin K.
      InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes.
      ) and the experimental DDI data from 3did (
      • Stein A.
      • Russell R.B.
      • Aloy P.
      3did: interacting protein domains of known three-dimensional structure.
      ). In addition to the DDIs in these databases, we also compute the domain enrichment ratio in the golden standard positive (method described in Ref.
      • Rhodes D.R.
      • Tomlins S.A.
      • Varambally S.
      • Mahavisno V.
      • Barrette T.
      • Kalyana-Sundaram S.
      • Ghosh D.
      • Pandey A.
      • Chinnaiyan A.M.
      Probabilistic model of the human protein-protein interaction network.
      ) to identify the possible domain interactions. The DDI score from the InterDom database and the domain enrichment ratio are used as the explanatory variables to classify the confidence bins. In Fig. 2, B, C, and D, an apparent correlation is observed between LR and these variables, suggesting that all three strategies are suitable for confidence assessment.

       Functional Annotation Data—

      Interacting proteins often participate in the same biological process (
      • Uetz P.
      • Giot L.
      • Cagney G.
      • Mansfield T.A.
      • Judson R.S.
      • Knight J.R.
      • Lockshon D.
      • Narayan V.
      • Srinivasan M.
      • Pochart P.
      • Qureshi-Emili A.
      • Li Y.
      • Godwin B.
      • Conover D.
      • Kalbfleisch T.
      • Vijayadamodar G.
      • Yang M.
      • Johnston M.
      • Fields S.
      • Rothberg J.M.
      A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.
      ,
      • Lehner B.
      • Fraser A.G.
      A first-draft human protein-interaction map.
      ). In recent years, a hierarchical, dynamically controlled vocabulary, Gene Ontology (GO), has been constructed to describe known biological roles of the genes or their products (
      • Ashburner M.
      • Ball C.A.
      • Blake J.A.
      • Botstein D.
      • Butler H.
      • Cherry J.M.
      • Davis A.P.
      • Dolinski K.
      • Dwight S.S.
      • Eppig J.T.
      • Harris M.A.
      • Hill D.P.
      • Issel-Tarver L.
      • Kasarskis A.
      • Lewis S.
      • Matese J.C.
      • Richardson J.E.
      • Ringwald M.
      • Rubin G.M.
      • Sherlock G.
      Gene Ontology: tool for the unification of biology.
      ). By GO, proteins sharing a more specific annotation are found to be more likely to interact with each other than those sharing less specific GO terms (
      • Xia K.
      • Dong D.
      • Han J.D.
      IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model.
      ). Here we introduce the smallest shared biological process (SSBP) (
      • Rhodes D.R.
      • Tomlins S.A.
      • Varambally S.
      • Mahavisno V.
      • Barrette T.
      • Kalyana-Sundaram S.
      • Ghosh D.
      • Pandey A.
      • Chinnaiyan A.M.
      Probabilistic model of the human protein-protein interaction network.
      ,
      • Li D.
      • Li J.
      • Ouyang S.
      • Wu S.
      • Wang J.
      • Xu X.
      • Zhu Y.
      • He F.
      An integrated strategy for functional analysis in large-scale proteomic research by Gene Ontology.
      ) to measure the functional similarity of a pair of proteins. As shown in Fig. 2E, protein pairs with a smaller SSBP tend to have higher LRs, in accordance with the previous report that an interaction between proteins sharing a more specific GO term is more reliable (
      • Xia K.
      • Dong D.
      • Han J.D.
      IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model.
      ).
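      The SSBP measure can be illustrated with a minimal sketch. It assumes, following our reading of the cited definition, that the SSBP of a protein pair is the number of proteins annotated to the smallest GO biological process term the two proteins share. The function name `ssbp`, the `annotations` dictionary, and the assumption that ancestor terms are already propagated are all hypothetical and are not taken from the PRINCESS implementation.

```python
from collections import Counter

def ssbp(protein_a, protein_b, annotations):
    """Size of the smallest GO biological process term shared by two proteins.

    annotations: dict mapping protein -> set of GO biological process terms
    (hypothetical input; ancestor terms are assumed to be already propagated).
    Returns None when the two proteins share no term.
    """
    # Number of proteins annotated to each term.
    term_size = Counter(term for terms in annotations.values() for term in terms)
    shared = annotations.get(protein_a, set()) & annotations.get(protein_b, set())
    if not shared:
        return None
    # The smallest shared term is the most specific one.
    return min(term_size[term] for term in shared)
```

      A smaller return value means the two proteins share a more specific process, which according to Fig. 2E corresponds to a higher LR.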

       Genome Context Data—

      Genes encoding interacting proteins often share a genome context, such as gene co-occurrence, gene neighborhood, or gene fusion. These contexts have been used to predict protein interactions (
      • Marcotte E.M.
      Computational genetics: finding protein function by nonhomology methods.
      ,
      • Valencia A.
      • Pazos F.
      Computational methods for the prediction of protein interaction.
      ). Here we identify the genome context of the human genes in the STRING database (
      • von Mering C.
      • Jensen L.J.
      • Kuhn M.
      • Chaffron S.
      • Doerks T.
      • Krüger B.
      • Snel B.
      • Bork P.
      STRING 7—recent developments in the integration and prediction of protein interactions.
      ). Using the confidence score in STRING, we classify genome context into different confidence bins (supplemental Fig. S2). Fig. 2F shows that genome context data are also suitable for confidence evaluation.

       Genome-wide Gene Expression Data—

      Interacting proteins tend to be coexpressed, especially when they belong to the same protein complex or biological process (
      • Deane C.M.
      • Salwinski L.
      • Xenarios I.
      • Eisenberg D.
      Protein interactions: two methods for assessment of the reliability of high throughput observations.
      ,
      • Eisen M.B.
      • Spellman P.T.
      • Brown P.O.
      • Botstein D.
      Cluster analysis and display of genome-wide expression patterns.
      ). Here we integrate two high quality large scale gene expression profiles from Refs.
      • Su A.I.
      • Welsh J.B.
      • Sapinoso L.M.
      • Kern S.G.
      • Dimitrov P.
      • Lapp H.
      • Schultz P.G.
      • Powell S.M.
      • Moskaluk C.A.
      • Frierson Jr., H.F.
      • Hampton G.M.
      Molecular classification of human carcinomas by use of gene expression signatures.
      and
      • Segal N.H.
      • Pavlidis P.
      • Noble W.S.
      • Antonescu C.R.
      • Viale A.
      • Wesley U.V.
      • Busam K.
      • Gallardo H.
      • DeSantis D.
      • Brennan M.F.
      • Cordon-Cardo C.
      • Wolchok J.D.
      • Houghton A.N.
      Classification of clear-cell sarcoma as a subtype of melanoma by genomic profiling.
      . For each profile, gene pairs are grouped into 20 bins according to pairwise expression Pearson correlation coefficient values. Fig. 2G shows a significant correlation between the expression Pearson correlation coefficient value and the LR, suggesting that gene coexpression can also be used for confidence assessment.
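      The binning step can be sketched as follows. The text specifies 20 bins per profile but not how the bin boundaries are chosen, so the equal-width bins over [-1, 1] used here are an assumption, as are the function names `pearson` and `pcc_bin`.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a flat profile carries no coexpression signal
    return cov / (sx * sy)

def pcc_bin(r, n_bins=20):
    """Map a coefficient r in [-1, 1] to a bin index 0..n_bins-1.

    Equal-width bins are an assumption; the source only states the bin count.
    """
    idx = int((r + 1.0) / 2.0 * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp r = 1 and rounding noise
```

      Each gene pair then inherits the LR computed for its bin from the golden standard data sets.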

       Network Topological Structure Data—

      Interactions in potential functional modules are of higher confidence than others because cellular functions are often carried out by stably or transiently associated groups of proteins (
      • Milo R.
      • Shen-Orr S.
      • Itzkovitz S.
      • Kashtan N.
      • Chklovskii D.
      • Alon U.
      Network motifs: simple building blocks of complex networks.
      ). Therefore, network topological structure data are also integrated in our confidence scoring system. To identify protein interactions with a particular network topology, we combine the query interaction with the training protein interaction data to generate an integrated network, from which we derive explanatory variables that correlate with interaction confidence. For the interacting proteins 1 and 2, we define h1 and h2 as the numbers of their protein partners in the integrated network, respectively; h12 as the number of their shared protein partners; and Nfour as the number of four-interaction loops in which the interaction participates. To measure the likelihood that the interaction lies in three-interaction loops, we define hlog as in Equation 1,
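      $$h_{\log} = -\log_{10} P = -\log_{10}\left(\sum_{i=h_{12}}^{\min(h_1,h_2)} \frac{C_{h_1}^{i}\, C_{\mathrm{total}-h_1}^{h_2-i}}{C_{\mathrm{total}}^{h_2}}\right)$$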
      (Eq. 1)


      where P is the probability that proteins 1 and 2 share at least h12 neighbors by chance (
      • Goldberg D.S.
      • Roth F.P.
      Assessing experimentally derived interactions in small world.
      ), $C_n^k$ is the binomial coefficient (n choose k), and total is the number of all proteins in the network. Fig. 2H shows that protein interactions in three- or four-interaction loops are of higher confidence, in agreement with the previous report (
      • Uetz P.
      • Giot L.
      • Cagney G.
      • Mansfield T.A.
      • Judson R.S.
      • Knight J.R.
      • Lockshon D.
      • Narayan V.
      • Srinivasan M.
      • Pochart P.
      • Qureshi-Emili A.
      • Li Y.
      • Godwin B.
      • Conover D.
      • Kalbfleisch T.
      • Vijayadamodar G.
      • Yang M.
      • Johnston M.
      • Fields S.
      • Rothberg J.M.
      A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.
      ).
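      Equation 1 is the upper tail of a hypergeometric distribution and can be evaluated directly with exact integer arithmetic. The sketch below is illustrative only; the function name `h_log` and its argument names are our own, and it assumes Python 3.8 or later for `math.comb`.

```python
from math import comb, log10

def h_log(h1, h2, h12, total):
    """-log10 of the chance that two proteins share at least h12 partners.

    h1, h2: numbers of partners of proteins 1 and 2
    h12:    number of shared partners
    total:  number of proteins in the network
    """
    upper = min(h1, h2)
    # Sum of C(h1, i) * C(total - h1, h2 - i) for i = h12 .. min(h1, h2).
    numerator = sum(
        comb(h1, i) * comb(total - h1, h2 - i) for i in range(h12, upper + 1)
    )
    denominator = comb(total, h2)
    # Taking logs of the exact integers avoids float underflow for tiny P.
    # A numerator of 0 (an impossible configuration) raises ValueError.
    return log10(denominator) - log10(numerator)
```

      With h12 = 0 the sum covers the whole distribution, so P = 1 and hlog = 0; the value grows as more shared partners are observed than expected by chance.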

       Combining the Confidence Scores from Individual Evidences by Bayesian Rules

      Bayesian rules have been described extensively in several previous studies (
      • Jansen R.
      • Yu H.
      • Greenbaum D.
      • Kluger Y.
      • Krogan N.J.
      • Chung S.
      • Emili A.
      • Snyder M.
      • Greenblatt J.F.
      • Gerstein M.
      A Bayesian networks approach for predicting protein-protein interactions from genomic data.
      ,
      • Rhodes D.R.
      • Tomlins S.A.
      • Varambally S.
      • Mahavisno V.
      • Barrette T.
      • Kalyana-Sundaram S.
      • Ghosh D.
      • Pandey A.
      • Chinnaiyan A.M.
      Probabilistic model of the human protein-protein interaction network.
      ,
      • Xia K.
      • Dong D.
      • Han J.D.
      IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model.
      ,
      • Patil A.
      • Nakamura H.
      Filtering high-throughput protein-protein interaction data using a combination of genomic features.
      ). Here we define a pair of proteins that interact with each other as “positive” and those that do not interact as “negative.” Following a derivation of Bayesian rules (
      • Eddy S.R.
      What is Bayesian statistics?.
      ), the posterior odds (Opost) of an interaction can be calculated as the product of the prior odds (Oprior) and the likelihood ratio LR(f), as in Equation 2,
      $$O_{post} = \frac{P(\mathrm{positive} \mid f)}{P(\mathrm{negative} \mid f)} = O_{prior} \times LR(f)$$
      (Eq. 2)


      where P(positive|f) is the probability that a pair of proteins interacts given the biological evidence f, whereas P(negative|f) is the probability that the pair does not interact. The prior odds is the ratio of the probability that a pair drawn from all protein pairs interacts to the probability that it does not, and it can be estimated from the golden standard data sets (Equation 3).
      $$O_{prior} = \frac{P(\mathrm{positive})}{1 - P(\mathrm{positive})}$$
      (Eq. 3)


      The LR of biological evidence f is the ratio of the probability of observing f in an interacting protein pair to the probability of observing f in a non-interacting protein pair in the golden standard data sets. From Equations 2 and 3, the LR can be computed as
      $$LR(f) = \frac{P(f \mid \mathrm{positive})}{P(f \mid \mathrm{negative})} = \frac{TP_f / T}{FP_f / F}$$
      (Eq. 4)


      where T and F are the numbers of all the true and false interactions, respectively, and TPf and FPf are the numbers of true and false interactions with the biological evidence f, respectively. An advantage of Bayesian rules is that they allow us to integrate multiple heterogeneous data sources into one probabilistic model. Because the biological data types integrated in PRINCESS are obtained by different approaches, we assume that they are conditionally independent. Therefore, we can obtain the composite LR (LRcomp) by simply multiplying the LRs from the individual sources, that is, by a naïve Bayesian network (Equation 5).
      $$LR(f_1 \ldots f_n) = \prod_{i=1}^{n} \frac{P(f_i \mid \mathrm{positive})}{P(f_i \mid \mathrm{negative})} = \prod_{i=1}^{n} LR(f_i)$$
      (Eq. 5)


      According to the Bayesian rules described above, during the assessment procedure PRINCESS first finds the supporting evidences for the query interaction and assigns it the corresponding LR values. If the biological evidences within the same data type give more than one LR, the maximum is retained. The naïve Bayesian network is then used to integrate the LRs from the multiple types of data sources to generate LRcomp for confidence assessment (Fig. 1).
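      The scoring procedure of Equations 2, 4, and 5 can be summarized in a short sketch. It is not the PRINCESS code; the function names, the dictionary layout of `evidence_lrs`, and the convention that a pair without any evidence receives a composite LR of 1 are assumptions made for illustration (Python 3.8 or later for `math.prod`).

```python
from math import prod

def likelihood_ratio(tp_f, fp_f, t, f):
    """Eq. 4: LR(f) = (TP_f / T) / (FP_f / F).

    Raises ZeroDivisionError when no false interaction carries evidence f.
    """
    return (tp_f / t) / (fp_f / f)

def composite_lr(evidence_lrs):
    """Eq. 5 with the maximum-per-data-type rule described in the text.

    evidence_lrs: dict mapping data type -> list of LRs found for the pair.
    Data types with no evidence are skipped (assumed to contribute LR = 1).
    """
    return prod(max(lrs) for lrs in evidence_lrs.values() if lrs)

def posterior_odds(prior_odds, lr):
    """Eq. 2: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * lr
```

      For example, a pair with two domain evidences (LRs 2 and 4) and one GO evidence (LR 3) keeps the larger domain LR and obtains LRcomp = 4 × 3 = 12.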

       Receiver Operating Characteristic (ROC) Curve and Cross-validation

      A ROC curve can show the efficacy of one test by presenting both sensitivity and specificity for different cutoff points (
      • Baldi P.
      • Brunak S.
      • Chauvin Y.
      • Andersen C.A.
      • Nielsen H.
      Assessing the accuracy of prediction algorithms for classification: an overview.
      ). Sensitivity measures the ability of a test to identify true positives, and specificity measures its ability to reject negatives rather than call them false positives. The two quantities are calculated as Sensitivity = TP/T and Specificity = 1 − (FP/F), where TP and FP are the numbers of identified true and false positives, respectively, and T and F are the total numbers of positives and negatives in a test. The ROC curves are plotted and smoothed by SPSS software with the sensitivity on the y axis and 1 − Specificity on the x axis (
      • SPSS, Inc.
      SPSS Base 10.0 User's Guide.
      ).
      To test the efficacy of the overall performance of various assessment models, the 5-fold cross-validation protocol is used. The golden standard positive and negative data sets are randomly divided into five approximately equal subsets. Four sets are used as training data sets to compute the likelihood ratios of the individual evidence. The remaining set is used as the test data set to count the number of predicted true positives (TP) and false positives (FP) where one protein pair is predicted to be positive if its likelihood ratio exceeds a particular cutoff, LRcutoff, and to be negative otherwise. This process is done in turn five times, and finally the number of TPs and FPs against different likelihood ratios across five test data sets are summed to calculate the TP/FP ratio and the sensitivity (TP/T) and specificity (1 − (FP/F)) for the ROC curve.
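      The cross-validation protocol can be sketched for a single evidence type as below. Each golden standard pair is represented only by the label of its confidence bin. The function names, the fixed random seed, the pseudocount of 1 that keeps the ratio finite, and the default LR of 1 for bins unseen in training are all assumptions of this sketch, not details given in the text.

```python
import random
from collections import Counter

def k_folds(items, k=5, seed=0):
    """Randomly split items into k approximately equal subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def bin_lrs(train_pos, train_neg):
    """LR of each evidence bin, with a pseudocount of 1 (an assumption)."""
    cp, cn = Counter(train_pos), Counter(train_neg)
    t, f = len(train_pos), len(train_neg)
    return {b: ((cp[b] + 1) / (t + 1)) / ((cn[b] + 1) / (f + 1))
            for b in set(cp) | set(cn)}

def cross_validate(pos, neg, lr_cutoff, k=5, seed=0):
    """Sum TP and FP over k test folds, then derive sensitivity and specificity."""
    pos_folds, neg_folds = k_folds(pos, k, seed), k_folds(neg, k, seed)
    tp = fp = 0
    for i in range(k):
        # Train on the other k - 1 folds.
        train_pos = [x for j, fold in enumerate(pos_folds) if j != i for x in fold]
        train_neg = [x for j, fold in enumerate(neg_folds) if j != i for x in fold]
        lrs = bin_lrs(train_pos, train_neg)
        # A test pair is predicted positive if its LR exceeds the cutoff.
        tp += sum(lrs.get(b, 1.0) > lr_cutoff for b in pos_folds[i])
        fp += sum(lrs.get(b, 1.0) > lr_cutoff for b in neg_folds[i])
    sensitivity = tp / len(pos)
    specificity = 1 - fp / len(neg)
    return tp, fp, sensitivity, specificity
```

      Repeating the call over a range of `lr_cutoff` values yields the points of the ROC curve and the TP/FP ratios.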

      RESULTS

       Six Types of Biological Evidences Can Be Used to Assess the Confidence of Protein Interactions—

      We use the golden standard positive and negative data sets to measure the reliability of each biological evidence. Fig. 2 shows their likelihood ratios LR(f) for each biological evidence f. In theory, LR(f) > 1 indicates that biological evidence f has the ability to identify the true protein interactions from the HTPID. As seen in Fig. 2, all six biological evidences have LRs greater than 1, suggesting that all of them can be used to assess the confidence of the protein interactions. From Fig. 2, we can also find that there are great differences between the reliability of these six data types. Interacting domain, function annotation, network topology structure, and model protein-protein interactions have higher reliability, whereas the reliability of the gene expression and genome context is relatively lower. There are also great differences between different data sets of the same data types. Therefore, it is reasonable to take differences of the data sets into account when combining them for confidence assessments.

       The Combined Likelihood Ratio Can Be Used to Measure the Reliability of a Protein-Protein Interaction—

      Because the prior odds is a constant, the posterior odds is proportional to the LR. Therefore, the LR can theoretically measure the reliability of an interaction (Equation 2). To test this hypothesis, during the 5-fold cross-validation against the golden standard data sets, we vary the LRcutoff and plot the ratio of true to false positives (TP/FP) as a function of the likelihood ratio cutoff in Fig. 3. TP/FP, a measure of the accuracy of a test, increases monotonically with the likelihood ratio cutoff, confirming that the combined likelihood ratio, like the individual likelihood ratios, can be used as a confidence score to measure the odds of a real interaction. This is the foundation for all of the assessment models except the Simple Voting Model described under “PRINCESS Has a Higher Sensitivity than Three Other Kinds of Assessment Models for Comparable Specificity.” A protein pair with an LR greater than 1 is considered to be supported by at least one biological evidence. Users can set a higher likelihood ratio threshold to select the higher confidence interactions.
      Fig. 3. TP/FP ratio as a function of likelihood ratio cutoff for two types of assessment models. Interolog, Interacting Domain, GO Coannotation, Genome Context, Gene Coexpression, and Network Topology denote the six Single Evidence Models, whereas PRINCESS denotes the Bayesian model integrating these six evidences. The numbers of true positives and false positives are from the 5-fold cross-validation (see text for details).

       PRINCESS Has a Higher Sensitivity than Three Other Kinds of Assessment Models for Comparable Specificity—

      To compare the efficacy of PRINCESS with that of other assessment methods in the literature (
      • Deane C.M.
      • Salwinski L.
      • Xenarios I.
      • Eisenberg D.
      Protein interactions: two methods for assessment of the reliability of high throughput observations.
      ,
      • Matthews L.R.
      • Vaglio P.
      • Reboul J.
      • Ge H.
      • Davis B.P.
      • Garrels J.
      • Vincent S.
      • Vidal M.
      Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”.
      ,
      • Ng S.K.
      • Zhang Z.
      • Tan S.H.
      • Lin K.
      InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes.
      ,
      • Lehner B.
      • Fraser A.G.
      A first-draft human protein-interaction map.
      ,
      • Kemmeren P.
      • van Berkum N.L.
      • Vilo J.
      • Bijma T.
      • Donders R.
      • Brazma A.
      • Holstege F.C.
      Protein interaction verification and functional annotation by integrated analysis of genome-scale data.
      ,
      • Hahn A.
      • Rahnenführer J.
      • Talwar P.
      • Lengauer T.
      Confirmation of human protein interaction data by human expression data.
      ,
      • Goldberg D.S.
      • Roth F.P.
      Assessing experimentally derived interactions in small world.
      ,
      • Saito R.
      • Suzuki H.
      • Hayashizaki Y.
      Interaction generality, a measurement to assess the reliability of a protein-protein interaction.
      ,
      • Bader J.S.
      • Chaudhuri A.
      • Rothberg J.M.
      • Chant J.
      Gaining confidence in high-throughput protein interaction networks.
      ,
      • Ashburner M.
      • Ball C.A.
      • Blake J.A.
      • Botstein D.
      • Butler H.
      • Cherry J.M.
      • Davis A.P.
      • Dolinski K.
      • Dwight S.S.
      • Eppig J.T.
      • Harris M.A.
      • Hill D.P.
      • Issel-Tarver L.
      • Kasarskis A.
      • Lewis S.
      • Matese J.C.
      • Richardson J.E.
      • Ringwald M.
      • Rubin G.M.
      • Sherlock G.
      Gene Ontology: tool for the unification of biology.
      ,
      • Eddy S.R.
      What is Bayesian statistics?.
      ), we first construct several assessment models simulating those methods. For the single evidence methods, we establish Single Evidence Models in which the confidence of each protein interaction is the likelihood ratio of the confidence bin of one individual evidence. To simulate the method of Stelzl et al. (
      • Stelzl U.
      • Worm U.
      • Lalowski M.
      • Haenig C.
      • Brembeck F.H.
      • Goehler H.
      • Stroedicke M.
      • Zenkner M.
      • Schoenherr A.
      • Koeppen S.
      • Timm J.
      • Mintzlaff S.
      • Abraham C.
      • Bock N.
      • Kietzmann S.
      • Goedde A.
      • Toksoz E.
      • Droege A.
      • Krobitsch S.
      • Korn B.
      • Birchmeier W.
      • Lehrach H.
      • Wanker E.E.
      A human protein-protein interaction network: a resource for annotating the proteome.
      ), we establish the Simple Voting Model, in which every protein interaction is assigned a confidence score equal to the number of biological evidences supporting it (an evidence counts as supporting when its individual likelihood ratio is greater than LRcutoff). In particular, to compare PRINCESS with the Bayesian model of Patil and Nakamura (
      • Patil A.
      • Nakamura H.
      Filtering high-throughput protein-protein interaction data using a combination of genomic features.
      ), we use only three biological evidences, “Interolog,” “Interacting Domain,” and “GO Coannotation” to construct the “Three Evidence Model.”
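      For contrast with the Bayesian combination, the Simple Voting Model described above reduces to a count. The sketch below is a minimal illustration; the function name and the dictionary of per-evidence LRs are hypothetical.

```python
def simple_voting_score(evidence_lrs, lr_cutoff=1.0):
    """Number of evidence types whose individual LR exceeds lr_cutoff.

    evidence_lrs: dict mapping evidence name -> individual LR for the pair.
    """
    return sum(1 for lr in evidence_lrs.values() if lr > lr_cutoff)
```

      Unlike the product of LRs, this score treats a weakly and a strongly supporting evidence identically once both pass the cutoff.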
      We also use the 5-fold cross-validation protocol to evaluate the performance of PRINCESS. The resulting ROC curves are illustrated in Fig. 4. Each point on the ROC curve of each assessment model denotes the sensitivity and specificity obtained from one test against a particular LRcutoff. The area under the ROC curve (AUC) is an indicator of the efficacy of the assessment system. An ideal test with perfect discrimination (100% sensitivity and 100% specificity) has an AUC of 1.0, whereas a non-informative prediction has an AUC of 0.5, no better than random guessing. The closer the AUC of a test is to 1.0, the higher its overall efficacy. We find that our improved Bayesian model has an AUC of approximately 0.9, suggesting that it has a relatively high ability to identify the true interactions in the test data sets.
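      An AUC can be approximated from the (1 − Specificity, Sensitivity) points by the trapezoidal rule, as sketched below. This is a substitute technique for illustration: the curves in this study are plotted and smoothed by SPSS, so values obtained this way need not match the reported ones exactly. The function name `auc` and the automatic addition of the end points (0, 0) and (1, 1) are assumptions.

```python
def auc(points):
    """Trapezoidal area under a ROC curve.

    points: iterable of (false_positive_fraction, true_positive_fraction)
    pairs, one per likelihood ratio cutoff. The end points (0, 0) and
    (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

      With no informative points the function returns 0.5, the chance level, and a single point at 0% false positives and 100% sensitivity gives 1.0.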
      Fig. 4. ROC curves for various assessment models using 5-fold cross-validations against the golden standard data sets. Each point on the ROC curves of the various assessment models corresponds to the sensitivity and specificity against a particular likelihood ratio cutoff. The names of the assessment models corresponding to these curves are shown in the legends. Different colors are used to distinguish the curves for different models. Interolog, Interacting Domain, GO Coannotation, Genome Context, Gene Coexpression, and Network Topology denote the six Single Evidence Models, whereas Three Evidence Model, Simple Voting Model, and PRINCESS denote the other three assessment models with multiple evidences. The area under the curve is also presented in the figure. Sensitivity and specificity are computed during the 5-fold cross-validations (see text for details). SPSS software is used to smooth the curves (
      • SPSS, Inc.
      SPSS Base 10.0 User's Guide.
      ). TPF, true positive fraction; FPF, false positive fraction.
      Because the AUC is an indicator of the discriminatory power of the assessment system, we also use it to compare the prediction efficacy of the different assessment models. Fig. 4 shows that the Single Evidence Models have different AUC values, in accord with the earlier conclusion that their reliability differs greatly, and that the efficacy of these models is lower than that of the multiple evidence models. The Simple Voting Model, which integrates multiple data sources, also has relatively high efficacy. However, its efficacy is still lower than that of PRINCESS, possibly because the Bayesian system takes into account the differences in reliability among the biological evidences. We also compare the performance of PRINCESS with that of the Three Evidence Model (
      • Patil A.
      • Nakamura H.
      Filtering high-throughput protein-protein interaction data using a combination of genomic features.
      ). We find that the three additional features improve assessment efficacy significantly, although the evaluation ability of the individual gene coexpression and genome context evidences is relatively low (Fig. 4).

       Improved Bayesian Model Can Predict True Interactions from Human High Throughput Data Sets—

      The current version of PRINCESS is used mainly to predict true interactions from human high throughput protein interaction data sets. Authors of high throughput data sets usually assign a confidence level to interactions by experimental or bioinformatics approaches. Here we evaluate the reliability of two human HTPIDs by assigning LR values (
      • Uetz P.
      • Giot L.
      • Cagney G.
      • Mansfield T.A.
      • Judson R.S.
      • Knight J.R.
      • Lockshon D.
      • Narayan V.
      • Srinivasan M.
      • Pochart P.
      • Qureshi-Emili A.
      • Li Y.
      • Godwin B.
      • Conover D.
      • Kalbfleisch T.
      • Vijayadamodar G.
      • Yang M.
      • Johnston M.
      • Fields S.
      • Rothberg J.M.
      A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.
      ,
      • Rual J.F.
      • Venkatesan K.
      • Hao T.
      • Hirozane-Kishikawa T.
      • Dricot A.
      • Li N.
      • Berriz G.F.
      • Gibbons F.D.
      • Dreze M.
      • Ayivi-Guedehoussou N.
      • Klitgord N.
      • Simon C.
      • Boxem M.
      • Milstein S.
      • Rosenberg J.
      • Goldberg D.S.
      • Zhang L.V.
      • Wong S.L.
      • Franklin G.
      • Li S.
      • Albala J.S.
      • Lim J.
      • Fraughton C.
      • Llamosas E.
      • Cevik S.
      • Bex C.
      • Lamesch P.
      • Sikorski R.S.
      • Vandenhaute J.
      • Zoghbi H.Y.
      • Smolyar A.
      • Bosak S.
      • Sequerra R.
      • Doucette-Stamm L.
      • Cusick M.E.
      • Hill D.E.
      • Roth F.P.
      • Vidal M.
      Towards a proteome-scale map of the human protein-protein interaction network.
      ). High throughput interactions with LR > 2 are predicted to be true interactions. We show the percentage of interactions predicted to be true across data sets of different confidence levels. As can be seen from Fig. 5, the higher the confidence assigned to a protein interaction data set in the literature, the higher the percentage of its interactions predicted to be true, suggesting that PRINCESS can select the high confidence protein interactions from the HTPIDs.
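      The quantity plotted in Fig. 5 is a simple proportion, sketched here for one confidence group. The function name and the convention of returning 0 for an empty group are assumptions; the threshold of 2 is the one stated in the text.

```python
def percent_predicted_true(lrs, threshold=2.0):
    """Percentage of interactions whose LR is strictly greater than threshold."""
    if not lrs:
        return 0.0  # assumption: an empty group reports 0%
    return 100.0 * sum(lr > threshold for lr in lrs) / len(lrs)
```

      Applying the function separately to each confidence group of a data set gives one bar per group.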
      Fig. 5. Percentage of interactions predicted true in the reported human protein interaction data sets. The two panels show the percentage of the predicted true interactions across protein interaction data sets with different confidence levels. The HTPID in A is downloaded from Ref.
      • Stelzl U.
      • Worm U.
      • Lalowski M.
      • Haenig C.
      • Brembeck F.H.
      • Goehler H.
      • Stroedicke M.
      • Zenkner M.
      • Schoenherr A.
      • Koeppen S.
      • Timm J.
      • Mintzlaff S.
      • Abraham C.
      • Bock N.
      • Kietzmann S.
      • Goedde A.
      • Toksoz E.
      • Droege A.
      • Krobitsch S.
      • Korn B.
      • Birchmeier W.
      • Lehrach H.
      • Wanker E.E.
      A human protein-protein interaction network: a resource for annotating the proteome.
      . Based on experimental and bioinformatics criteria, Stelzl et al. (
      • Stelzl U.
      • Worm U.
      • Lalowski M.
      • Haenig C.
      • Brembeck F.H.
      • Goehler H.
      • Stroedicke M.
      • Zenkner M.
      • Schoenherr A.
      • Koeppen S.
      • Timm J.
      • Mintzlaff S.
      • Abraham C.
      • Bock N.
      • Kietzmann S.
      • Goedde A.
      • Toksoz E.
      • Droege A.
      • Krobitsch S.
      • Korn B.
      • Birchmeier W.
      • Lehrach H.
      • Wanker E.E.
      A human protein-protein interaction network: a resource for annotating the proteome.
      ) grouped 2,618 yeast two-hybrid interactions into three confidence levels: low (LC), medium (MC), and high (HC) confidence. The data set in B is downloaded from Ref.
      • Rual J.F.
      • Venkatesan K.
      • Hao T.
      • Hirozane-Kishikawa T.
      • Dricot A.
      • Li N.
      • Berriz G.F.
      • Gibbons F.D.
      • Dreze M.
      • Ayivi-Guedehoussou N.
      • Klitgord N.
      • Simon C.
      • Boxem M.
      • Milstein S.
      • Rosenberg J.
      • Goldberg D.S.
      • Zhang L.V.
      • Wong S.L.
      • Franklin G.
      • Li S.
      • Albala J.S.
      • Lim J.
      • Fraughton C.
      • Llamosas E.
      • Cevik S.
      • Bex C.
      • Lamesch P.
      • Sikorski R.S.
      • Vandenhaute J.
      • Zoghbi H.Y.
      • Smolyar A.
      • Bosak S.
      • Sequerra R.
      • Doucette-Stamm L.
      • Cusick M.E.
      • Hill D.E.
      • Roth F.P.
      • Vidal M.
      Towards a proteome-scale map of the human protein-protein interaction network.
      . This human protein interaction data set comprises three confidence groups. “Core” contains 624 interactions supported by at least two PubMed entries, “Non-core” contains 3,443 interactions supported by only one PubMed entry, and the other 2,699 interactions are screened interactions without literature support (yeast two-hybrid (“Y2H”)). Each interaction in these data sets is assigned a confidence score by PRINCESS. An interaction with a score greater than 2.0 is regarded as true.

       PRINCESS Web Service—

      The PRINCESS system is implemented under the UNIX environment with the PRINCESS data stored in the relational database for retrieval. Automated methods for searching and dynamically displaying the assessment result and the annotation are built with the combination of Perl (practical extraction and report language) and HTML (hypertext markup language). To improve the speed of analyses, parallel processing is used.
      Using the PRINCESS Web service is a simple two-step process. In the first step, the user provides properly formatted protein interaction data, that is, pairs of gene or protein identifiers separated by a comma. PRINCESS currently accepts two types of identifiers: official gene symbols (
      • Wain H.M.
      • Lush M.
      • Ducluzeau F.
      • Povey S.
      Genew: the human gene nomenclature database.
      ) and International Protein Index (IPI) protein identifiers (
      • Kersey P.J.
      • Duarte J.
      • Williams A.
      • Karavidopoulou Y.
      • Birney E.
      • Apweiler R.
      The International Protein Index: an integrated database for proteomics experiments.
      ). The user may also select the desired biological features and appropriate parameters. The PRINCESS Web application accepts the query via a Web browser-based interface (Fig. 6A). In the second step, the user is presented with a table of analytical results in which an LR indicates the confidence of each interaction. Interactions with LRs greater than 1 are those supported by at least one biological feature. The larger the LR of a protein interaction, the more reliable the interaction is (Fig. 6B).
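      The input format described above can be checked before submission with a small client-side sketch. Only the comma separator comes from the text; the one-pair-per-line layout, the function name `parse_pairs`, and the example gene symbols are assumptions, and the code does not call or reproduce any part of the PRINCESS service.

```python
def parse_pairs(text):
    """Parse comma-separated identifier pairs, one pair per line.

    The one-pair-per-line layout is an assumption of this sketch.
    Returns a list of (identifier_1, identifier_2) tuples.
    """
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(
                f"line {line_no}: expected two identifiers separated by a comma"
            )
        pairs.append((parts[0], parts[1]))
    return pairs
```

      For instance, the text "TP53,MDM2" on one line and "BRCA1,BARD1" on the next parses into two pairs of official gene symbols.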
      Fig. 6. Web services of PRINCESS. A, the PRINCESS home page. Users may submit their protein interaction items directly into the form of this interface. B, PRINCESS evaluation results. For the submitted protein interactions, PRINCESS shows the assessment results in table form where “V” and “X” stand for “present” and “absent,” respectively, in regard to a given biological evidence for an interaction. C, visualization of the evaluated protein interaction network. The color of an edge denotes its evidence level. Clicking on the proteins or interactions leads to hyperlinked pages with more detailed information. D, supporting evidences for the query protein interactions. For the query protein interactions, PRINCESS can show the detailed annotation information. In particular, PRINCESS can give a graph view of the interaction domains and the network position of the query protein interactions. HC, high confidence interaction; LC, low confidence interaction; EXHC, extended interaction in full network from high confidence query interaction; EXLC, extended interaction in full network from low confidence query interaction; JPG, joint photographic experts group; PNG, portable network graphics; PS, PostScript.
      During the confidence assessment, each protein interaction is annotated by multiple biological evidences. PRINCESS presents the detailed information via HTML pages (Fig. 6D), and the hyperlink of each item in these pages leads to further details on the protein interactions. Because graphical views are more informative, PRINCESS presents the evaluated protein interaction and the linked interactions as figures, which can help users better understand the position of this protein interaction in the full human protein interaction network (Fig. 6C). In addition, PRINCESS can illustrate details such as the three- or four-interaction loops and the interacting domains, which help the user understand the neighborhood and the structural basis of the query interactions (Fig. 6D). We use SVG (scalable vector graphics) to draw these figures because it allows additional functionalities such as zooming in without loss of resolution. Most importantly, clicking an interactive element of an SVG figure links to the annotation page of the corresponding protein or interaction.

      DISCUSSION

      Although biological discovery has benefited from large scale proteomics data, extracting confident biological conclusions from these data remains a major challenge. One of the main reasons is the high proportion of “false positives” in these proteomics data (von Mering et al., Comparative assessment of large-scale data sets of protein-protein interactions; Bader et al., Gaining confidence in high-throughput protein interaction networks). Analyses of these data often lack confidence information. In this study, we present a novel assessment system to evaluate the reliability of high throughput human protein interaction data. We first construct multiple types of biological evidence and use the likelihood ratio (LR) to measure the reliability of each. Then we use naïve Bayesian networks to combine the individual types of evidence for confidence assessment. Cross-validation shows that this system has high sensitivity and good specificity. The system can also identify the true interactions within human HTPID.
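      The combination step can be sketched in a few lines of Python. Under the naïve Bayes assumption that the types of evidence are conditionally independent, the posterior odds of an interaction equal the prior odds multiplied by the LR of each type of evidence. The counts, the prior odds, and the function names below are hypothetical and chosen only for illustration; they are not values from this study.

```python
from math import prod


def likelihood_ratio(pos_with, pos_total, neg_with, neg_total):
    """LR = P(evidence | true interaction) / P(evidence | non-interaction).

    pos_with / pos_total: gold standard positives showing the evidence
    neg_with / neg_total: gold standard negatives showing the evidence
    """
    return (pos_with / pos_total) / (neg_with / neg_total)


def posterior_odds(prior_odds, lrs):
    """Naive Bayes combination: independent evidence, so the LRs multiply."""
    return prior_odds * prod(lrs)


def odds_to_probability(odds):
    """Convert odds in favor of an interaction into a probability."""
    return odds / (1.0 + odds)


# Hypothetical counts for two types of evidence.
lr_domain = likelihood_ratio(300, 1000, 20, 1000)
lr_coexpression = likelihood_ratio(400, 1000, 100, 1000)

# Hypothetical prior odds that a random protein pair interacts.
prior = 1 / 600

odds = posterior_odds(prior, [lr_domain, lr_coexpression])
probability = odds_to_probability(odds)
```

      An LR above 1 raises the odds and an LR below 1 lowers them, so in a fuller model evidence that is absent would also contribute its own LR rather than being ignored.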
      Compared with previous assessment models, PRINCESS gives the best performance against the gold standard data sets. This advantage may result from the following points. 1) PRINCESS integrates more than one type of biological evidence, which reduces the false positives and false negatives that arise from relying on a single type of evidence. 2) PRINCESS measures the reliability of each type of biological evidence by its LR rather than by simple voting. 3) PRINCESS stratifies a data set into different confidence bins, improving its sensitivity in identifying true protein interactions. These improvements make PRINCESS more informative and more predictive.
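      The difference between LR weighting and simple voting, and the effect of stratifying by confidence, can be seen in a small Python sketch. The LR values, the bin cut-offs, and the bin labels are hypothetical, and for brevity only evidence that is present contributes to the score.

```python
from bisect import bisect_right
from math import prod

# Hypothetical LR of each type of evidence when that evidence is present.
EVIDENCE_LR = {
    "domain": 15.0,
    "coexpression": 4.0,
    "ortholog": 30.0,
    "topology": 2.0,
}

# Hypothetical cut-offs on the combined LR that separate the confidence bins.
CUTOFFS = [10.0, 100.0]
LABELS = ["low", "medium", "high"]


def vote_count(present):
    """Simple voting: every type of evidence counts the same."""
    return len(present)


def combined_lr(present):
    """LR weighting: each type of evidence contributes its own reliability."""
    return prod(EVIDENCE_LR[name] for name in present)


def confidence_bin(present):
    """Assign an interaction to a confidence bin by its combined LR."""
    return LABELS[bisect_right(CUTOFFS, combined_lr(present))]


# Two interactions with the same number of votes but different reliability.
interaction_a = ["domain", "ortholog"]
interaction_b = ["coexpression", "topology"]
```

      Both interactions receive two votes, yet their combined LRs differ by more than an order of magnitude, so they fall into different bins.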
      Besides confidence assessment, another advantage of PRINCESS over similar tools is that it can annotate protein interactions from multiple aspects. When designing experiments to explore the function of a protein, for example, it is often important to find the biological evidence for the potential interactions in which the protein participates. PRINCESS can find this supporting evidence for the candidate interactions. To make this strategy available to the community, we have implemented it as a protein-protein interaction confidence assessment and annotation Web service that supports on-line queries with multiple options, network visualization, and detailed information presentation.
      In PRINCESS, we integrated multiple heterogeneous data sources. It is possible that some of the supporting evidence for an interaction is derived from the same “wet lab” experiment that was the source of the validated data set. During the 5-fold cross-validation, we therefore treat such interactions each time as if they lacked this evidence, so that it does not inflate the measured performance of PRINCESS.
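      One way to picture this safeguard is the Python sketch below, which splits a toy data set into five folds and, for each held-out interaction, marks as absent any evidence that shares its source with the validated data set. The record layout, the field names evidence and shared_source, and the toy data are assumptions made for illustration; the actual procedure used in PRINCESS may differ in its details.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def mask_shared_evidence(interaction):
    """Return a copy in which evidence sharing its source with the
    validated data set is treated as absent."""
    masked = dict(interaction["evidence"])
    for name in interaction["shared_source"]:
        masked[name] = False
    return {**interaction, "evidence": masked}


# Hypothetical toy data: ten interactions with two types of evidence each.
interactions = [
    {
        "pair": (f"P{i}", f"Q{i}"),
        "evidence": {"domain": True, "ortholog": i % 2 == 0},
        "shared_source": {"ortholog"} if i % 3 == 0 else set(),
    }
    for i in range(10)
]

splits = list(k_fold_splits(len(interactions)))
for train_idx, test_idx in splits:
    train = [interactions[j] for j in train_idx]
    test = [mask_shared_evidence(interactions[j]) for j in test_idx]
    # A model would be fitted on `train` and scored on `test` here.
```

      Masking works on a copy, so the original records stay intact for the other folds.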
      In addition, we have applied PRINCESS not only to human but also to yeast and achieved equally excellent performance (because of the page limitation, the details of the yeast results are presented in supplemental Fig. S3). However, we have to admit that PRINCESS, as a bioinformatic tool that depends heavily on current biological data sets, might misclassify true interactions that are supported by few or none of the six types of evidence, because some less studied genes/proteins currently lack certain biological information, such as abundance, biological process, cellular localization, or interaction domain.
      However, the Bayesian model used in PRINCESS permits us to integrate additional heterogeneous biological data. We are now planning to integrate genetic interaction, phylogenetic distance, and experimental data into PRINCESS. As more biological data sources (depending on the development of biological technology) and more types of evidence are integrated, PRINCESS should achieve better performance, and its “false negatives” should gradually decrease.

      Acknowledgments

      We thank Christian von Mering for kindly supplying the STRING data sets; Songfeng Wu, Lei Dou, Hao Guo, Jianqi Li, and Lin Hou for fruitful discussions; and Dongsheng Li for hardware and software support. We also thank two anonymous reviewers for helpful comments.

      Supplementary Material

      REFERENCES

        • Hartwell L.H.
        • Hopfield J.J.
        • Leibler S.
        • Murray A.W.
        From molecular to modular cell biology.
        Nature. 1999; 402: C47-C52
        • Bray D.
        Molecular networks: the top-down view.
        Science. 2003; 301: 1864-1865
        • Uetz P.
        • Giot L.
        • Cagney G.
        • Mansfield T.A.
        • Judson R.S.
        • Knight J.R.
        • Lockshon D.
        • Narayan V.
        • Srinivasan M.
        • Pochart P.
        • Qureshi-Emili A.
        • Li Y.
        • Godwin B.
        • Conover D.
        • Kalbfleisch T.
        • Vijayadamodar G.
        • Yang M.
        • Johnston M.
        • Fields S.
        • Rothberg J.M.
        A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.
        Nature. 2000; 403: 623-627
        • Ito T.
        • Chiba T.
        • Ozawa R.
        • Yoshida M.
        • Hattori M.
        • Sakaki Y.
        A comprehensive two-hybrid analysis to explore the yeast protein interactome.
        Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4569-4574
        • Giot L.
        • Bader J.S.
        • Brouwer C.
        • Chaudhuri A.
        • Kuang B.
        • Li Y.
        • Hao Y.L.
        • Ooi C.E.
        • Godwin B.
        • Vitols E.
        • Vijayadamodar G.
        • Pochart P.
        • Machineni H.
        • Welsh M.
        • Kong Y.
        • Zerhusen B.
        • Malcolm R.
        • Varrone Z.
        • Collis A.
        • Minto M.
        • Burgess S.
        • McDaniel L.
        • Stimpson E.
        • Spriggs F.
        • Williams J.
        • Neurath K.
        • Ioime N.
        • Agee M.
        • Voss E.
        • Furtak K.
        • Renzulli R.
        • Aanensen N.
        • Carrolla S.
        • Bickelhaupt E.
        • Lazovatsky Y.
        • DaSilva A.
        • Zhong J.
        • Stanyon C.A.
        • Finley Jr., R.L.
        • White K.P.
        • Braverman M.
        • Jarvie T.
        • Gold S.
        • Leach M.
        • Knight J.
        • Shimkets R.A.
        • McKenna M.P.
        • Chant J.
        • Rothberg J.M.
        A protein interaction map of Drosophila melanogaster.
        Science. 2003; 302: 1727-1736
        • Li S.
        • Armstrong C.M.
        • Bertin N.
        • Ge H.
        • Milstein S.
        • Boxem M.
        • Vidalain P.O.
        • Han J.D.
        • Chesneau A.
        • Hao T.
        • Goldberg D.S.
        • Li N.
        • Martinez M.
        • Rual J.F.
        • Lamesch P.
        • Xu L.
        • Tewari M.
        • Wong S.L.
        • Zhang L.V.
        • Berriz G.F.
        • Jacotot L.
        • Vaglio P.
        • Reboul J.
        • Hirozane-Kishikawa T.
        • Li Q.
        • Gabel H.W.
        • Elewa A.
        • Baumgartner B.
        • Rose D.J.
        • Yu H.
        • Bosak S.
        • Sequerra R.
        • Fraser A.
        • Mango S.E.
        • Saxton W.M.
        • Strome S.
        • Van Den Heuvel S.
        • Piano F.
        • Vandenhaute J.
        • Sardet C.
        • Gerstein M.
        • Doucette-Stamm L.
        • Gunsalus K.C.
        • Harper J.W.
        • Cusick M.E.
        • Roth F.P.
        • Hill D.E.
        • Vidal M.
        A map of the interactome network of the metazoan C. elegans.
        Science. 2004; 303: 540-543
        • Stelzl U.
        • Worm U.
        • Lalowski M.
        • Haenig C.
        • Brembeck F.H.
        • Goehler H.
        • Stroedicke M.
        • Zenkner M.
        • Schoenherr A.
        • Koeppen S.
        • Timm J.
        • Mintzlaff S.
        • Abraham C.
        • Bock N.
        • Kietzmann S.
        • Goedde A.
        • Toksoz E.
        • Droege A.
        • Krobitsch S.
        • Korn B.
        • Birchmeier W.
        • Lehrach H.
        • Wanker E.E.
        A human protein-protein interaction network: a resource for annotating the proteome.
        Cell. 2005; 122: 957-968
        • Rual J.F.
        • Venkatesan K.
        • Hao T.
        • Hirozane-Kishikawa T.
        • Dricot A.
        • Li N.
        • Berriz G.F.
        • Gibbons F.D.
        • Dreze M.
        • Ayivi-Guedehoussou N.
        • Klitgord N.
        • Simon C.
        • Boxem M.
        • Milstein S.
        • Rosenberg J.
        • Goldberg D.S.
        • Zhang L.V.
        • Wong S.L.
        • Franklin G.
        • Li S.
        • Albala J.S.
        • Lim J.
        • Fraughton C.
        • Llamosas E.
        • Cevik S.
        • Bex C.
        • Lamesch P.
        • Sikorski R.S.
        • Vandenhaute J.
        • Zoghbi H.Y.
        • Smolyar A.
        • Bosak S.
        • Sequerra R.
        • Doucette-Stamm L.
        • Cusick M.E.
        • Hill D.E.
        • Roth F.P.
        • Vidal M.
        Towards a proteome-scale map of the human protein-protein interaction network.
        Nature. 2005; 437: 1173-1178
        • Ho Y.
        • Gruhler A.
        • Heilbut A.
        • Bader G.D.
        • Moore L.
        • Adams S.L.
        • Millar A.
        • Taylor P.
        • Bennett K.
        • Boutilier K.
        • Yang L.
        • Wolting C.
        • Donaldson I.
        • Schandorff S.
        • Shewnarane J.
        • Vo M.
        • Taggart J.
        • Goudreault M.
        • Muskat B.
        • Alfarano C.
        • Dewar D.
        • Lin Z.
        • Michalickova K.
        • Willems A.R.
        • Sassi H.
        • Nielsen P.A.
        • Rasmussen K.J.
        • Andersen J.R.
        • Johansen L.E.
        • Hansen L.H.
        • Jespersen H.
        • Podtelejnikov A.
        • Nielsen E.
        • Crawford J.
        • Poulsen V.
        • Sorensen B.D.
        • Matthiesen J.
        • Hendrickson R.C.
        • Gleeson F.
        • Pawson T.
        • Moran M.F.
        • Durocher D.
        • Mann M.
        • Hogue C.W.
        • Figeys D.
        • Tyers M.
        Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry.
        Nature. 2002; 415: 180-183
        • Gavin A.C.
        • Bosche M.
        • Krause R.
        • Grandi P.
        • Marzioch M.
        • Bauer A.
        • Schultz J.
        • Rick J.M.
        • Michon A.M.
        • Cruciat C.M.
        • Remor M.
        • Hofert C.
        • Schelder M.
        • Brajenovic M.
        • Ruffner H.
        • Merino A.
        • Klein K.
        • Hudak M.
        • Dickson D.
        • Rudi T.
        • Gnau V.
        • Bauch A.
        • Bastuck S.
        • Huhse B.
        • Leutwein C.
        • Heurtier M.A.
        • Copley R.R.
        • Edelmann A.
        • Querfurth E.
        • Rybin V.
        • Drewes G.
        • Raida M.
        • Bouwmeester T.
        • Bork P.
        • Seraphin B.
        • Kuster B.
        • Neubauer G.
        • Superti-Furga G.
        Functional organization of the yeast proteome by systematic analysis of protein complexes.
        Nature. 2002; 415: 141-147
        • von Mering C.
        • Krause R.
        • Snel B.
        • Cornell M.
        • Oliver S.G.
        • Fields S.
        • Bork P.
        Comparative assessment of large-scale data sets of protein-protein interactions.
        Nature. 2002; 417: 399-403
        • Lin N.
        • Zhao H.
        Are scale-free networks robust to measurement errors?.
        BMC Bioinformatics. 2005; 6: 119
        • Han J.D.
        • Dupuy D.
        • Bertin N.
        • Cusick M.E.
        • Vidal M.
        Effect of sampling on topology predictions of protein-protein interaction networks.
        Nat. Biotechnol. 2005; 23: 839-844
        • Han J.D.
        • Bertin N.
        • Hao T.
        • Goldberg D.S.
        • Berriz G.F.
        • Zhang L.V.
        • Dupuy D.
        • Walhout A.J.
        • Cusick M.E.
        • Roth F.P.
        • Vidal M.
        Evidence for dynamically organized modularity in the yeast protein-protein interaction network.
        Nature. 2004; 430: 88-93
        • Deane C.M.
        • Salwinski L.
        • Xenarios I.
        • Eisenberg D.
        Protein interactions: two methods for assessment of the reliability of high throughput observations.
        Mol. Cell. Proteomics. 2002; 1: 349-356
        • Matthews L.R.
        • Vaglio P.
        • Reboul J.
        • Ge H.
        • Davis B.P.
        • Garrels J.
        • Vincent S.
        • Vidal M.
        Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”.
        Genome Res. 2001; 11: 2120-2126
        • Ng S.K.
        • Zhang Z.
        • Tan S.H.
        • Lin K.
        InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes.
        Nucleic Acids Res. 2003; 31: 251-254
        • Lehner B.
        • Fraser A.G.
        A first-draft human protein-interaction map.
        Genome Biol. 2004; 5: R63
        • Kemmeren P.
        • van Berkum N.L.
        • Vilo J.
        • Bijma T.
        • Donders R.
        • Brazma A.
        • Holstege F.C.
        Protein interaction verification and functional annotation by integrated analysis of genome-scale data.
        Mol. Cell. 2002; 9: 1133-1143
        • Hahn A.
        • Rahnenführer J.
        • Talwar P.
        • Lengauer T.
        Confirmation of human protein interaction data by human expression data.
        BMC Bioinformatics. 2005; 6: 112
        • Goldberg D.S.
        • Roth F.P.
        Assessing experimentally derived interactions in a small world.
        Proc. Natl. Acad. Sci. U. S. A. 2003; 100: 4372-4376
        • Saito R.
        • Suzuki H.
        • Hayashizaki Y.
        Interaction generality, a measurement to assess the reliability of a protein-protein interaction.
        Nucleic Acids Res. 2002; 30: 1163-1168
        • Bader J.S.
        • Chaudhuri A.
        • Rothberg J.M.
        • Chant J.
        Gaining confidence in high-throughput protein interaction networks.
        Nat. Biotechnol. 2004; 22: 78-85
        • Ashburner M.
        • Ball C.A.
        • Blake J.A.
        • Botstein D.
        • Butler H.
        • Cherry J.M.
        • Davis A.P.
        • Dolinski K.
        • Dwight S.S.
        • Eppig J.T.
        • Harris M.A.
        • Hill D.P.
        • Issel-Tarver L.
        • Kasarskis A.
        • Lewis S.
        • Matese J.C.
        • Richardson J.E.
        • Ringwald M.
        • Rubin G.M.
        • Sherlock G.
        Gene Ontology: tool for the unification of biology.
        Nat. Genet. 2000; 25: 25-29
        • Eddy S.R.
        What is Bayesian statistics?.
        Nat. Biotechnol. 2004; 22: 1177-1178
        • Jansen R.
        • Yu H.
        • Greenbaum D.
        • Kluger Y.
        • Krogan N.J.
        • Chung S.
        • Emili A.
        • Snyder M.
        • Greenblatt J.F.
        • Gerstein M.
        A Bayesian networks approach for predicting protein-protein interactions from genomic data.
        Science. 2003; 302: 449-453
        • Rhodes D.R.
        • Tomlins S.A.
        • Varambally S.
        • Mahavisno V.
        • Barrette T.
        • Kalyana-Sundaram S.
        • Ghosh D.
        • Pandey A.
        • Chinnaiyan A.M.
        Probabilistic model of the human protein-protein interaction network.
        Nat. Biotechnol. 2005; 23: 951-959
        • Xia K.
        • Dong D.
        • Han J.D.
        IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model.
        BMC Bioinformatics. 2006; 7: 508
        • Patil A.
        • Nakamura H.
        Filtering high-throughput protein-protein interaction data using a combination of genomic features.
        BMC Bioinformatics. 2005; 6: 100
        • Peri S.
        • Navarro J.D.
        • Kristiansen T.Z.
        • Amanchy R.
        • Surendranath V.
        • Muthusamy B.
        • Gandhi T.K.
        • Chandrika K.N.
        • Deshpande N.
        • Suresh S.
        • Rashmi B.P.
        • Shanker K.
        • Padma N.
        • Niranjan V.
        • Harsha H.C.
        • Talreja N.
        • Vrushabendra B.M.
        • Ramya M.A.
        • Yatish A.J.
        • Joy M.
        • Shivashankar H.N.
        • Kavitha M.P.
        • Menezes M.
        • Choudhury D.R.
        • Ghosh N.
        • Saravana R.
        • Chandran S.
        • Mohan S.
        • Jonnalagadda C.K.
        • Prasad C.K.
        • Kumar-Sinha C.
        • Deshpande K.S.
        • Pandey A.
        Human protein reference database as a discovery resource for proteomics.
        Nucleic Acids Res. 2004; 32: D497-D501
        • Pagel P.
        • Mewes H.W.
        • Frishman D.
        Conservation of protein-protein interactions—lessons from ascomycota.
        Trends Genet. 2004; 20: 72-76
        • Kelley B.P.
        • Sharan R.
        • Karp R.M.
        • Sittler T.
        • Root D.E.
        • Stockwell B.R.
        • Ideker T.
        Conserved pathways within bacteria and yeast as revealed by global protein network alignment.
        Proc. Natl. Acad. Sci. U. S. A. 2003; 100: 11394-11399
        • Salwinski L.
        • Miller C.S.
        • Smith A.J.
        • Pettit F.K.
        • Bowie J.U.
        • Eisenberg D.
        The Database of Interacting Proteins: 2004 update.
        Nucleic Acids Res. 2004; 32: D449-D451
        • O'Brien K.P.
        • Remm M.
        • Sonnhammer E.L.
        Inparanoid: a comprehensive database of eukaryotic orthologs.
        Nucleic Acids Res. 2005; 33: D476-D480
        • Witten I.H.
        • Frank E.
        Data Mining: Practical Machine Learning Techniques with Java Implementations.
        Morgan Kaufmann, San Francisco, 2000
        • Frank E.
        • Hall M.A.
        • Holmes G.
        • Kirkby R.
        • Pfahringer B.
        • Witten I.H.
        • Trigg L.
        Weka.
        in: Maimon O. Rokach L. The Data Mining and Knowledge Discovery Handbook. Springer, New York, 2005: 1305-1314
        • Pawson T.
        • Nash P.
        Assembly of cell regulatory systems through protein interaction domains.
        Science. 2003; 300: 445-452
        • Kersey P.J.
        • Duarte J.
        • Williams A.
        • Karavidopoulou Y.
        • Birney E.
        • Apweiler R.
        The International Protein Index: an integrated database for proteomics experiments.
        Proteomics. 2004; 4: 1985-1988
        • Stein A.
        • Russell R.B.
        • Aloy P.
        3did: interacting protein domains of known three-dimensional structure.
        Nucleic Acids Res. 2005; 33: D413-D417
        • Li D.
        • Li J.
        • Ouyang S.
        • Wu S.
        • Wang J.
        • Xu X.
        • Zhu Y.
        • He F.
        An integrated strategy for functional analysis in large-scale proteomic research by Gene Ontology.
        Prog. Biochem. Biophys. 2005; 32: 1026-1029
        • Marcotte E.M.
        Computational genetics: finding protein function by nonhomology methods.
        Curr. Opin. Struct. Biol. 2000; 10: 359-365
        • Valencia A.
        • Pazos F.
        Computational methods for the prediction of protein interaction.
        Curr. Opin. Struct. Biol. 2002; 12: 368-373
        • von Mering C.
        • Jensen L.J.
        • Kuhn M.
        • Chaffron S.
        • Doerks T.
        • Krüger B.
        • Snel B.
        • Bork P.
        STRING 7—recent developments in the integration and prediction of protein interactions.
        Nucleic Acids Res. 2007; 35: D358-D362
        • Eisen M.B.
        • Spellman P.T.
        • Brown P.O.
        • Botstein D.
        Cluster analysis and display of genome-wide expression patterns.
        Proc. Natl. Acad. Sci. U. S. A. 1998; 95: 14863-14868
        • Su A.I.
        • Welsh J.B.
        • Sapinoso L.M.
        • Kern S.G.
        • Dimitrov P.
        • Lapp H.
        • Schultz P.G.
        • Powell S.M.
        • Moskaluk C.A.
        • Frierson Jr., H.F.
        • Hampton G.M.
        Molecular classification of human carcinomas by use of gene expression signatures.
        Cancer Res. 2001; 61: 7388-7393
        • Segal N.H.
        • Pavlidis P.
        • Noble W.S.
        • Antonescu C.R.
        • Viale A.
        • Wesley U.V.
        • Busam K.
        • Gallardo H.
        • DeSantis D.
        • Brennan M.F.
        • Cordon-Cardo C.
        • Wolchok J.D.
        • Houghton A.N.
        Classification of clear-cell sarcoma as a subtype of melanoma by genomic profiling.
        J. Clin. Oncol. 2003; 21: 1775-1781
        • Milo R.
        • Shen-Orr S.
        • Itzkovitz S.
        • Kashtan N.
        • Chklovskii D.
        • Alon U.
        Network motifs: simple building blocks of complex networks.
        Science. 2002; 298: 824-827
        • Baldi P.
        • Brunak S.
        • Chauvin Y.
        • Andersen C.A.
        • Nielsen H.
        Assessing the accuracy of prediction algorithms for classification: an overview.
        Bioinformatics. 2000; 16: 412-424
        • SPSS, Inc.
        SPSS Base 10.0 User's Guide.
        SPSS, Inc., Chicago, 1999: 431-434
        • Wain H.M.
        • Lush M.
        • Ducluzeau F.
        • Povey S.
        Genew: the human gene nomenclature database.
        Nucleic Acids Res. 2002; 30: 169-171