Network-based Identification of Novel Cancer Genes

Genes involved in cancer susceptibility and progression can serve as templates for searching protein networks for novel cancer genes. To this end, we introduce a general network searching method, MaxLink, and apply it to find and rank cancer gene candidates by their connectivity to known cancer genes. Using a comprehensive protein interaction network, we searched for genes connected to known cancer genes. First, we compiled a new set of 812 genes involved in cancer, more than twice the number in the Cancer Gene Census. Their network neighbors were then extracted. This candidate list was refined by selecting genes with unexpectedly high levels of connectivity to cancer genes and without previous association to cancer. This produced a list of 1891 new cancer candidates with up to 55 connections to known cancer genes. We validated our method by cross-validation, Gene Ontology term bias, and differential expression in cancer versus normal tissue. An example novel cancer gene candidate is presented with detailed analysis of the local network and neighbor annotation. Our study provides a ranked list of high priority targets for further studies in cancer research. Supplemental material is included.

The function of a protein can be expressed in terms of its interactions with other molecules. All interactions between all proteins define the "protein interactome," i.e. the complete interaction network of the proteins of an organism. These networks form the backbone of molecular pathways and cellular processes. Thus, the construction of interaction networks will shed light on many aspects of the dynamic and interactive function of human proteins.
Several efforts in reconstructing the human interactome are ongoing. Interactions may be measured directly with high throughput yeast two-hybrid or pulldown assays (1,2). Experimental interactions have been collected from multiple sources to build large interaction networks (3)(4)(5). The network can be augmented considerably by inferred interactions either in the same or from other species (6 -10). The largest predicted human interactome is currently provided by FunCoup (11), which uses eight types of evidence and transfers interactions extensively from model organism orthologs.
The development of new therapeutics and diagnostics rely on the understanding of disease mechanisms. Therefore, the identification of novel disease-associated genes is of great importance. Disease genes have traditionally been found by genetic linkage analysis or gene association studies, but this is very time-consuming and costly and often fails due to lack of data. Particularly for complex diseases involving many genes, these methods are unreliable (12).
Bioinformatics methods can be used to accelerate disease gene discovery either based on gene annotation and sequence features (13,14) or based on network analysis (15)(16)(17)(18)(19). The network-based methods normally connect gene networks with phenotype networks to infer gene-disease relationships. These works, however, are limited to using only direct interaction data and/or were only applied to rank a short list of candidate genes in a genomic interval.
Here we describe a new generic network-based approach, MaxLink, for predicting novel candidate members to known biomolecular processes and pathways. A typical application is the identification of new disease genes based on a set of known disease genes. We applied MaxLink to the human interactome generated by FunCoup to screen for new cancer genes. To seed the screen, we compiled a list of 812 known cancer genes, 364 from the Cancer Gene Census (20) and 448 genes from text mining.
MaxLink assigns a score to every new candidate gene based on the number of links to a seed set. We show that the maxlink score is a useful indicator of candidate reliability by three types of validations: cross-validation, differential cancer expression, and GO 1 term analysis. The screen resulted in nearly 2000 candidates of which nearly 200 are connected to over 10 known cancer genes. These genes have, to our knowledge, no clear former evidence supporting association with cancer. However, their network connection to cancer genes makes them worth particular focus when developing biomarkers or studying oncogenesis. As the candidate list is long, it makes sense to explore the top ranking genes first.

MATERIALS AND METHODS
Retrieval of Known Cancer Genes-The input data set of known cancer genes was collected from Swiss-Prot (21) (20). The Swiss-Prot genes were identified by searching annotations in the CC field, which represents curated annotations and includes a subcategory for annotations indicating disease involvement. The disease annotations of the CC field were matched against cancer-specific terms (see supplemental Table 4), and genes for which a match could be found were added to the set of known cancer genes. Genes and matching keywords are detailed in supplemental Table 1.
GO Analysis of Known Cancer Genes-The Gene Ontology functional term analysis was done using the amiGO web site. Enrichment analysis of terms in the major cluster (348 genes) versus UniProtKB (20,740 genes) resulted in a total of 231 terms with p Ͻ 10 Ϫ2 . This list was abbreviated by requiring p Ͻ 10 Ϫ10 and enrichment Ͼ5, resulting in 34 GO terms (supplemental Table 2).
Network-based Identification of Candidate Genes-We used the human FunCoup protein network (11) to identify network neighbors to the previously retrieved input genes. Only links with a confidence value Ͼ0.75 were considered. Each candidate gene was assigned a maxlink score for ranking that equals the number of linked known cancer genes.
Annotation Filter-To identify genes with possible cancer annotations, the complete UniProt (22) DE, KW, CC, and FT fields as well as reference titles were searched for cancer-specific text terms (supplemental Table 4). Genes with a match were excluded from the candidates list. Additionally, genes with a gene identifier not found in the current version (version 51) of Ensembl (23) were also excluded.
Connectivity Filter-If the majority of a candidate's connections were to non-cancer genes, it was deemed of low cancer specificity and was rejected. For this analysis, we divided all genes into two sets: 1) the known cancer genes plus all genes with any cancer annotation (see "Annotation Filter" above) and 2) all other genes. The gene counts of these sets were 4953 and 12,198. Consequently, genes exhibiting over 2.46 times more links to genes not associated with cancer than to the known cancer genes were removed.
Differential Expression in Human Protein Atlas-We devised a score (differential expression score (DE)) for differential protein expression levels in 18 different cancer types relative to their normal tissue counterparts (see Table I) from the 3.0 version of the Human Protein Atlas. DE was calculated by subtracting the average expression in a normal tissue from the average expression in the corresponding cancer tissue for each gene and tissue. To avoid tissuespecific biases, raw DE values for each tissue were replaced by Z-scores based on the expression distribution of each tissue. A Z-score of 1 represents one standard deviation above the mean. Finally, the total DE for each gene was calculated by taking the average of all absolute DE values for all 18 tissues.
Analysis of Cancer-associated GO Terms-GO terms for all genes were retrieved from Ensembl via BIOMART (23), and the terms were expanded to include all higher level terms. All GO terms for the set of known cancer genes were tested for significant enrichment (fold change) with a hypergeometric test. The set of cancer-associated GO terms was then tested for significance (p Ͻ 0.05) for subsets of the candidates composed of all genes having a number of linked known cancer genes above or equal to a cutoff defining that subset. Relative fold changes for subsets were subsequently calculated for each GO term by taking the logarithm of the subset fold change divided by the fold change of the same term for the known cancer genes. RESULTS We have developed an analysis pipeline to identify and rank candidate cancer genes based on their connectivity to known cancer genes in the FunCoup network. By "known cancer gene," we mean any gene with clear evidence for cancer involvement. To analyze the interconnectedness and clustering of the known cancer genes, we first explored their network topology. Then, using them as seeds, we extracted candidate novel cancer genes and refined this list by applying quality filters. Finally, to validate our approach, we used three types of independent validation tests: cross-validation, enrichment of cancer GO terms, and differential expression in cancer versus normal tissue.
New Compilation of Known Cancer Genes-Our approach starts with collecting known cancer genes. In a previous survey, the Cancer Gene Census, Futreal et al. (20) identified 364 cancer genes. By text mining Swiss-Prot for genes annotated to be involved in cancer, we identified 703 genes. Merging this list with the Cancer Gene Census resulted in 812 unique cancer genes (supplemental Table 1).
To analyze this set of genes in terms of network structure, we examined how they cluster into interconnected modules. This revealed one major component with 348 members, 12 small clusters with 2-9 members, and 429 singletons as shown in Fig. 1. Thus, 43% of the known cancer genes were interconnected in a single subnetwork that should represent processes central to cancer. To verify this, we analyzed enrichment of functional annotation terms in the Gene Ontology database (24) relative to all human genes. We observed strong enrichment (Ͼ5-fold enrichment, p Ͻ 10 Ϫ10 ) for terms such as DNA repair and replication, cell cycle regulation, and apoptosis (supplemental Table 2). This is well in line with known cancer-associated processes.
Screen for Candidate Cancer Genes-The FunCoup network was used to retrieve 4049 potential candidates con- nected to known cancer genes by high confidence links. Because our aim was to find genes previously not associated with cancer, the list was further refined by a number of filters. In the first step, 1511 genes that had any annotation suggesting a potential association with cancer were removed from the list. Because FunCoup was built using data sets of which some were linked to earlier versions of Ensembl, 254 genes were removed to ensure that the candidates are in sync with the current version. This constitutes a broad filter and would likely remove genes with only spurious cancer association. Hublike genes might be spuriously linked to many known cancer genes solely because they have many links and not because they are involved in cancer. Thus, 393 genes were removed in the second step because they had fewer links to known cancer genes than expected by chance given their connectivity in the entire network. Such genes may have been found simply because they are highly connected and not because of a preferential association to the known cancer genes. A schematic representation of the analysis work flow is shown in Fig. 2. After all filters, a final list of 1891 candidates remained with a maxlink score (links to known cancer genes) between 55 and 1 (see supplemental Table 3).
Validation by Cross-validation-If our method works well, it should be able to detect the known cancer genes in a crossvalidation test. We ran MaxLink five times, leaving out 20% of the known cancer genes each time. By doing so, we were able to identify 41.7% of the removed genes on average. However, only 47% of the known cancer genes had links to other input genes; thus, the obtained retrieval is close to the theoretical maximum restricted by the network. As we cannot assess false positives directly, we instead looked at enrichment of the removed known cancer genes among the retrieved genes, i.e. their frequency in the retrieved set relative to their frequency in the entire database. The average enrichment for all removed genes was more than 5-fold (p Ͻ 10 Ϫ25 ). However, this increased to over 12-fold for the genes with highest maxlink score (see Fig. 3).
Validation by Differential Cancer Expression-The Human Protein Atlas (HPA) (25) contains protein expression in both normal tissues and cancers taken from a large number of tissues. Using these data, we calculated a normal versus cancer DE for the 411 candidate cancer genes with expression data in HPA using all 18 cancer types.
To examine the impact of a high maxlink score, we looked at the fraction of genes with DE above 1, i.e. when the differential expression exceeds one standard deviation on average for all cancer types. As seen in Fig. 4, this fraction increased for subsets of the candidates consisting of genes with a higher maxlink score and was considerably enriched compared with the known cancer genes and HPA as a whole. This  Minimum maxlink score of retrieved genes Enrichment 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18

FIG. 3. Enrichment of known cancer genes in cross-validation testing of MaxLink method.
The enrichment is the frequency of cancer genes in the retrieved set relative to their frequency in the entire database. Overall, the enrichment of cancer genes with one or more links is just above 5. Restricting the retrieved set to genes with a higher maxlink score produces a proportionally increased enrichment. The enrichment levels at high maxlink scores are somewhat variable due to small amounts of data.
indicates that candidates linked to a high number of known cancer genes are likely important for cancer.
To investigate whether this trend is caused by a decrease of normal tissue expression or an increase in cancer expression, we plotted the absolute expression levels as a function of maxlink score (see Fig. 5). The overall trend is an increase of both normal and cancer tissue expression but with a relatively higher increase in cancers. Thus, the MaxLink approach can enrich genes differentially expressed in a wide range of cancers, and the maxlink score is a useful indicator of cancer relevance.
We noted that the average DE of the known cancer genes was only slightly higher than that of HPA. This can be explained by the fact that HPA was started with a strong cancer focus and is highly enriched for cancer genes.
Validation by GO Terms-If our candidate cancer genes would show the same GO term enrichment as the known cancer genes, this would give further support to their relevance in cancer. To investigate this, we retrieved all GO terms for the known cancer genes and tested for a significant enrichment. Of a total of 4281 terms, enrichment greater or equal to 2-fold was significant at the 0.05 level for 1716 terms.
These cancer-associated GO terms were subsequently tested for enrichment in subsets of the candidate genes grouped by increasing numbers of links to known cancer genes. The average enrichment increased proportionally to the number of linked known cancer genes (Fig. 6), showing that genes more central in the cancer network are more functionally associated with cancer.
Novel Candidate Cancer Genes-Our screen resulted in a list of 1891 novel candidate genes (supplemental Table 3). Given the above validations of the maxlink score as an indicator of cancer relevance, it makes most sense to focus on those candidates with the most linked known cancer genes. The list contains 185 candidates with 10 or more linked known cancer genes, and these should perhaps be seen as the most urgent targets for focused cancer studies.
To illustrate how a candidate cancer gene may be analyzed further, we chose an example, RPA1 (UniProt accession number P27694), which is a DNA-binding subunit of replication protein A. It was functionally coupled to 34 known cancer genes (see Fig. 7).
Can we predict what cancer type RPA1 is most likely to cause or be associated with? According to HPA, RPA1 is expressed in all cancer types and has differential (͉DE͉ Ͼ 1) expression compared with normal in seven tissues (colorectal, endometrial, head and neck, pancreatic, skin, testis, and urothelial cancer). Thus, even if RPA1 may play a more prominent role in some cancers, it is likely to be a universal cancer gene.
The network neighbors of RPA1 have diverse differential expression patterns, supporting the notion that it is not specific for a certain type of cancer. In Fig. 7, the KEGG pathway member- FIG. 5. Relative protein expression levels of candidate cancer genes in cancer and normal tissues compared with HPA as a whole. The average expression in both cancer (circles) and normal (triangles) tissues was calculated for candidate subsets, binned by maxlink score, and normalized by subtracting the average expression of all genes in HPA for cancer and normal tissues, respectively. The relative expression levels are not strongly correlated with maxlink score, but the difference between cancer and normal expression (diamonds) is. The prevalence of high differential expression at high maxlink scores, as seen in Fig. 4, thus cannot simply be explained by increased cancer expression or decreased normal expression. The expression levels are discrete as used by HPA (29): 1 represents none, 10 represents low, 50 represents moderate, and 250 represents high expression. Expression in both normal and cancer tissues is generally increased for genes with higher maxlink score but with a relatively higher increase for cancer tissues.
ship of the neighbors is indicated. Although this gives a very incomplete picture because of the low coverage of KEGG for cancer (for instance, breast cancer is absent), it does reveal several cancer types such as colorectal and pancreatic.
A literature search revealed that in mice RPA1 has been shown to cause defects in DNA double strand break repair, which can lead to leukemia (26). (This information is not present in UniProt.) In human, RPA1 is located in chromosomal region 17p13.3, which has been implicated in e.g. colorectal and breast cancer (26). These cancers had strong support by DE in HPA for both RPA1 and its neighbors as well as from annotation of many of the neighbors.
Our analysis based on HPA expression, the gene subnetwork, and literature reinforces the connection between RPA1 and cancer, lending support for the cancer types associated with the RPA1 locus but suggesting that it may cause cancer in any tissue. Although it was one of the top ranking novel cancer gene candidates, there is no mention of any cancer association in UniProt or HPA. However, the presented evidences support that it plays a central role in tumorigenesis. DISCUSSION We have described a general network-based approach for identifying and ranking candidate novel genes relevant to a process or disease and have applied it to find novel cancer genes. The validations carried out show that the ranked list produced by our method is enriched for true cancer genes.
Cancer is in this study treated as one disease. This is obviously a simplification but is based on the notion that cancers originating in different tissues are often caused by perturbations in the same pathways, for instance DNA repair, cell cycle regulation, or apoptosis. It is supported by the fact that the network of known cancer genes only formed one large cluster (Fig. 1), which did not show very distinct subclusters. Also, Goh et al. (27) showed that different cancers are often caused by the same genes. The attractiveness of this approach is that genes found in most cancers have a greater potential for diagnostic and therapeutic value.
Because of the modularity of the MaxLink pipeline, other diseases or processes can easily be investigated. A more traditional approach to disease gene hunting is linkage analysis where the gene associated with a disease is known to be found in a genomic interval that can contain in the order of a hundred genes. MaxLink could also be used to prioritize genes for such projects as long as a fair number of genes are already known for the disease in question. The main The relative fold change of cancer-related GO terms is shown for candidate cancer genes linked to a certain number of known cancer genes. The relative fold change of each cancer term is the base 2 logarithm of the fold change of the subset divided by the fold change in the known cancer genes. A relative fold change above 0 means that the cancer terms are more enriched in the candidate subset than in the set of known cancer genes. Candidates linked to more than five known cancer genes have a fold change of cancer terms on average greater than the known cancer genes.
advantage compared with other methods would be the richness of evidence integrated in the FunCoup links. For short lists, it may be necessary to lower the cutoff compared with this study to obtain a reasonable amount of links.
Many of our candidate genes were supported as cancer genes by the HPA database. However, even if a gene does not have differential cancer/normal expression in HPA, this does not disprove its potential implication in cancer. The protein level changes associated with the tumor progression may be too subtle to detect with the HPA technology. However, the predicted functional coupling to a cancer pathway is still valid, and the gene in question may well turn out to be a useful marker or therapeutic target. The "total differential expression" measure used here was only intended to investigate the validity of the MaxLink approach and not as a definite indicator of cancer relevance. Because we average across all cancer types, a gene differ-entially expressed in only one or a few cancer types might receive an unfairly low total DE value.
The main result of this study is the ranked list of novel cancer gene candidates. The 185 candidates connected to 10 or more known cancer genes are prime targets for new experiments that will lead the way to better understanding cancer. Some of the candidates with a lower maxlink score may also develop into important cancer biomarkers or targets, but there is a rationale for focusing on the high scoring genes. A high maxlink score is an indication that the candidate acts as a hub and plays a central role in the process and is more likely to be of widespread importance in many different tumors. On the other hand, such functions are typically also important for healthy tissue homeostasis and may be unsuitable as targets for inhibition. An exception to this would be hubs that act in parallel in healthy tissue but only one is functional in a tumor. Such a situation would make a hub an excellent cancer-specific drug target. FIG. 7. FunCoup subnetwork of candidate cancer gene RPA1 surrounded by functionally coupled known cancer genes. RPA1 is the yellow circle, whereas the known cancer genes are colored/shaped according to KEGG (30) pathway membership. Note that KEGG contains a relatively small number of cancer pathways; hence, most genes are not assigned to any pathway (gray balls). All green genes are from cancer pathways, however. The figure was made using the jSquid applet (31).
Acknowledgments-We thank Vincent Devloo, Olof Karlberg, and Andrey Alexeyenko for assistance and helpful discussions. We also thank the Swedish Human Proteome Resource program for access to the protein atlas data.