To whom correspondence should be addressed: Resource for Biocomputing, Visualization, and Informatics, Dept. of Pharmaceutical Chemistry, University of California, 600 16th St., M/S 2240, San Francisco, CA 94158-2517. Tel.: 415-476-2299; Fax: 415-502-1755.
* This work was supported, in whole or in part, by National Institutes of Health Grant P41 RR01081 from the National Center for Research Resources. This article contains supplemental Fig. 1, Movies 1 and 2, and Data 1–3.
1 The abbreviations used are: GBM, glioblastoma multiforme; TCGA, The Cancer Genome Atlas; PDB, Protein Data Bank; IDH, isocitrate dehydrogenase; CDKN2A, cyclin-dependent kinase inhibitor 2A; CDK4, cyclin-dependent kinase 4; PDGFRA, platelet-derived growth factor receptor, α polypeptide; ERBB3, v-erb-b2 erythroblastic leukemia viral oncogene homolog 3, also known as HER3; TP53, tumor protein 53; PTEN, phosphatase and tensin homolog; EGFR, epidermal growth factor receptor; NF1, neurofibromin 1; FGFR1, basic fibroblast growth factor receptor 1; HIF-1α, hypoxia-inducible factor 1, α subunit; 2HG, 2-hydroxyglutarate; GO, gene ontology; RPB1–12, RNA polymerase subunits 1–12; SEA, Similarity Ensemble Approach; PE, purification enrichment; MCL, Markov cluster.
Linking proteomics and structural data is critical to our understanding of cellular processes, and interactive exploration of these complementary data sets can be extremely valuable for developing or confirming hypotheses in silico. However, few computational tools facilitate linking these types of data interactively. In addition, the tools that do exist are neither well understood nor widely used by the proteomics or structural biology communities. We briefly describe several relevant tools, and then, using three scenarios, we present in depth two tools for the integrated exploration of proteomics and structural data.
A 3-D enhanced version of this article is available. The text is identical to this version but includes interactive figures.
Viewing the enhanced version of this article requires the use of a browser plug-in. Please install the plug-in when prompted. http://www.thesgc.org/iSee/MCP/9/8/e1.html
Structural biology and proteomics provide complementary views of cellular processes. Structural biology is primarily concerned with the structures of biological macromolecules and complexes and the physicochemical interactions they support. Proteomics, on the other hand, tends to take a broader view of how proteins communicate and function within the cell, often encompassing large numbers of proteins that operate in pathways or addressing how groups of proteins work together as a function of time and/or subcellular location. (The term “proteomics,” as used here, includes studies of not only the presence and abundance of proteins under various conditions but also their interactions and their functions, both individually and as parts of larger, more complex systems.) Understanding the molecular interactions between proteins at the atomic level is of obvious utility, yet it is equally critical to understand the broader context of how pathways function and change with differing levels of expression and copy number and how they are controlled by inhibition, activation, and feedback loops.
Given the complementary nature of these approaches, it would seem natural for there to be in silico tools that support the interactive exploration of structural biology within the context of the proteome, and of the results of proteomics experiments from a structural perspective. However, although several studies that link proteomics to structure have been published, there are few existing tools for the interactive, integrated exploration of these complementary types of data. The structural biology and proteomics communities each have a set of commonly used interactive visualization programs, and it would be useful to investigate how these tools could work together and how they could be more tightly linked.
INTERACTIVE TOOLS FOR PROTEIN SYSTEMS DATA
Network visualization and analysis tools are commonly used to interact with proteomics data. This makes good sense: proteomics data are often associated with pathways or protein interactions, and both of these are easily visualized as networks. Even types of data not normally viewed as networks (e.g. microarray results) are often painted onto signaling, metabolic, or other pathways or protein interaction networks for visualization and analysis.
A number of visualization and analysis tools are used for protein networks. The most commonly used tool is certainly Cytoscape (
Structural visualization and analysis have a very long and rich history, and a discussion of the various molecular visualization and analysis packages is beyond the scope of this article. The most common stand-alone molecular visualization packages are PyMOL (
) are often used as web add-ons for structural visualization, and the Research Collaboratory for Structural Bioinformatics and the National Center for Biotechnology Information both have their own viewers (Protein Workshop (
) is a web service that provides the user with a protein-protein interaction network. The user may click on a node to reveal more information about the protein, including a static image of the three-dimensional structure if known. Clicking the image takes the user to the European Molecular Biology Laboratory-European Bioinformatics Institute web entry for that structure, which allows interactive visualization with Jmol.
), a plug-in to Cytoscape that loads the structures for network nodes designated by the user into UCSF Chimera for interactive three-dimensional visualization and analysis. Interaction is bidirectional so that selecting a structure in Chimera will select the appropriate node in Cytoscape.
CYTOSCAPE AND CHIMERA
As discussed above, a variety of computational tools are available for research in proteomics and structural biology, and it is beyond the scope of this article to provide a detailed comparison between them. Some of these tools may be used together to “drill down” from the proteomics, network-oriented view to a structural view. One pair of tools that may be used together in this manner is Cytoscape (
), a well-supported and widely distributed academic molecular visualization and analysis package. We explore how Cytoscape and Chimera might be used together by presenting three example research scenarios. Our focus is on the computational tools rather than on the specific data; the scenarios are based on previously published studies, and the results are not meant to represent novel findings. It is also the case that both Chimera and Cytoscape are relatively sophisticated tools with many features that may require some effort to fully master. Our intent is not to illustrate all of the features available in these tools but rather to provide examples of how they can be applied to gain insight into scientific problems. Lastly, it is difficult to convey the interactive nature of these tools using static images. To give a basic idea of the dynamic nature of the systems under study, we have provided two animations as supplemental Movies 1 and 2.
The first two scenarios deal with glioblastoma multiforme (GBM),
). The TCGA glioblastoma data set consists of three types of data: copy number variation of 17,789 genes for 206 glioblastoma cases, mRNA expression for the same 17,789 genes and 206 cases, and mutation data for 601 sequenced genes for 91 of the cases. In scenario 1, we use Cytoscape to explore a curated signaling pathway obtained from the TCGA data portal. From the GBM mutation data mapped onto the pathway, we choose one mutation of interest and drill down into Chimera to view the possible structural implications.
Scenario 2 focuses on isocitrate dehydrogenase 1 (IDH1), a metabolic enzyme that has been found mutated in glioblastoma. Using networks, we explore the function of wild-type and mutant IDH1 to hypothesize how the mutation might relate to glioblastoma.
Finally, in scenario 3, we look at a protein-protein interaction data set from the budding yeast Saccharomyces cerevisiae and examine how these data can support the modeling of large protein complexes (see the articles by Lasker et al. (
) that may be used to augment existing pathways with additional interaction partners. This scenario uses a curated signaling pathway provided on the TCGA data portal. This pathway represents the most frequently altered genes in glioblastoma based on the TCGA phase I data (
). In addition to the curated pathway, the TCGA data portal provides downloads of the three data sets that can be used to annotate the pathway: expression, copy number variation, and mutations. Supplemental Fig. 1 shows a screenshot of Cytoscape with the TCGA-curated pathway for glioblastoma loaded and provides a description of the user interface for Cytoscape. A Cytoscape session file with the TCGA pathway is included as supplemental Data 1.
First we explore the expression profile of each of the genes across all of the tumor patients. The differential regulation of gene expression has been associated with a large number of diseases and can implicate specific genes. The usual mechanism to view differential gene expression across multiple genes and conditions is to hierarchically cluster the data and view the results as a heat map with dendrograms representing the clusters for both genes and conditions, where each tumor represents a different condition in this example. After annotating the TCGA pathway with the mRNA expression results, we can use the Cytoscape clusterMaker plug-in to perform the clustering (Fig. 1). As described in the TCGA 2008 report, this clustering does not lead to any obvious conclusions; that is, none of the genes are overexpressed or underexpressed in all (or even most) of the tumors.
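To make the clustering step concrete, the following is a minimal sketch of agglomerative (average-linkage) hierarchical clustering on expression profiles. It is not the clusterMaker implementation; the function names, the Euclidean distance metric, and the toy expression values are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles maps a gene name to its expression values across conditions.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram is drawn from.
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []

    def distance(pair):
        # Average of all pairwise distances between members of two clusters.
        a, b = pair
        total = sum(euclidean(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=distance)
        merges.append((a, b, distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy values (invented for illustration, not TCGA data): rows are genes,
# columns are four hypothetical tumors.
toy_profiles = {
    "gene_A": [2.0, 1.8, -0.1, 0.0],
    "gene_B": [1.9, 2.1, 0.1, -0.2],
    "gene_C": [-1.5, -1.7, 0.2, 0.1],
    "gene_D": [-1.6, -1.4, 0.0, 0.6],
}
toy_merges = average_linkage(toy_profiles)
```

Clustering the tumors instead of the genes only requires transposing the table so that each tumor maps to its vector of gene values.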
On the other hand, looking at the clustering of tumors, we can see two broad categories: those overexpressing CDKN2A or CDK4 and those underexpressing PDGFRA/ERBB3 and CDKN2A. Although these groups are discernable, there remain certain inconsistencies within the groups that prevent a clear categorization of the tumors.
To view differential mRNA expression in the context of the pathway, we can animate the coloring of the nodes in the pathway using the “Map colors to network” capability of the clusterMaker plug-in (supplemental Movie 1), and it can be seen that for each tumor some sets of genes are either over- or underexpressed, but there is no readily discernable pattern. The lack of expression patterns might lead to an exploration of copy number variation or mutations. Like the mRNA expression data, copy number variations can be analyzed by clustering (Fig. 2), although a more detailed analysis, including structural data where available, may be required for the individual mutations.
For the mutation analysis, we used the same TCGA pathway, annotated it with the known structures for each gene product from the Protein Data Bank (PDB), and imported the mutation data from the TCGA data portal. We added to the imported data an additional column to represent the percentage of tumors that were mutated for each gene. Fig. 3 is an export from Cytoscape showing each gene colored by the percentage of sequenced tumors showing mutations for that gene. Among the most strongly colored genes are TP53, PTEN, EGFR, and NF1, which have been identified previously as mutated in many tumors. Interestingly, the most frequently mutated gene, TP53, is only mutated in 34% of the tumors, and the tyrosine phosphatase PTEN is only modified in 30.7% of the tumors. One of the less frequently mutated proteins, basic fibroblast growth factor receptor 1 (FGFR1), has been well studied, and partial structures of the protein are available in the PDB. Although this protein is less frequently mutated than some of the other genes, the nature of the mutation and availability of structures provide an interesting example for our structural analysis.
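The derived "percentage of tumors mutated" column can be computed with a few lines of code before importing the table into Cytoscape. The sketch below assumes a hypothetical input layout (gene name mapped to the identifiers of tumors carrying a mutation) and uses invented toy identifiers; it does not reproduce the actual TCGA file format.

```python
def percent_mutated(tumors_by_gene, n_sequenced):
    """Return {gene: percentage of sequenced tumors with a mutation in that gene}.

    tumors_by_gene maps a gene name to an iterable of tumor identifiers;
    a tumor with several mutations in the same gene is counted once.
    """
    if n_sequenced <= 0:
        raise ValueError("n_sequenced must be positive")
    return {
        gene: 100.0 * len(set(tumors)) / n_sequenced
        for gene, tumors in tumors_by_gene.items()
    }


# Toy input (invented): four sequenced tumors, two hypothetical genes.
toy_mutations = {
    "gene_X": ["tumor_1", "tumor_2", "tumor_2"],  # tumor_2 has two mutations
    "gene_Y": [],
}
toy_percentages = percent_mutated(toy_mutations, n_sequenced=4)
```

The resulting dictionary can be written out as a two-column table and imported as a node attribute, which is then available for a continuous color mapping.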
FGFR1 is a receptor tyrosine kinase. Like other receptor tyrosine kinases, it comprises an extracellular ligand-binding part, a single transmembrane-spanning segment, and an intracellular part that includes a protein-tyrosine kinase domain. Ligands such as fibroblast growth factor that activate FGFR1 cause receptor dimerization and autophosphorylation across the dimer interface. Autophosphorylation shifts the kinase domain into an active state. The activated receptor goes on to bind and phosphorylate several downstream partners (
), so here we only address the mutation K656E found in the TCGA study. The structureViz Cytoscape plug-in can be used to load structures into Chimera for analysis. The two structures of interest are the structure of the FGFR1 kinase domain in the inactive state (for example, PDB code 3c4f (
). Tyr-653, Tyr-654, and the glioblastoma mutation site Lys-656 are all within the “activation loop,” which undergoes a large conformational change.
The two structures of the FGFR1 kinase domain can be superimposed in Chimera to show the magnitude of the conformational change (Fig. 4). In the activated conformation, Lys-656 is hydrogen-bonded (H-bonded) to phospho-Tyr-654 as indicated by the red dashed line in Fig. 5. Mutation of this lysine to glutamate, which is negatively charged, could mimic or interfere with phosphorylation, or otherwise affect conformation or molecular recognition. Of note, mutation from a positively charged lysine to a negatively charged glutamate gives a net change in charge of −2, about the same (depending on pH) as a single tyrosine phosphorylation. Mimicry of phospho-Tyr-654 is compelling because Glu-656 and unmodified Tyr-654 might still form an H-bond and occupy a similar volume and thus increase kinase activity in the absence of the final autophosphorylation. The negative potential from Glu-656 might also enhance earlier stages of activation.
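The charge argument above is simple arithmetic, and it can be written down as a small sketch. The side-chain charges used here are approximate formal charges near neutral pH (with histidine treated as neutral); that table and the helper names are assumptions made for illustration.

```python
import re

# Approximate formal side-chain charges near neutral pH.
# Assumption: histidine and all residues not listed are treated as neutral.
SIDE_CHAIN_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def parse_mutation(label):
    """Split a label such as 'K656E' into (wild_type, position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def charge_change(label):
    """Net change in formal side-chain charge caused by the substitution."""
    wild_type, _, mutant = parse_mutation(label)
    return SIDE_CHAIN_CHARGE.get(mutant, 0) - SIDE_CHAIN_CHARGE.get(wild_type, 0)


k656e_change = charge_change("K656E")  # lysine (+1) to glutamate (-1)
```

For K656E the result is −2, matching the value quoted in the text.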
Several types of analyses can be performed in Chimera. As alluded to above, the structures can be superimposed. In this case, the MatchMaker tool was used; it automatically pairs residues based on an initial sequence alignment and iterates the fit so that only the well-matched portions are used in the final superposition. Distances between activation loop residues in the two states can be evaluated as a measure of the conformational change. The change can be represented visually not only by static images of superimposed structures but also by an animation generated by “morphing” between the different conformations. Chimera analyses suitable for the individual structures include finding H-bonds and contacts of residues (thus discerning their structural and functional roles and, by extension, what might happen if they were mutated) and coloring the molecular surface by electrostatic potential. Another way to assess the importance of a residue is by calculating its conservation in a sequence alignment using the Multalign Viewer tool of Chimera; several measures of conservation can be computed, and the alignment can be read in from an external file or generated within Chimera, for example by performing a BLAST (Basic Local Alignment Search Tool) search. Finally, a first-order, crude analysis of a mutation can be performed by “swapping” one residue type for another in the structure with the Rotamers tool. All of these analyses can be performed interactively and take at most a few seconds, and so it is quite reasonable to try out “what if” ideas in an exploratory fashion as a means of potentially generating new hypotheses.
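As an illustration of what a per-column conservation measure looks like, the sketch below computes Shannon entropy for each column of a multiple sequence alignment (lower entropy means higher conservation). This is one generic measure chosen for illustration and is not claimed to be the calculation that Multalign Viewer performs; the function names and the toy alignment are assumptions.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.

    Gap characters ('-') are ignored. Returns None for an all-gap column.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return None
    counts = Counter(residues)
    total = len(residues)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Entropy for every column of an alignment given as equal-length strings."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment (invented): column 0 is invariant, column 1 is split evenly.
toy_alignment = ["MA", "MA", "MB", "MB"]
toy_profile = conservation_profile(toy_alignment)
```

In this toy case the invariant first column scores 0 bits and the evenly split second column scores 1 bit.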
Scenario 2: Mutations in IDH1 Associated with Glioblastoma Multiforme
This scenario begins from the structural perspective. The enzyme IDH1 has been found commonly mutated at Arg-132 in human glioblastoma multiforme (
) reveals that the mutation is in the active site of the enzyme (Fig. 5); Arg-132 forms hydrogen bonds (red dashed lines) with the substrate, isocitrate. Assays of the R132H, R132C, and R132S mutants of IDH1 in vitro found substantial reductions in both affinity for substrate and catalysis of the normal reaction, and expression in cultured cells showed reduced levels of the product α-ketoglutarate. The lower levels of α-ketoglutarate led to higher levels of the hypoxia-inducible factor 1, α subunit (HIF-1α). This might explain the cancer association, as HIF-1α facilitates tumor growth under low-oxygen conditions.
) argued that a gain of function by IDH1 mutants rather than a loss is important for cancer causation and that the effect is mediated by an alternative reaction product or “oncometabolite,” 2-hydroxyglutarate (2HG). Several lines of evidence were presented. (a) Metabolite profiling identified 2HG as significantly higher in cells expressing R132H versus wild-type IDH1. (b) The R132H, R132C, R132L, and R132S mutants of IDH1 catalyzed the reduction of α-ketoglutarate to 2HG in vitro (in vivo, the remaining wild-type copy of IDH1 could supply the α-ketoglutarate). (c) α-Ketoglutarate levels were not statistically different between tumors with and without IDH1 mutations, whereas 2HG levels were higher in the IDH1 mutant tumors. (d) Patients with elevated 2HG caused by mutations in a different enzyme also have an increased risk of developing brain tumors. How 2HG itself increases cancer risk is not known, but the authors listed some possible mechanisms (
The experimentally determined structures of native IDH1 (PDB code 1t01) and the R132H mutant (PDB code 3inm) can be aligned in Chimera using the MatchMaker extension. The structure of the R132H mutant is highly similar overall to that of the wild type, with the most obvious change being the loss of the arginine side chain at position 132 and its interactions (Fig. 5). The main image in Fig. 5 shows the wild-type structure, whereas the inset shows the mutant (PDB code 3inm) with isocitrate modeled into the structure based on the MatchMaker superposition with the wild-type structure. In this modeled complex, the His-132 side chain is too far from the isocitrate to H-bond with it, as assessed by the Chimera FindHBond tool with default settings.
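The idea that a side chain can be "too far" to form an H-bond can be sketched as a simple distance screen between a donor atom and an acceptor atom. This is a deliberate simplification: it checks distance only, whereas H-bond detection tools generally also apply angular criteria. The 3.5 Å cutoff, the function name, and the coordinates are assumptions for illustration and are not taken from the FindHBond defaults or from the PDB entries.

```python
from math import dist

# Assumed donor-acceptor heavy-atom distance cutoff in angstroms.
DISTANCE_CUTOFF = 3.5


def within_hbond_distance(donor_xyz, acceptor_xyz, cutoff=DISTANCE_CUTOFF):
    """Return True if two atoms are close enough to be H-bond candidates.

    Distance-only screen; no angle test is performed.
    """
    return dist(donor_xyz, acceptor_xyz) <= cutoff


# Invented coordinates: one pair 2.9 Å apart, one pair 5.0 Å apart.
close_pair = within_hbond_distance((0.0, 0.0, 0.0), (2.9, 0.0, 0.0))
far_pair = within_hbond_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```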
Several metabolic pathways involve the isocitrate dehydrogenases (besides IDH1, there are IDH2 and IDH3 isoforms), including the tricarboxylic acid cycle (also known as the Krebs cycle or citric acid cycle), the main energy-producing cycle in the cell. IDH1 is not found in the mitochondria in humans and hence is not directly involved in the tricarboxylic acid cycle. IDH1 resides in the cytosol and peroxisome and is NADP(+)-dependent. It is associated with NADPH regeneration and glutathione metabolism. We could utilize Cytoscape to place IDH1 in the appropriate pathway, but our main concern is not so much the enzyme but the oncometabolite 2HG that the mutant enzyme has been shown to produce (
). We would like to identify proteins that bind this compound to reveal (a) potential explanations of why it increases the risk of cancer or (b) entities in which mutations might lead to increased levels of 2HG and thus increased risk. The Similarity Ensemble Approach (SEA) provides a resource to search for proteins based on similarity of the compounds they bind (
), we loaded a version of the SEA network, “chembl,” into Cytoscape and selected a subset of the proteins that were returned as significant by our search. This gave us a small network for viewing some of the proteins known to bind to small molecules similar to 2HG (Fig. 6A). The chemViz plug-in to Cytoscape allows inspecting the chemical properties of the ligands of proteins selected from the SEA network. Properties of the small molecule ligands include two-dimensional structure, numbers of H-bond acceptors and donors, molecular weight, and several other widely used cheminformatics descriptors (Fig. 6C).
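Searching for proteins by the similarity of the compounds they bind rests on a pairwise measure of chemical similarity. A common building block is the Tanimoto coefficient between two sets of structural features (fingerprints), sketched below. This shows only the pairwise similarity step, not the statistical model that SEA applies on top of it; the feature labels are invented placeholders rather than real fingerprints of 2HG or any other compound.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient between two feature sets.

    Returns a value between 0.0 (no shared features) and 1.0 (identical sets).
    Two empty sets are defined here as having similarity 0.0.
    """
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Invented feature sets for two hypothetical compounds.
compound_1 = {"carboxylate", "hydroxyl", "c5_chain"}
compound_2 = {"carboxylate", "c5_chain", "ketone"}
toy_similarity = tanimoto(compound_1, compound_2)
```

Here two of the four distinct features are shared, giving a coefficient of 0.5.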
The five proteins in our network do not hold many surprises. Q9Y3Q0 is a membrane carboxypeptidase that releases an unsubstituted C-terminal glutamyl residue (
). This may be an interesting target for 2HG both in terms of its potential role in differentiation and its known activity in regulating angiotensin. We have not yet followed up on this finding, as our immediate goal here was to illustrate how tools can be used to explore and discover possibly interesting proteins not previously associated with glioblastoma multiforme.
Scenario 3: Using Proteomics to Inform Structural Modeling
In our final scenario, we explore the use of proteomics data as an input to modeling structures of protein complexes. The computational techniques for fitting atomic structures into density maps and the role of protein-protein interaction data are described elsewhere in this issue (Lasker et al. (
To begin our scenario, we use a protein-protein interaction data set for the budding yeast S. cerevisiae (downloaded on December 18, 2008) that contains the combined data from three previous studies by Ho et al. (
) annotations were added using the Cytoscape “Import ontology and annotations” capability. The resulting network is shown in Fig. 7A, which has nodes (proteins) colored by the GO biological process and positioned using the Cytoscape force-directed layout. A Cytoscape session file with these data is included as supplemental Data 3.
Although the network is useful, it is difficult to see the individual relationships between the proteins in a complex or to distinguish strong interactions from weaker interactions. The network can be clustered hierarchically with clusterMaker by selecting the PE scores as the source for the array data (Fig. 7). An alternative approach is to use the implementation of Markov cluster (MCL) in clusterMaker to break the network into distinct groups (Fig. 8). Either type of clustering yields groups of nodes where a single group represents a set of proteins that have been found to interact and, in most cases, form a complex. clusterMaker links different displays of the same data so that a selection in the hierarchical clustering results will be reflected in the network view. This allows the user to interactively interrogate both sets of results.
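For readers unfamiliar with MCL, the following is a minimal pure-Python sketch of the general scheme: alternate "expansion" (squaring a column-stochastic matrix) and "inflation" (raising entries to a power and renormalizing) until flow concentrates within clusters. It is not the clusterMaker implementation; the inflation value, iteration count, threshold, and the toy network (two triangles joined by one weak edge) are assumptions for illustration.

```python
def normalize_columns(m):
    """Scale each column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, iterations=50, threshold=1e-6):
    """Cluster a weighted, undirected network given as a symmetric matrix."""
    n = len(weights)
    # Add self-loops, a common MCL preprocessing step.
    m = [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                     # expansion
        m = normalize_columns([[v ** inflation for v in row]  # inflation
                               for row in m])
    # Each row that retains flow is an attractor; the columns it draws
    # flow from form one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > threshold)
        if members:
            clusters.add(members)
    return clusters


# Toy network: nodes 0-2 and 3-5 form two triangles (weight 1.0),
# joined by a single weak edge between nodes 2 and 3 (weight 0.1).
toy_weights = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    toy_weights[i][j] = toy_weights[j][i] = w
toy_clusters = mcl(toy_weights)
```

In a protein interaction setting the edge weights would be interaction scores such as the PE scores, so that strongly supported interactions carry more flow than weak ones.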
RNA polymerase II is used to illustrate how protein-protein interaction data can feed into the structural modeling of protein complexes (Fig. 7 and Fig. 8, inset). Using both MCL and hierarchical clustering, we can quickly isolate the proteins involved in the RNA polymerase II complex. As with the previous networks, this network has also been annotated with the available PDB structures. All of the available structures include multiple chains, reflecting one or more of the components of the complex. To provide a more useful connection to three-dimensional structure, the nodes have been further annotated with the appropriate chain. With structureViz, the individual chains (individual components of the complex) can be assigned colors in Chimera (Fig. 9).
First, we loaded a structure of the yeast RNA polymerase II complex that includes all 12 subunits and a poly(A) signal in the active site (PDB code 3h3v) (
). Fig. 9 shows the structure in Chimera along with a network in Cytoscape for nine of the 12 protein subunits in RNA polymerase II. The nodes in the network have been colored to match the colors of the chains in the complex. The relative position of the nodes in the network reflects the strength of the interaction (PE score) (
). As might be expected, the PE score between RPB4 and RPB7 is very high (green and medium blue nodes near the right side of the network), and this is reflected in the structure (green and medium blue chains at the top of the structure). These two subunits bind very closely together to form a heterodimer that can be dissociated from the rest of the complex (
). This map is part of a series of three showing the conformational flexibility of the human complex.
The next step is to manually position the structure into the map and then use the Fit in Map tool in Chimera to perform a local optimization. The results can be refined further by loading the human RPB4-RPB7 complex (PDB code 2c35) (
) and superimposing those chains with the corresponding yeast subunits using MatchMaker (described above). The yeast stalk can then be hidden, leaving only the human stalk modeled together with the rest of the yeast complex (Fig. 10). Lasker et al. (
The value of combining proteomics and structural biology to understand biological processes is unquestioned. Anyone with remaining doubts should be convinced by the articles in this special issue. We hope we have demonstrated the value of in silico tools that interactively link these data sets to allow researchers to explore them in context and to use both types of data synergistically for hypothesis generation. The tools presented in the three scenarios are only partially linked, and we hope that increased use will allow us to understand how structural biologists will want to “zoom out” to view their structures in a pathway or interaction context and how the proteomics community will want to “zoom in” to analyze structural data.
In the near future, there are three areas in which we hope to significantly improve the combined use of these tools. First, it is currently very difficult to annotate a Cytoscape network with PDB (structure database) identifiers, which is a prerequisite for using structureViz. We hope to provide a Cytoscape plug-in (or extend structureViz) to search the PDB for each protein in a network and add the list of PDB identifiers to the node attributes.
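The annotation step described here amounts to attaching a list of structure identifiers to each node. The sketch below shows that data flow only: the lookup table stands in for a search of the PDB, and the attribute name `pdb_ids` and the function name are hypothetical, not the names structureViz expects. The PDB codes in the example are the ones cited in the scenarios above.

```python
def annotate_with_structures(node_attributes, structures_by_protein,
                             attribute="pdb_ids"):
    """Return a copy of the node attribute table with PDB identifiers added.

    node_attributes maps node name -> dict of existing attributes.
    structures_by_protein maps node name -> list of PDB codes (a stand-in
    for the result of a structure database search).
    Nodes without known structures receive an empty list.
    """
    annotated = {}
    for node, attrs in node_attributes.items():
        updated = dict(attrs)  # leave the caller's data untouched
        updated[attribute] = list(structures_by_protein.get(node, []))
        annotated[node] = updated
    return annotated


# Example using PDB codes mentioned in the scenarios.
toy_nodes = {"FGFR1": {}, "IDH1": {}, "NF1": {}}
toy_lookup = {"FGFR1": ["3c4f"], "IDH1": ["1t01", "3inm"]}
toy_annotated = annotate_with_structures(toy_nodes, toy_lookup)
```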
Second, there are several useful web-based resources that are cumbersome to access from within Cytoscape. Many of these resources support a web service that allows searching for and downloading annotation data and, in many cases, downloading the results of analyses performed on behalf of the user. Integrating some of these services (for example, SEA) into Cytoscape would allow easier access.
Finally, we would like to enable structural biologists using Chimera to search for relevant pathways or networks and load them into Cytoscape. Both views should be interactive, and the structures in Chimera and corresponding nodes in Cytoscape should be automatically linked so that selections in one program will propagate to the other.
We encourage researchers who focus mainly on structural biology or mainly on proteomics to “cross the aisle” where possible and, by identifying deficiencies in existing computational tools or unmet needs, to drive the development of new and improved analysis and visualization tools. New types of data may bring new insights, and the synthesis of proteomics and structural biology will certainly yield more than the sum of its parts.