PROTEOME-3D: An Interactive Bioinformatics Tool for Large-Scale Data Exploration and Knowledge Discovery

Comprehensive understanding of biological systems requires efficient and systematic assimilation of high-throughput datasets in the context of the existing knowledge base. A major limitation in proteomics is the lack of a software platform that can synthesize large numbers of experimental datasets against that knowledge base. Here, we describe a software platform, termed PROTEOME-3D, that utilizes three essential features for systematic analysis of proteomics data: creation of a scalable, queryable, customized database for identified proteins from published literature; graphical tools for displaying proteome landscapes and trends from multiple large-scale experiments; and interactive data analysis that facilitates identification of crucial networks and pathways. Thus, PROTEOME-3D offers a standardized platform to analyze high-throughput experimental datasets for the identification of crucial players in co-regulated pathways and cellular processes.

mass spectrometry (MS/MS) identification of expressed proteins in complex mixtures.
Predictably, technological advances enabling high-throughput analysis have resulted in an accumulation of experimental data at a rate far exceeding the current ability to assimilate that data. Transforming the rapidly proliferating quantities of experimental data into a usable form in order to facilitate data analysis is a challenging task. Numerous specialized databases and graphical tools have been described to organize the growing collection of large-scale experimental datasets (7-16). These tools have made significant contributions toward functional data organization and the display of protein complexes and hierarchical relationships. Yet the initial interpretation of experimental datasets in an interactive and intuitive way remains a challenge. Important functional information can only be determined through careful and detailed analysis of experimentally identified and quantified data in the context of the current knowledge base. Functional analysis, which is requisite to an exhaustive understanding of cellular networks and pathways, represents a major bottleneck in proteomics today. It is recognized that bridging the expansive gap between the current state of knowledge and the ultimate goal of understanding whole cellular networks requires a global discovery phase to pinpoint pivotal proteins in cellular networks (17). Tools that integrate diverse experimental results with the current knowledge base would undoubtedly facilitate the understanding of biological networks and pathways. Visualization of biological data is an important component of such applications (18).
We describe here a Web-based data exploration and knowledge discovery tool called PROTEOME-3D that utilizes three essential features for effective assimilation and analysis of large-scale experimental datasets: 1) automated construction of a customized database of expressed proteins/mRNAs from the public knowledge base using user-defined criteria; 2) graphical tools for displaying and comparing experimental results in the form of proteomic landscapes; 3) an interactive user interface for in-depth analysis of experimental results. Sample applications are provided to demonstrate how this tool can facilitate the evaluation of experimental results. (For information on how to obtain a copy of PROTEOME-3D, contact David K. Han at han@nso.uchc.edu.)

EXPERIMENTAL PROCEDURES
Information Flow-The general flow of information through PROTEOME-3D is outlined in Fig. 1. Experimental results generated from isotope-coded affinity tag (ICAT) analysis or from cDNA microarrays are pre-processed to create an input file of protein identities (ids) and abundance ratios (see "Database" subsection below for more detail). Protein ids are then used to generate a customized, user-defined dataset from public databases, and the combined experimental and retrieved data are stored in a local database. The PROTEOME-3D graphical interface is accessed through Internet Explorer. Three-dimensional (3D) display and protein page screens are linked for easy navigation, and each screen communicates with the local database through a servlet stored on the server (19). The protein page provides user-selectable links to public and/or proprietary databases and the capability to construct additional customized links.
Database-Experimental results, together with a customized dataset retrieved from public databases, are stored locally in a relational database (Oracle 9i). For each experiment loaded in the database, a list of MS/MS-identified proteins and their calculated abundance ratios is initially read from an INTERACT summary web page, which contains one row of data for each peptide scan conclusively identified by SEQUEST and quantified by XPRESS (20,21). Alternatively, microarray output identified by gene ids and stored in a tab-delimited file is read in a pre-processing step, and a file of corresponding protein ids and abundance ratios is produced. A series of Java application programs is then executed, populating the local database with the experimental results and desired user-defined data. At a minimum, the experiment table contains, for each experiment, a nested table of entries comprising protein accession number, proportion, and standard deviation of the proportion. The proportion attribute is a mean value derived from the abundance ratio as follows: for each identified peptide in a given experiment, the peptide's abundance ratio A:B is used to calculate a proportion B/(A+B), where B is the relative abundance of the peptide in the experimentally perturbed sample and A is its relative abundance in the control sample. Then, from all the peptide samples for a given protein, an average proportion and standard deviation are computed and stored in the nested table entry. This value serves as a normalized representation of abundance ratio, providing comparable values for comparisons between up- and down-regulated proteins, with values ranging from 0 for maximum down-regulation to 1 for maximum up-regulation. Additional information pertinent to each experiment can be stored in the experiment table as well.
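The proportion calculation described above can be illustrated with a minimal Python sketch. This is not the PROTEOME-3D code, which is written in Java; the function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, stdev


def proportion(a: float, b: float) -> float:
    """Convert an abundance ratio A:B into the proportion B / (A + B).

    a: relative abundance in the control sample
    b: relative abundance in the experimentally perturbed sample
    """
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("abundances must be non-negative and not both zero")
    return b / (a + b)


def summarize_protein(peptide_ratios):
    """Return (mean proportion, standard deviation) over one protein's peptides.

    peptide_ratios: list of (A, B) pairs, one per identified peptide.
    Assumption: sample standard deviation; 0.0 when only one peptide exists.
    """
    props = [proportion(a, b) for a, b in peptide_ratios]
    sd = stdev(props) if len(props) > 1 else 0.0
    return mean(props), sd
```

Under this definition a 1:1 ratio maps to 0.5, a 1:3 ratio (three-fold up-regulation) to 0.75, and a 3:1 ratio (three-fold down-regulation) to 0.25, so equal fold changes in opposite directions sit at equal distances from 0.5.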
Detailed information retrieved from public databases for each protein is stored in the protein table. An example of a set of stored attributes includes National Center for Biotechnology Information (NCBI) Protein Data Bank accession number, molecular weight, pI, cross-references to NCBI nucleotide and Online Mendelian Inheritance in Man (OMIM) databases, maploc, and a variety of descriptive fields such as keywords, definition, function, disease, subcellular location, and pathway. To avoid database redundancies and changing accession numbers, to retrieve latest annotations, and to ensure that the local database accurately reflects latest updates, it is necessary to routinely download the latest publicly available databases and subsequently update the local Oracle database.

Data Presentation
3D Graphic Display-An interactive Java3D graphic display represents a given experiment as a set of cone objects (Fig. 2, upper left). Each cone depicts a protein, uniquely identified by its mass (x-axis) and pI (z-axis). The abundance ratio (converted to a proportion, as described above) is graphed on the y-axis and corresponds to the height of the cone. The base of each cone sits on the plane of y = 0.5 (green), which represents a 1:1 abundance ratio. Cones depicting up-regulated and down-regulated proteins project above and below the reference plane, respectively. Cone color is mapped to the interval [0, 1] on the y-axis, with blue, green, and red representing 0.0, 0.5, and 1.0, respectively. Bright red cones, then, denote highly up-regulated proteins, and bright blue cones denote those proteins that are highly down-regulated. For specific information on an individual protein, its cone can be selected by mouse click, which highlights the cone and displays a subset of the protein's attributes in a text field along the bottom of the screen. Next to the text field is a button linking the highlighted protein to the protein page screen, which interfaces with the local database and a number of external data sources. Additional buttons at screen bottom allow selection of alternate experimental displays. Global scene manipulation (zooming in or out, rotation, and scene translation up, down, left, or right) is executed through mouse buttons and the Alt key.
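The mapping from proportion to cone geometry and color can be sketched as follows. The text fixes only the three anchor colors (blue at 0.0, green at 0.5, red at 1.0); the linear interpolation between them, the RGB representation, and the function names are assumptions made for illustration and may differ from the Java3D implementation.

```python
def cone_height(p: float) -> float:
    """Signed height of a cone relative to the reference plane y = 0.5.

    Positive values project above the plane (up-regulated),
    negative values below it (down-regulated).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return p - 0.5


def cone_color(p: float) -> tuple:
    """Map a proportion in [0, 1] to an (r, g, b) triple.

    Anchors from the text: 0.0 -> blue, 0.5 -> green, 1.0 -> red.
    Assumption: colors are interpolated linearly between the anchors.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    if p <= 0.5:
        t = p / 0.5                 # 0 at blue, 1 at green
        return (0.0, t, 1.0 - t)
    t = (p - 0.5) / 0.5             # 0 at green, 1 at red
    return (t, 1.0 - t, 0.0)
```

With this scheme a protein at the 1:1 ratio has zero height and is pure green, which matches the description of the reference plane.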
FIG. 1. Information flow through PROTEOME-3D, from data generation through processing, storage in the local database, and display via graphical user interfaces. Multiple tab-delimited proteomics or microarray profiling experimental results can be used as input for PROTEOME-3D.
To the right of the graphic display is the content pane, which provides the functionality for experiment-to-experiment comparisons. A second experiment is chosen for comparison with the current experiment from a drop-down list under the heading "Select 2nd Exp." Optionally, a separate color can be chosen to represent each experiment ("Exp Colors"), enabling cones of the two experiments to be easily distinguished in the graphic display. Desired comparisons are executed through radio buttons under the heading "Select Function," with options including intersection, union, and complements. An intersection between two experiments is illustrated in Fig. 2, middle and lower left, where Exp1 cones are displayed in violet and Exp2 cones are displayed in green. The detail image of the intersection (Fig. 2, lower left) shows how the inner cone is visible through the transparent outer cone, allowing a visual comparison of the protein's expression in the two experiments.
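The comparison functions amount to set operations on the protein identifiers of two experiments. The sketch below is one way to express them, assuming each experiment is held as a dictionary from protein id to proportion; the function name `compare` and the data layout are hypothetical. The option names follow the labels used for the multi-experiment comparison later in the paper.

```python
def compare(exp1: dict, exp2: dict, function: str) -> dict:
    """Select proteins from two experiments by a named set operation.

    exp1, exp2: mappings of protein id -> proportion.
    Returns protein id -> (proportion in exp1, proportion in exp2),
    with None where a protein is absent from an experiment.
    """
    ids1, ids2 = set(exp1), set(exp2)
    selectors = {
        "Intersection": ids1 & ids2,   # identified in both experiments
        "Union": ids1 | ids2,          # identified in either experiment
        "Complement1": ids1 - ids2,    # in Exp1 but not in Exp2
        "Complement2": ids2 - ids1,    # in Exp2 but not in Exp1
        "Complement": ids1 ^ ids2,     # Complement1 and Complement2 combined
    }
    if function not in selectors:
        raise ValueError(f"unknown function: {function}")
    return {pid: (exp1.get(pid), exp2.get(pid))
            for pid in sorted(selectors[function])}
```

For an intersection, each returned pair holds the two proportions that the display would draw as nested inner and outer cones.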
Protein Page-The protein page screen (Fig. 2, lower right) provides the functionality for in-depth analysis of a selected protein. It comprises an interface to the local Oracle database, a customizable set of links to user-selected external Web sites, and a query-building tool for use with the local database. The protein page is accessible from the graphic display screen by a button click, as described above. The screen is functionally divided into two sections. The top section displays attributes of the current protein that are stored in the local database, including comments that can be selected by name from a drop-down list. Additional information about the protein is accessible through customized links, which are defined on a pop-up window where the user can select from default sites or input his or her own URLs. The bottom section of the protein page screen comprises a query-builder tool, with a button linking back to the display page, allowing results of database queries to be viewed as a 3D graphic display. Fields, operators, and connectors for a query are selected from drop-down lists; the search key is typed into a text box. Subqueries can be combined, using an appropriate connector, until the desired query is complete. After execution, query results can be viewed in a text box or as a 3D graphic display. Additional clauses can be added to the existing query, facilitating a natural progression to increasingly more specific subsets of data, or, alternatively, the user can begin again with a new query.
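A query builder of this kind can be approximated by assembling a parameterized WHERE clause from (connector, field, operator, search key) tuples. The sketch below uses Python's built-in sqlite3 module as a stand-in for the Oracle database; the table name, column names, allowed operators, and sample rows are all invented for illustration and do not reflect the actual PROTEOME-3D schema.

```python
import sqlite3

# Hypothetical whitelist standing in for the drop-down lists.
ALLOWED_FIELDS = {"accession", "keywords", "subcellular_location", "maploc"}
ALLOWED_OPERATORS = {"=", "LIKE"}
ALLOWED_CONNECTORS = {"AND", "OR"}


def build_query(subqueries):
    """Build (sql, params) from a list of (connector, field, operator, key).

    The connector of the first subquery is ignored. Search keys are passed
    as bound parameters rather than spliced into the SQL text.
    Note: SQL evaluates AND before OR when both connectors are mixed.
    """
    if not subqueries:
        raise ValueError("at least one subquery is required")
    clauses, params = [], []
    for i, (connector, field, operator, key) in enumerate(subqueries):
        if field not in ALLOWED_FIELDS or operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported field or operator: {field} {operator}")
        if i > 0:
            if connector not in ALLOWED_CONNECTORS:
                raise ValueError(f"unsupported connector: {connector}")
            clauses.append(connector)
        clauses.append(f"{field} {operator} ?")
        params.append(key)
    sql = ("SELECT accession FROM protein WHERE "
           + " ".join(clauses) + " ORDER BY accession")
    return sql, params


# Invented sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (accession TEXT, keywords TEXT, "
             "subcellular_location TEXT, maploc TEXT)")
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", [
    ("HYP_0001", "Calcium/phospholipid-binding; Membrane", "membrane", "9q21"),
    ("HYP_0002", "Glycolysis", "cytoplasm", "12p13"),
    ("HYP_0003", "Membrane; Transport", "nucleus", "8p21"),
])

sql, params = build_query([
    (None, "keywords", "LIKE", "%Membrane%"),
    ("AND", "subcellular_location", "=", "membrane"),
])
rows = [r[0] for r in conn.execute(sql, params)]
```

Appending a further tuple to the list and rebuilding the query corresponds to adding a clause to an existing query in the interface.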
Sample Applications-PROTEOME-3D provides a visual summary of potentially massive sets or selected subsets of experimental data, including experiment/experiment comparisons; an advanced querying capability on locally stored data; and direct exploration of experimental results in the context of the public knowledge base. This combination of features allows efficient navigation through vast quantities of data and extraction of subsets of experimental proteins with unique properties relevant to the area of research. The chosen subset can then be viewed graphically and further analyzed by accessing the public databases, or by experiment/experiment comparison, in which common or unique proteins from a second profiling experiment are jointly displayed in the 3D graphical format. The following examples, using data from ICAT proteomics profiling experiments (20, 25), demonstrate specifically how this tool can help analyze large-scale protein profiling experiments systematically and efficiently.
Example 1: Analysis of Membrane Proteins That Are Differentially Regulated During Apoptosis-Apoptotic cells are rapidly recognized and engulfed by neighboring cells and macrophages. This recognition process is theoretically a summation of changes in adhesive and repulsive signals presented on the apoptotic cell surface (22). In an effort to identify adhesive engulfment ligands, we profiled membrane proteins from control Jurkat cells and anti-Fas immunoglobulin M-treated apoptotic Jurkat cells, using a previously described method (20). We then manually explored the experimental results, together with pertinent public databases, searching for candidate proteins for further study. In fact, the impetus for creating an automated discovery tool came from this and similar time-consuming, manual searches of databases and experimental results. Here we explore the utility of PROTEOME-3D in the analysis of this dataset. Briefly, proteins positively identified (with p > 0.9) and experimentally quantified from the INTERACT page (Fig. 3A) were loaded into the PROTEOME-3D local database, automatically providing three inter-related tools for systematic and efficient analysis: 1) the local database itself, storing information retrieved from a variety of public databases, together with experimental data; 2) the interactive 3D graphical display of experimental results; and 3) a protein page linked to the 3D display, providing an interactive, queryable interface to the local database as well as user-defined links to publicly available network and pathway databases. These features allow rapid analysis of proteins for putative engulfment ligands. Among the 114 proteins identified and quantified from the Jurkat experiment, one stands out immediately in the 3D graphic display of experimental results ( switch to the protein page for further exploration of this protein (Fig. 3C).
Its keyword annotation lists calcium/phospholipid-binding as a known function, and recent literature cited in the OMIM link implicates annexin I in anti-inflammatory activity (23,24). These features suggest a possible role for human annexin I in the apoptotic cell engulfment process and anti-inflammatory activity. Due to its significant up-regulation in the graphic display and its membrane-binding activities, this protein is quickly recognized as an intriguing candidate for further study. In fact, additional experiments were performed on annexin I (as a result of the original manual discovery process), and its function as an endogenous engulfment ligand was discovered (25). Thus, the ability to graphically visualize the proteome landscape of a profiling experiment and quickly access known biological activity allows the user to focus on candidate proteins that are most relevant to the particular biological system.

Example 2: Analysis of Membrane Proteins from Prostate Cancer Cells That Are Regulated by Treatment with Androgen Homolog R1881-Selection of a smaller subset of proteins from a large number of experimentally identified and quantified proteins or cDNAs requires efficient and systematic analysis of the experimental dataset. Here, we describe the use of PROTEOME-3D to analyze experimental ICAT profiling data from a prostate cancer cell line (LNCaP) treated with or without androgen analog R1881 for 24 h. The entire experimentally determined dataset includes 4052 proteins displayed in Fig. 4A. It is well documented that multi-step carcinogenesis requires sequential aberrations in chromosomal DNA, and a large number of chromosomal changes associated with prostate cancer have been described (Table I). The ability of PROTEOME-3D to automatically retrieve user-relevant data for local storage makes chromosomal localization of all of the identified and quantified proteins a queryable feature. Querying all proteins mapped to any of the loci listed in Table I results in the selection of 921 proteins. This subset is then further refined using the functional term "Oncogene," resulting in a subset of 31 proteins. Adding to the query the more specific term "Tumor suppressor" results in the two down-regulated and three up-regulated proteins responsive to androgen treatment composing the final subset. This progressive refinement of the search criteria, along with a graphic display of the resultant subset, is depicted in Fig. 4B. Now a much more manageable set of candidate proteins can be further analyzed via the protein page (Fig. 4C), allowing for additional queries and/or in-depth analysis of pathway and network databases. For example, one of the up-regulated proteins, ATM_HUMAN, is selected from the graphic display by mouse click and its links are followed to GENEGO and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway databases (Fig. 4C, lower left and right, respectively).
Thus, interactions and networks associated with this experimentally identified and quantified tumor suppressor protein can be easily explored from the PROTEOME-3D platform, allowing easy integration of experimental data with the current knowledge base.
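The progressive refinement in this example (loci, then "Oncogene," then "Tumor suppressor") can be pictured as a chain of filters, each applied to the result of the previous one. The sketch below works on in-memory records rather than the Oracle database; the record layout, the exact-match test on loci, the case-insensitive substring test on keywords, and every sample value are assumptions made for illustration only.

```python
def refine(proteins, *, loci=None, keyword=None):
    """Return the subset of `proteins` matching a locus set and/or a keyword.

    proteins: list of dicts with hypothetical keys "id", "maploc", "keywords".
    loci: optional set of chromosomal loci (exact match assumed).
    keyword: optional term matched as a case-insensitive substring.
    """
    result = proteins
    if loci is not None:
        result = [p for p in result if p["maploc"] in loci]
    if keyword is not None:
        needle = keyword.lower()
        result = [p for p in result if needle in p["keywords"].lower()]
    return result


# Invented records; the loci are placeholders, not taken from Table I.
proteins = [
    {"id": "HYP_A", "maploc": "8p21", "keywords": "Anti-oncogene; Tumor suppressor"},
    {"id": "HYP_B", "maploc": "10q23", "keywords": "Proto-oncogene; Kinase"},
    {"id": "HYP_C", "maploc": "8p21", "keywords": "Glycolysis"},
    {"id": "HYP_D", "maploc": "2q11", "keywords": "Proto-oncogene"},
]

query1 = refine(proteins, loci={"8p21", "10q23"})   # proteins at loci of interest
query2 = refine(query1, keyword="Oncogene")         # refine by functional term
query3 = refine(query2, keyword="Tumor suppressor") # refine further
counts = [len(query1), len(query2), len(query3)]
```

Each stage can only shrink or preserve the previous subset, which is the property that takes the dataset in this example from 921 proteins to 31 and then to 5.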
Example 3: Analysis of Coregulation of Cellular Processes-Many biological processes are controlled by multi-protein complexes whose properties, such as abundance and subcellular location, are integral to that control (26). Exploring coregulation of proteins by category (for example, proteins associated with particular functional groups or particular subcellular locations) provides crucial biological information. This type of analysis can be easily done with PROTEOME-3D, using the query builder tool and displaying the results graphically (Fig. 5). We functionally categorized proteins from the LNCaP profiling experiment as mitochondrial, glycolytic, housekeeping, and structural by querying locally stored comment and keyword fields. The resulting graphical displays, which summarize the profiles of proteins associated with these functional groups, are shown in Fig. 5A. In Fig. 5B, the experimental results are categorized by subcellular location into the four broad groups of extracellular (secreted), cytoplasm, membrane, and nucleus. In each of the displays, highly up- or down-regulated proteins are easily distinguished by their corresponding bright red or blue color and can be easily selected for further analysis.
Example 4: Analysis of Multiple Profiling Experiments-The efficient evaluation of multiple, large-scale expression analyses, time-dependent changes in expression, and protein profiles generated by diverse methodologies requires a software platform that can simultaneously display multiple experiments for analysis. We have implemented a unique component of PROTEOME-3D, termed "Multi-Experiment Comparison," where multiple proteome, cDNA microarray, or combinations of profiling experiments can be efficiently analyzed. The first example (Fig. 6, A-C), comparing datasets extracted from Jurkat and LNCaP cell lines described above (Example 1 and Example 2), demonstrates salient features of this tool. Fig. 6A illustrates the intersection of proteins identified in each experiment, allowing a visual comparison of expression patterns for proteins common to the two cell lines. Zooming in, as shown previously in Fig. 2 (lower left), allows the experimentalist to discern similar or divergent patterns of regulation of a particular protein between the two experiments. Fig. 6B illustrates the Complement1 function, depicting proteins identified in Exp1 (Jurkat cell line) but not in Exp2 (LNCaP). Additional functions (not shown) include a display of proteins uniquely identified in Exp2 (Complement2), a combination of Complements 1 and 2 (Complement), and all proteins identified in either experiment (Union). Subsets of the data can also be easily extracted for comparison, as in Fig. 6C, where expression profiles of cytoplasmic proteins common to both experiments (Intersection) are displayed. PROTEOME-3D's multi-experiment comparison can also be used to analyze a cDNA microarray dataset concurrently with an ICAT proteome profiling experiment, as shown in Fig. 6, D-F. LNCaP cells treated with androgen analog R1881 were used to isolate mRNAs and proteins for comparative analysis. The regulation of 56 genes is displayed in Fig. 6D, where specific gene accession numbers have been converted to Swiss-Prot loci in order to compute molecular weight and pI for the 3D graphical display page. Subsequent comparison of 56 genes with 2294 proteins from the ICAT experiment, shown in Fig. 6E, results in 25 common proteins. Using this feature of PROTEOME-3D, the investigator can easily explore and analyze co-regulated and differentially regulated mRNAs and proteins (Fig. 6F).

DISCUSSION
Systematic and efficient analysis of vast genomic and proteomic data sets is a major challenge for researchers today. Crucial biological advances in the study of model organisms are made daily, and new information is continuously deposited in publicly available databases. Thus, for each protein or mRNA that is identified and quantified from expression profiling experiments, a wealth of biologically relevant information, such as associated biological networks and pathways, protein interaction partners, biochemical activities, pathological/disease association, subcellular and tissue-specific expression, domain structure and function, may exist in publicly available databases. These datasets can be efficiently utilized by experimentalists to decipher complex functions of proteins if the crucial information is easily accessible. Yet due to the profound differences among the various biological databases housing such diverse data, information retrieval is an overwhelmingly manual, rate-limiting step in the researcher's analysis of experimental results, making integration of existing biological data a critical problem (40).
Thus, to systematically and efficiently evaluate large-scale experimental results in the context of existing biologically relevant data, at least four crucial features are required: 1) automatic retrieval of user-defined information to construct a customized, queryable database; 2) an intuitive graphical and query platform to display and analyze experimental data in the context of the customized database; 3) efficient utilization of web-based bioinformatics software tools for data interpretation, prediction of function, and modeling; and 4) scalability and reconstruction of the database in response to changing user needs and an ever-expanding base of knowledge and bioinformatics tools.
FIG. 4 (caption, beginning missing). ... Table I; Query 2 refines the first query with the keyword "Oncogene"; Query 3 further refines the query with the keyword "Tumor suppressor." C, Locally stored data for ATM_HUMAN, highlighted in B, is displayed on the protein page (top); portions of associated regulatory pathways from GENEGO and KEGG databases, accessible through links on the protein page, are displayed in the bottom panels.
Creating a software tool to encompass the four crucial features outlined above is a challenging and ongoing task, particularly with respect to the ever-expanding publicly available base of knowledge and bioinformatics tools. PROTEOME-3D represents an initial attempt to automate the laborious and time-consuming process of experimental data analysis in order to efficiently identify the most salient features and evaluate those features in the context of the existing knowledge base. It is a platform built upon the integration of three related components: 1) a scalable, queryable, customized relational database for local storage of user-defined biologically relevant data; 2) an intuitive and interactive 3D display for evaluating and comparing experimental results; and 3) an interactive user interface for systematic analysis of experimental data. We have demonstrated with specific examples how PROTEOME-3D can be effectively utilized for large-scale experimental analysis. The interactive 3D proteomic landscape provides a striking visual overview of experimental results whose notable features can then be further analyzed through integration of the display with the local, queryable database of biologically relevant data. A customizable set of links to user-selected external sources expedites concurrent utilization of those web-based tools of particular relevance to the analysis. This ability to link PROTEOME-3D with a number of web resources is especially critical for investigation of protein families, splice isoforms, and multiple redundant entries in the databases.
For example, exploration of a regulated protein/cDNA may require detailed analysis of multiple splice isoforms expressed in that tissue. Thus, the ability to create customized links and retrieve user-defined data is essential for detailed data exploration and knowledge discovery. We intend to expand the capabilities of PROTEOME-3D in several areas. One critical need is the development of a customizable database interface, implemented as a Web-based form, allowing users to specify attributes of interest from a defined set of public databases for inclusion in their local, queryable database. In this way the software can be adapted to the specific and evolving needs of individual laboratories. In our initial implementation, we retrieved data from GenBank and OMIM databases. We intend to expand that list to include AmiGO, the Gene Ontology database, to enable queries on a standardized set of annotations (16); Alliance for Cellular Signaling's Molecule Pages database, to obtain qualitative and quantitative data requisite to network modeling (41,42); and additional databases, such as the Biomolecular Interaction Network Database (12), Human Protein Reference Database (www.hprd.org), liveDIP (7), and KEGG (www.genome.ad.jp/kegg/kegg2.html), to provide pertinent data on protein activities, networks, pathways, and interactions.
We also envision expanding our software to automatically interface with key bioinformatics resources. For example, the Virtual Cell project developed at the University of Connecticut Health Center (43) provides tools for building testable models based on experimental results. It is an invaluable aid in understanding biological systems, fine-tuning theoretical models, and guiding the direction of future research. Virtual Cell is already accessible from our protein page interface via a customized link. We are currently working toward automating at least part of the model-building process, using data retrieved from public databases and supplementing with experimental data generated in the laboratory in order to automatically create the initial compartments and species of the Virtual Cell. As reference databases mentioned above become more densely populated with both qualitative and quantitative protein interaction data, more of the model-building process will be adaptable to automation. Also, integrating output from the Virtual Cell simulation with PROTEOME-3D's graphical display will allow a visual comparison of Virtual Cell's predicted cell protein profiles with the corresponding experimentally determined abundance ratios. This automation will expedite the detailed analysis of large-scale datasets by virtual-experimental tools. PROTEOME-3D's multiple-experiment comparison feature is a flexible tool that can be used effectively in a number of diverse applications. For instance, as quantitative protein profiling studies become more prevalent (44), the experiment/experiment comparison can be used to simultaneously display pathological versus normal tissue samples from multiple patients in order to visualize how the profile patterns differ.
Alternatively, this feature provides the means for simultaneous comparison of protein profiles generated by two different methodologies, such as the microarray-generated differential expression data and ICAT-LC/MS/MS-generated abundance ratios displayed in Fig. 6, to see how well their protein compositions and abundance ratios coincide under a given set of experimental conditions. Although a number of graphical programs have been recently introduced for use in the analysis of microarray datasets (for example, Gene MicroArray Pathway Profiler (45), Onto-Express (46), MAPPFinder (47), and GoMiner (48)), PROTEOME-3D represents a novel approach to initial data exploration and knowledge discovery. Rather than the text-based graphical files that are output by the programs mentioned above, PROTEOME-3D displays experimental results in the form of interactive 3D proteomic landscapes, which are easily interpreted visually and easily superimposed to provide a visual experiment/experiment comparison as well. In addition to its flexible graphical display, PROTEOME-3D constructs a customized, queryable local database of expressed proteins/mRNAs from the public knowledge base and provides an interactive user interface for efficient and systematic analysis of experimental results. This paper has detailed specific features of PROTEOME-3D and described numerous sample applications, although the tool has a general applicability that extends beyond the specific examples presented here. Its utility in the analysis of large-scale experimental datasets makes it an invaluable tool in multiple biological and computational research environments.