Molecular & Cellular Proteomics 5:293-305, 2006.
| ABSTRACT |
|---|
Microarray technology currently offers the fastest and most comprehensive molecular evaluations. The estimated total number of genes in humans is 20,000 to 35,000, suggesting that currently available gene chips cover a very large proportion of all expressed genes (5). However, physiology is based on the level of proteins and their activation or deactivation by post-translational modifications such as glycosylation, acetylation, phosphorylation, myristoylation, or proteolytic cleavage. Changes in these parameters are not revealed by measuring mRNA levels. Use of protein-based assays is the obvious and easiest way to ensure that these biologically important changes are detected. Indeed the combined information from gene- and protein-based assays can complement each other and produce more complete information than can be gained from either alone. Unfortunately this combined approach is infrequently done and hardly ever presented.
2D PAGE is a widely used technique for the separation of proteins and comparative proteomics. This method separates proteins based on two molecular properties. The first dimension separates proteins based on their pI. The second dimension separates proteins based on their size. After staining, each protein is found in a spot at coordinates that represent its unique size and pI (6). Depending on the sample size, composition, and gel parameters, several hundred up to 10 thousand individual protein spots can be detected on a gel (7).
In comparative proteomic studies, the 2D PAGE images are analyzed to compare the protein spot patterns and identify differences between samples or groups of samples. Computer programs have been developed to aid the detection of protein features (8-10). Matching algorithms in these programs ensure that comparable protein spots on different gels are correctly matched and obtain quantitative measurements for each spot based on pixel intensity and volumes. The large amount of data generated from each sample requires the use of appropriate statistical methods to extract the information of interest. Traditional analysis of proteome data has consisted primarily of univariate analysis, for example, of treated versus control populations using Student's t test. Yet if the changes in the expression of several proteins combined create a consistent pattern of variation that is related to a specific biological mechanism, a univariate approach may not be sufficient. Fortunately DNA microarray analysis programs were designed specifically to handle large datasets generated from biological systems. They have built-in functions with good graphical interfaces and do not require the user to have programming skills. Clustering algorithms and PCA in these programs identify patterns of expression that may suggest co-regulation. In this study we describe an approach to export 2D gel proteome data from an image analysis program, import it into a gene analysis program, customize files, normalize datasets, and conduct basic statistical analyses as well as clustering algorithms and PCA on the proteome data. This approach produced substantial additional information and insights beyond what the simple t test in the image analysis software could produce.
Many investigators are characterizing both transcriptomes and proteomes of their samples to gain a more complete molecular phenotype. We have previously described our protocols to isolate both proteins and RNA from the same samples to be processed for either proteome or transcriptome analysis (11, 12). By conducting analysis at both the protein and RNA expression levels in a process we call "molecular phenotyping," the two types of data can be used to confirm and supplement insights drawn from the other.
In this study we describe for the first time the application of molecular phenotyping to gain new insights into the development of autoimmune diabetes in mice. The NOD mouse is a model of autoimmune (type 1) diabetes mellitus. The insulin-producing beta cells in the islets of Langerhans of NOD mice become the targets of a spontaneously developing autoimmune attack at around 5 weeks of age. The accumulation of leukocytes around the islets becomes increasingly invasive and destructive until eventually the total beta cell mass has been reduced to a level where the animals cannot produce enough insulin to maintain normal blood glucose levels (13-15). The prepathological stages before 5 weeks of age when leukocytes infiltrate the islet target tissue are not well understood (16, 17). We hypothesized that the aggressive phenotype of leukocytes manifested by the invasion of islets at 5 weeks of age may be developing in part because of underlying defects manifested at the molecular level much earlier. To characterize these defects we collected leukocytes at both 2 and 4 weeks of age. Systems and pathways appearing abnormal at both of these ages would represent "basic defects" driven much more by underlying genetics than by the developing pathology. For the comparison we decided to use the C57BL/6 mouse strain that has been extensively used for immunology research and has shown no indication of being prone to development of autoimmune diseases or of suffering from any other major immune response abnormalities. The design of studying two strains at two ages allows us to focus on consistent basic defects and avoid artifacts of unique short-lived events at a specific age. The use of two-way ANOVA on these data still allows us to reveal strain differences even if there is an underlying shift in expression due to age/development. The proteome and transcriptome data were analyzed to identify both proteins and genes that were differentially expressed between the two strains.
The resulting lists of proteins and genes were analyzed for the presence of members of known molecular networks. These networks showed substantial overlap and connectivity with the common biological "theme" of cell proliferation and cell death.
| EXPERIMENTAL PROCEDURES |
|---|
Image Analysis of Gels
The dataset consists of 20 samples total that can be divided into four groups of five if both strain (NOD and C57BL/6) and age (2 and 4 weeks) are considered. These groups can be combined to two groups of 10 (either by age or by strain) to allow data analysis by Student's t test. Within the "MatchSet" produced by the image analysis software, 251 spots were identified as being expressed in at least one of the 20 gels. We used the default normalization of data in the MatchSet that is based on total pixel quantity in valid spots. This normalization method assumes that few protein spots change within the experiment and that the changes average out across the whole gel. PDQuest software includes three statistical tests for data analysis: the Student's t test (a parametric test) and the Mann-Whitney rank sum test and Wilcoxon signed rank test (both non-parametric tests). The Student's t test was performed, comparing protein expression in the two mouse strains (n = 10, p < 0.05).
Export of Data and Preparation of Formatted Expression Files
The intensity values that are generated for each spot along with the spot identification number can be easily exported to a Microsoft Excel spreadsheet. After the MatchSet was loaded, "export MatchSet" from the export submenu (in the file menu) was selected. A new dialogue window opened with options to either export data for each gel in a MatchSet (Spot Data by Gel) or for each replicate group in the MatchSet (Spot Data by Group). The "spot data by gel" option was selected, and then the data for each gel/spot to be exported were selected. For this analysis all but the "normalized quantity" was deselected because that was the only information to be analyzed. The SSP number for each spot was included, and the file was exported. The file type for the exported data automatically defaults to an .xls spreadsheet that can be opened with Microsoft Excel or other spreadsheet software. From within the spreadsheet software, spot IDs were ordered alphanumerically as rows, and spot-normalized intensities were ordered alphanumerically as columns. For each gel there is a value given if a spot is not found on that gel (the background value). In the 20 gels of our dataset this value was between 4 and 119. Although this is much smaller than the values of even faint spots (usually above 1000), on a log scale it can represent a substantial difference, which in log-based graphic presentations will appear as differences in "spot intensities" (although there are no spots). To avoid misrepresentation, all background values were increased to 119. Although this was expected to reduce the number of differentially expressed spots marginally, in our dataset the effect was so small that we got identical lists of differentially expressed spots with or without the background value correction. The final step was to save the spreadsheet as a tab-delimited text file (.txt).
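The background correction and the final save as tab-delimited text were done by hand in the spreadsheet software. The same two steps can be sketched in Python as follows. This is a minimal illustration only: the function names, spot IDs, gel names, and intensity values are hypothetical, and raising every value below 119 to 119 assumes, as in our dataset, that no real spot has an intensity below the highest background value.

```python
import csv
import io

BACKGROUND_FLOOR = 119.0  # highest background value reported for the 20 gels


def floor_background(rows, floor=BACKGROUND_FLOOR):
    """Raise every intensity below `floor` to `floor`; spot IDs are unchanged."""
    return [(ssp, [max(v, floor) for v in values]) for ssp, values in rows]


def write_tab_delimited(rows, gel_names, handle):
    """Write one row per spot and one column per gel as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["SSP"] + list(gel_names))
    for ssp, values in rows:
        writer.writerow([ssp] + list(values))


# Hypothetical example data: two spots measured on three gels.
rows = [("0101", [4.0, 1520.5, 119.0]), ("0203", [2310.0, 57.0, 980.2])]
floored = floor_background(rows)

buffer = io.StringIO()  # stands in for a .txt file opened for writing
write_tab_delimited(floored, ["gel01", "gel02", "gel03"], buffer)
print(buffer.getvalue())
```

Using a single floor for all gels removes the gel-to-gel differences in background that would otherwise show up as apparent intensity differences on a log scale.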
Analysis of Proteome Data in Gene Expression Analysis Software
We have used the GeneSpring software (version 6.2, Silicon Genetics, Redwood, CA) to analyze our expression array data and therefore used that program in the current protein expression analysis. Similar analyses can be done using other commercially available or free gene expression analysis software (20). The software provides options to use Student's and Welch's t tests as well as one- and two-way ANOVA for data analysis. Those tests allow user-defined parameters and provide a selection of posthoc tests to correct for multiple testing. There is also a variety of clustering algorithms including K-means and hierarchical clustering, again with user-defined parameters. Principal component analysis may be done on "genes" (the term assigned to proteins in this analysis) or on samples. We used all of these tests as described under "Results."
The MatchSet expression levels spreadsheet (in tab-delimited ASCII file format) was imported into GeneSpring using "import data" from the file menu. The proteome file is classified as a "custom array," therefore one must create a template to identify the columns in the spreadsheet to be used for analysis. This option requires the user to "create a custom genome." Therefore, File > Import Data > Create New Genome > Next was selected. For this analysis the "spot ID" is the "gene identifier," and the "normalized spot intensity" is the "signal." The standard procedure for "creating an experiment" was followed (in this software that refers to grouping all of the samples to be analyzed and defining them as an "experiment"). The next step involves defining normalization and sample groupings for this dataset. There are many options for normalization. In this experiment we only selected "per gene normalization" because this is more appropriate for a small number of data points. The next step is to define the experiment parameters for each sample, in this case age and strain. The final step was to select those parameters that will be considered in those analyses that require groups to be defined. In this experiment the mice can be classified by strain without regard to age (NOD or C57BL/6, n = 10), by age without regard to strain (2 or 4 weeks, n = 10), or by age and strain (2- or 4-week-old NOD, 2- or 4-week-old C57BL/6, n = 5).
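The normalization and grouping steps above can be illustrated outside the software with a short Python sketch. Two points are assumptions rather than a description of the program: "per gene normalization" is taken here to mean dividing each spot's intensities by that spot's median across all samples, which is one common form of it, and the sample names and function names are hypothetical.

```python
from statistics import median


def per_gene_normalize(intensities):
    """Divide one spot's intensities by that spot's median across all samples.

    Assumption: median centering; the software may use a different reference.
    """
    center = median(intensities)
    if center == 0:
        raise ValueError("median intensity is zero; cannot normalize")
    return [value / center for value in intensities]


def group_samples(samples, keys):
    """Group sample names by the chosen parameters, e.g. ("strain",)."""
    groups = {}
    for name, params in samples.items():
        label = tuple(params[k] for k in keys)
        groups.setdefault(label, []).append(name)
    return groups


# Hypothetical sample annotations: five mice per strain/age combination.
samples = {
    f"{strain}_{age}wk_{i}": {"strain": strain, "age": age}
    for strain in ("NOD", "C57BL/6")
    for age in (2, 4)
    for i in range(1, 6)
}

by_strain = group_samples(samples, ("strain",))        # two groups of 10
by_age = group_samples(samples, ("age",))              # two groups of 10
by_both = group_samples(samples, ("strain", "age"))    # four groups of 5
normalized = per_gene_normalize([1000.0, 2000.0, 4000.0])
```

The three groupings correspond to the three ways of classifying the mice described above; which one is used depends on the test being run.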
Spot Identification
Protein spots that were found to have interesting patterns of differential expression were subjected to identification of protein(s) present in the spot using peptide mass fingerprinting as described previously (11).
Transcriptome Analysis
The RNA isolated from the trireagent extract of the cells was subjected to gene expression analysis on Affymetrix MOE430A and MOE430B expression arrays using procedures described previously (12). These expression arrays each contain sequences from 22,000 probe sets (genes) whose expression levels are evaluated using the MAS 5.0 software (for details see www.affymetrix.com). The expression data produced by the MAS 5.0 software were imported into GeneSpring software and analyzed. In GeneSpring we conducted two-way ANOVA using the Westfall and Young multiple test correction on the 26,530 probe sets in which at least one of the 20 samples had produced a present or marginal call by the MAS 5.0 software.
Analysis of Molecular Networks
To investigate whether lists of differentially expressed proteins belong to specific pathways, we conducted "Ingenuity Pathways Analysis" on a subscription, web-delivered application that enables biologists to discover, visualize, and explore networks relevant to their experimental results such as gene expression array datasets. We used this service for both our lists of proteins and RNA expression data. For a detailed description of Ingenuity Pathways Analysis visit www.ingenuity.com.
The files containing gene identifiers (Swiss-Prot identifiers for protein lists and Affymetrix probe set identifiers for gene lists) were uploaded as a tab-delimited text file. Each gene identifier was mapped to its corresponding gene object in the Ingenuity Pathways Knowledge Base. The genes on the uploaded lists, called Focus Genes, were then used as the starting point for generating biological networks. To start building networks, the application queries the Ingenuity Pathways Knowledge Base for interactions between Focus Genes and all other gene objects stored in the knowledge base and generates a set of networks with a network size of no more than 35 genes/proteins. Ingenuity Pathways Analysis then computes a score for each network according to the fit of the user's set of significant genes. The score is derived from a p value and indicates the likelihood of the Focus Genes in a network being found together due to random chance. A score of 2 indicates that there is a 1 in 100 chance that the Focus Genes are together in a network due to random chance. Therefore, scores of 2 or higher have at least a 99% confidence of not being generated by random chance alone.
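The relation between score and p value described above is consistent with a score equal to the negative base-10 logarithm of the p value. The sketch below assumes that form; it is inferred from the "score of 2 means 1 in 100" example and is not taken from the vendor's documentation, and the function names are hypothetical.

```python
import math


def network_score(p_value):
    """Score as the negative base-10 logarithm of the p value (assumed form)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p value must be in (0, 1]")
    return -math.log10(p_value)


def p_from_score(score):
    """Invert the assumed relation: chance probability implied by a score."""
    return 10.0 ** (-score)


score_for_one_in_hundred = network_score(0.01)
p_for_score_18 = p_from_score(18)
p_for_score_14 = p_from_score(14)
```

Under this assumption, each additional score unit corresponds to a tenfold lower probability that the Focus Genes appear together by chance.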
The networks are displayed graphically as nodes (genes/gene products) and edges, or lines (the biological relationships between the nodes). The node symbols used by Ingenuity are derived from the symbol of the gene encoding the corresponding protein/mRNA. Human, mouse, and rat orthologs of a gene, although stored as separate objects in the knowledge base, are represented as a single node in the network. Nodes are displayed using various shapes that represent the functional class of the gene product. Lines with arrows indicate induction, whereas other lines indicate different types of relationships such as phosphorylation, enzymatic action, etc.
Biological functions were assigned to each gene network by using the findings that have been extracted from the scientific literature and stored in the Ingenuity Pathways Knowledge Base. The biological functions assigned to each network are ranked according to the significance of that biological function to the network. A Fisher's exact test is used to calculate a p value determining the probability that the biological function assigned to that network is explained by chance alone. We listed only the three most significant biological functions for each network.
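For readers unfamiliar with this test, the following Python sketch shows a one-sided Fisher's exact test for over-representation, computed as the upper tail of the hypergeometric distribution. It is an illustration only: the reference set and the sidedness used by the Ingenuity application are not described here, so both are assumptions, and the counts in the example are hypothetical.

```python
from math import comb


def fisher_exact_enrichment(k, n, K, N):
    """One-sided Fisher's exact p value, P(X >= k), X ~ Hypergeometric(N, K, n).

    N: genes in the reference set
    K: reference genes annotated with the biological function
    n: genes in the network
    k: network genes annotated with the biological function
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)
    # Sum the probabilities of observing k or more annotated genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Hypothetical counts: 3-gene network, 2 annotated; 10 reference genes, 4 annotated.
p_small_example = fisher_exact_enrichment(k=2, n=3, K=4, N=10)
```

A small p value means that the function is annotated to more network members than would be expected if the network were a random draw from the reference set.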
| RESULTS |
|---|
In contrast to the image analysis program, the gene expression analysis program allowed us to conduct a two-way ANOVA. This analysis was conducted on the spot intensity data to evaluate age effects, strain effects, and interaction between the two parameters. In that analysis we found that eight spots had significant (p < 0.05) strain effects, four spots had age effects, and 12 spots showed significant interaction between the two parameters. Three spots belonged to all three of those groups (had significant strain, age, and interaction effects). The remaining spots had only one of the three types of effect (Fig. 3). As expected, although there is some overlap between the lists of proteins showing strain differences by ANOVA and by Student's t test, there are also differences.
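To make the three effects concrete, the sketch below computes the sums of squares and F statistics of a balanced two-way ANOVA with interaction for a single spot. It is a textbook calculation written for illustration, not the software's implementation; the data are hypothetical (two replicates per cell instead of the five per group in our dataset), and converting an F statistic to a p value would additionally require an F distribution.

```python
from statistics import mean


def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction for one spot.

    `cells[(a, b)]` holds the replicate values for level `a` of factor A
    (e.g. strain) and level `b` of factor B (e.g. age).
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if len(cells) != len(a_levels) * len(b_levels):
        raise ValueError("every strain/age combination needs data")
    if n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("need a balanced design with >= 2 replicates per cell")

    grand = mean([x for v in cells.values() for x in v])
    a_mean = {a: mean([x for b in b_levels for x in cells[(a, b)]]) for a in a_levels}
    b_mean = {b: mean([x for a in a_levels for x in cells[(a, b)]]) for b in b_levels}
    cell_mean = {key: mean(v) for key, v in cells.items()}

    ss = {
        "A": n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "B": n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "AxB": n * sum(
            (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels for b in b_levels
        ),
        "error": sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v),
    }
    df = {"A": len(a_levels) - 1, "B": len(b_levels) - 1}
    df["AxB"] = df["A"] * df["B"]
    df["error"] = len(a_levels) * len(b_levels) * (n - 1)
    ms_error = ss["error"] / df["error"]
    f_stats = {term: (ss[term] / df[term]) / ms_error for term in ("A", "B", "AxB")}
    return {"ss": ss, "df": df, "F": f_stats}


# Hypothetical intensities for one spot: factor A is strain, factor B is age in weeks.
cells = {
    ("C57BL/6", 2): [1.0, 3.0], ("C57BL/6", 4): [2.0, 4.0],
    ("NOD", 2): [5.0, 7.0], ("NOD", 4): [6.0, 8.0],
}
result = two_way_anova(cells)
```

Because the strain effect is estimated after accounting for age and interaction, a spot can show a strain effect here without reaching significance in a t test that pools both ages, and vice versa.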
Identification of Proteins in Differentially Expressed Spots
In the above sections we have described a number of different data analysis approaches that identified groups of protein spots whose intensity appeared to be associated with interesting differences in group parameters. Each of these spots was subjected to tryptic digestion and peptide mass fingerprinting to identify protein(s) in the spot. Table I shows the identity of the spots from the t test, ANOVA strain effects, additional spots with strain and age interaction by the ANOVA, the K-means cluster that followed the strain difference pattern, and finally the PCA group. These data indicated that the PCA group contained a protein (ATP5B) that was also identified in the K-means cluster, suggesting that the PCA group may also be associated with a strain difference pattern.
The Ingenuity database had information about 23 of the 24 proteins, and two networks with highly significant scores of 18 and 14, respectively, were created. Located within these two networks (of 35 genes each) were 10 and eight focus genes, respectively. The remaining five proteins did not fall into larger networks, although the database contained some information about them (Fig. 6). In Table I we have indicated both the gene symbol produced by Ingenuity as well as (in parentheses) the specific network to which each protein belongs.
Function and Interconnection of Networks
The Ingenuity server provides a list with information regarding all networks that have been calculated from an uploaded list of focus genes. The server provides information about the number of focus genes in each network, the likelihood of these focus genes being in a network due to random chance (the score), and the biological functions significantly over-represented in each network. In Table II we show the score, number of focus genes, and top three biological functions for each of the two proteome networks (P1 and P2) and the six transcriptome networks (T1-T6). Within these eight networks, cancer is represented in three (P1, T2, and T4), cell cycle is represented in three (T4, T5, and T6), cell death is represented in two (P2 and T1), cell signaling is represented in two (T1 and T6), and cell-to-cell signaling and interaction is represented in two (P1 and T3). No other biological function was among the top three for more than one network. The overall theme for these networks (representing differences in leukocytes from diabetes-susceptible NOD and control C57BL/6 mice) appears to focus on cell proliferation and death with the expected association of inter- and intracell signaling.
| DISCUSSION |
|---|
Using a mixture of statistical analysis, clustering, and PCA on a 2D gel proteome dataset from NOD and C57BL/6 leukocytes, we produced a list of 37 spots that were likely to be differentially expressed between the two strains of mice. Admittedly because of the methods used to produce this list, it was also likely that some of these 37 spots had no expression differences between strains. The lack of statistical significance is acceptable in a discovery project where type I errors are much less of a concern than in traditional hypothesis testing because the data are not used to draw any conclusions before additional validation approaches in the downstream process have been applied. In our case these processes included narrowing the list of interest with network analysis and comparison with transcriptome data collected by a much more statistically stringent process. This study not only demonstrates the use of gene expression software for analysis of proteome data generated from 2D gels, it also demonstrates that this approach allows investigators to extract much more information from a proteome dataset. The t test available in traditional image analysis software produced a list of only seven spots differentially expressed between strains.
Unfortunately proteins in 10 of the 37 differentially expressed spots could not be identified, and only 24 different proteins were identified because some spots were found to contain the same protein. The identification of the same protein in two spots is likely due to the fact that many post-translational modifications create separate new protein spots on a 2D gel due to significant shifts of the pI of the proteins. It is important to note that data on spot intensity in a 2D gel are not the same as a protein assay for that protein. In most cases a 2D gel spot only reflects the amount of the protein that is or is not modified by a specific post-translational modification.
Network analysis using the Ingenuity server indicated that of the 24 proteins identified six did not belong to a network that could be produced from the information available in their database. One of those six had no information available in the database; the remaining five had information in the database, but it could not connect them to each other or to the two networks within the limitations on network size of no more than 35. Each of these six proteins could be false positives or simply reflect that the database or the scientific community does not know about connections that actually exist. The latter problem is particularly large for the use of network analysis on our transcriptome data because the Affymetrix arrays contain a large number of expressed sequence tags whose gene products have not yet been studied. Finally it is also possible that the non-connected proteins could represent false identification of a differentially expressed protein such as may occur if more than one protein is present in a spot.
The Ingenuity server restricts the size of networks to a maximum of 35 genes and attempts to create networks that contain the highest possible number of genes from the uploaded list (called focus genes). Our proteome data produced two networks containing 10 and eight focus genes, respectively. With scores of 18 and 14 the probability of creating these networks by chance was very low. A quick look at the information in the Ingenuity database clearly indicated that many genes in either network were connected to genes in the other network and that it would be appropriate to consider this a single network of 70 genes (18 of which were differentially expressed). The reasons that we do not detect all the genes in these networks as differentially expressed at the protein level could be that they are not expressed in high enough amounts to be detected on a 2D gel, or perhaps any differential expression is simply not large enough to be determined by this technology.
It is interesting that the biological functions of some of the central genes in this proteome network are centered around the opposing biological processes of cell proliferation (oncogenes Myc and Mycn) and cell death (Bcl2 and Casp3). Based on our proteome data one could hypothesize that a struggle between these two processes may lead to the abnormal expression we observe for a number of genes at the protein level. As with any other discovery dataset the hypothesis produced will have to be tested by traditional scientific methods. Indeed more work is needed just to create a specific and testable hypothesis or model that may involve mutual interactions and regulatory effects of some or all of the 70 genes in this proteome network.
To further support the conclusions regarding biological processes and specific networks that differ between NOD and C57BL/6 mice, we also conducted analysis of the gene expression at the mRNA level in the same 20 samples on which we had conducted proteome analysis. The network analysis of a gene list produced by stringent statistical analysis of these data produced six networks. The proteome networks shared one or more genes with all of these transcriptome networks except for transcriptome network 1. This sharing included that both transcriptome and proteome networks contained the oncogenes Myc and Mycn. However, the differentially expressed proteins and transcripts pointing toward those central genes were different. Although expression of a gene at the mRNA and protein level does not always correlate, other reasons may explain why we have so little overlap between lists of differentially expressed proteins and transcripts: 1) spots on 2D gels often only represent a single post-translationally modified version of a protein, not the total expression of a gene at the protein level, 2) the analysis approach and statistical threshold used on the two types of data are very different, 3) our proteome analysis is limited to proteins that have pI of 4-7, and 4) the fraction of the total proteome covered is much smaller than the fraction of the total transcriptome covered. A final point regarding the proteome and transcriptome based networks is that genes from each of the eight networks have known connections to multiple genes from other networks. It is clear that the whole system of eight networks has such a degree of connection to each other that they can be viewed as a single network containing 272 genes of which 82 are differentially expressed.
Each of these eight networks is assigned functions based on the known functions of their members and the statistical likelihood that a specific function would be represented with that frequency by chance. We compared the three highest ranking (most significant) functions over-represented in these networks and found that most of them centered on the processes of cell proliferation and death. Signaling between and within cells was the second systems level process that could be seen in those data. In contrast the process of immune response was only listed as third rank in a single network, suggesting that at this young age the excessive activation of the immune response in NOD mice is still rather minimal.
Defects in the process of apoptosis have been observed in NOD mice, and they have been hypothesized to be important in explaining why autoreactive leukocytes in this mouse persist (27-29). Indeed one report indicated that defects in apoptosis could be a common theme for many autoimmune diseases (30, 31). The oncogenes Myc and Mycn are central to both proteome and transcriptome networks created by our data. The resistance of lymphocytes from NOD mice to glucocorticoid-induced apoptosis was associated with an up-regulation of Myc in contrast to C57BL/6 mice where Myc was down-regulated (32), and treatment of NOD mice with the streptococcal wall component OK432 prevents diabetes, restores dexamethasone-induced apoptosis, and was shown to be associated with down-regulation of Myc (33). Our systems biology data suggest that these apoptosis defects may be basic defects that are manifested even before the processes of immune response and immune activation become dominant. Our data also point to an association of these apoptosis defects with the Myc/Mycn oncogenes and demonstrate a large number of new genes that may be affected via these oncogenes. Furthermore many of the differentially expressed genes that are directly connected to these oncogenes are translation factors and ribosomal proteins. One of these, the ribosomal protein RPL8, appears to be expressed at very low levels in NOD mouse leukocytes, suggesting that basic cellular functions could be affected, probably via Myc/Mycn.
Our data support the idea that the balance between cell death and cell survival is defective in NOD mice and suggest that this is an early and central defect that is manifest in peripheral leukocytes even before the immune system is activated. Our data also point to specific networks that may be worth further investigation to gain a deeper insight into the origins of this dysfunction. We cannot rule out that some of the differences between NOD and C57BL/6 mice found in this study could represent genetic diversity not connected to autoimmunity at all. However, many of the differentially expressed genes are clearly connected to processes and networks known to be associated with development of autoimmunity. Comparison of NOD to other control strains should help determine, for each specific network/process, whether it is more connected to benign genetic divergence than it is to susceptibility to autoimmunity.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, October 16, 2005, DOI 10.1074/mcp.M500197-MCP200
1 The abbreviations used are: 2D, two-dimensional; NOD, non-obese diabetic; PCA, principal component analysis; MTC, multiple test correction; ANOVA, analysis of variance; SSP, standard spot.
* This work was supported by NIDDK, National Institutes of Health Grant DK62103 and National Center for Research Resources Grant RR15373.
|| To whom correspondence should be addressed: Div. of Endocrinology, University of Tennessee Health Science Center, VAMC Research 151, 1030 Jefferson Ave., Memphis, TN 38104. Tel.: 901-523-8990 (ext. 5088); Fax: 901-577-7273; E-mail: igerling{at}utmem.edu
| REFERENCES |
|---|
: analysis of proteins involved in insulin resistance. J. Lab. Clin. Med. 145, 275-283