Molecular & Cellular Proteomics 8:451-466, 2009.
| ABSTRACT |
|---|
Several genomics and proteomics approaches to identifying biomarkers for HD have been undertaken previously. Genomics studies have determined the molecular phenotype of human HD brain (2) and different tissues of HD mouse models at the mRNA level (3–6). Proteomics approaches have been applied to brain tissues of HD mouse models and humans to identify candidate markers (7–9). Blood plasma in particular has received considerable attention recently because of its ready accessibility clinically (10, 11). The candidate protein biomarkers identified in the blood proteomics studies are largely known inflammatory markers. Because HD is regarded primarily as a neurodegenerative disease, it is not entirely clear how directly general markers of neuroinflammation relate to the pathophysiology of HD, although astrocytosis and microgliosis (12) are prominent components of HD in its mid- to late stages (13). Another concern regarding markers discovered primarily in blood is that the blood-brain barrier may restrict brain proteins from entering plasma, and so plasma candidates may not directly reflect HD progression in the brain.
Cerebrospinal fluid (CSF) is a more relevant biomaterial for biomarker discovery because it is proximal to the brain; it occupies the subarachnoid space of the central nervous system and the ventricular system around and inside the brain. Changes in CSF proteins have been identified for several diseases (14–17), and oligoclonal bands in CSF have long been used to aid in diagnosis of multiple sclerosis and encephalitis (18–20). CSF is an ultrafiltrate of arterial blood produced by the choroid plexus in the lateral, third, and fourth ventricles. However, it has been estimated that about 20% of the proteins in CSF are derived from brain (21), making CSF an attractive source of potential disease biomarkers in neurodegenerative diseases such as Alzheimer and Parkinson diseases (16, 22, 23). We report here an integrated proteomics approach to characterize the constituents of CSF and identify potential markers in CSF for human Huntington disease.
In this study, we analyzed and interpreted human HD CSF proteomics data generated by four laboratories using different proteomics approaches, including separation strategies, pooling strategies, depletion of proteins, quantitation methods, and mass spectrometry instruments. Although acquired using different biochemical approaches, all data were interpreted using a common protein database, algorithms for database search (24), and peptide and protein identification (25, 26) and quantitation (27) methods to allow comparison across laboratories.
The preplanned primary analysis of these data includes deriving rankings for protein changes in HD based on the synthesized data from all laboratories and then assessing biological and statistical significance by interrogating the rankings with gene annotations derived from independent data sets (e.g. gene set enrichment style analyses (28)). Annotations include the tissue specificity of a gene (e.g. brain or liver) and whether a gene is significantly changed in human HD brain when compared with non-HD brain, both derived from previously published data sets profiling the transcripts of normal human tissues (29) and human HD and non-HD brains (2).
This analysis reveals that proteins that have specifically high expression in the brain (brain-specific) are 1.8 times more enriched in CSF than in human plasma. These brain-specific proteins overall have lower concentrations in HD than normal CSF, and 81% of them are concordant with previously identified mRNA changes in HD versus normal brain. Altogether these results suggest that measuring proteins in CSF may be a useful way to assess the health of the brain, track progression of the disease, and improve our understanding of the disease.
Secondary analysis was also performed to investigate the concordance of protein changes across laboratories. Overall at the protein (e.g. International Protein Index (IPI) sequence) level, there is a low concordance of the disease/control ratios among laboratories, meaning that each laboratory would report different highest (or lowest) ranking proteins. However, the laboratories are consistent with respect to the overall trends that brain-specific proteins decline and liver-specific proteins increase in HD samples. Concordance in protein ratios and overall trends is greatest between the two laboratories identifying the highest number of proteins. This supports an argument that future studies with resource limitations should emphasize the depth of protein coverage and include multiple laboratories only if their experimental methods complement each other in terms of the protein observations (30). Our study also suggests that data integration plays a central role in studies to identify biomarkers as statistical significance could not have been demonstrated without the ability to evaluate changes in predefined groups of proteins using the gene set enrichment analysis (GSEA) approach.
| EXPERIMENTAL PROCEDURES |
|---|
80). About 5–7 ml of CSF was collected in four or five standard lumbar puncture kit tubes (CardinalHealth, safe-t-LP kit; catalog number 4301CSDF). Each collected sample was placed on ice and then centrifuged at 2000 × g (4000 rpm) for 10 min to eliminate cells and other insoluble material. The collected CSF was examined by microscopy, aliquoted, and frozen immediately on dry ice in polypropylene tubes in 1- or 3-ml aliquots and stored at –80 °C. Tubes were filled to the top to minimize oxidation during storage. Average total processing time was 76 min from the start of collection to final storage. No anticoagulants, preservatives, or protease inhibitors were added. The lumbar punctures were atraumatic with CSF cell counts revealing red blood cells from 0 to 171 counts/µl and white blood cells from 0 to 17 counts/µl, indicating no significant blood cell contamination (supplemental Table S1). Samples were stored at –80 °C for various lengths of time ranging from 17 to 27 months before they were thawed and subdivided into aliquots of 0.5 ml to be shipped on dry ice to individual labs for analysis. The duration of shipment was between 1 and 3 days, and dry ice was replenished during the shipment to keep samples frozen. All 30 samples were stored at –80 °C before thawing once again for analysis. Therefore, except for one sample (HDU-2) that was thawed three times, all samples were thawed twice before analysis.
Proteomics Platforms
Five different laboratories received aliquots of CSF collected from the same 30 individuals described above. The disease statuses of the 10 HD gene-negative controls, 10 HD gene-positive early stage samples, and 10 HD gene-positive mid-stage samples were blinded to the research laboratories and labeled as group B, C, and A, respectively, with identifiers provided to labs only after raw data from all labs were received by the bioinformatics data core analysis group.
Each lab designed experiments based on their preferred comprehensive proteomics platform(s) (Table I) for the purpose of discovering biomarkers that can classify Huntington disease. These approaches include four quantitative mass spectrometry approaches and one gel-based quantitative approach that used the mass spectrometer for protein identification. Experimental designs varied in many respects across the labs, including the use of pooled and non-pooled designs, depletion and non-depletion of abundant proteins in the samples, and label-free and isotopically labeled quantitation. After evaluation, data from four laboratories were reported in this study. One of the five laboratories reported quality control issues that were also detected in the data analyses (fewer than 80 total peptides were identified), and therefore this data set was not considered for further analysis. Detailed experimental designs of the other four laboratories are described in the supplemental text, Part A.
The following common criteria were applied to all searches. A ±2.0-Da error from the calculated peptide monoisotopic mass was allowed to determine whether a particular peptide sequence is to be considered as a possible model for a spectrum. Mass tolerance for fragment ions was 1 Da (24). The maximal number of missed cleavages permitted was 2. A static modification on cysteines of +57.021 Da was used for all labs except Lab 4, which performed acrylamide labeling on cysteines. A potential modification on methionines of +15.9949 Da was used. A weighted average mass was used to calculate the masses of the fragment ions in a tandem mass spectrum. A minimum number of one ion was required for a peptide to be scored. A default minimum PeptideProphet probability of 0.2 was used to calculate the protein group probability. Only peptides with probability >0.75 and mass error <20 ppm were selected for quantitation. Specific search parameters for the different labeling schemes by different labs are specified in the supplemental text, Part B. The following are descriptions of the software used for each lab's data and quantitation methods (summarized in supplemental Table S2) during data processing.
Lab 1—
Lab 1 used a DIGE method and quantitated a large number of fluorescent spots with two commercial software algorithms (DeCyder 6.5 (GE Healthcare) and SameSpots 2.0 (Nonlinear Dynamics)). A t test was performed on the log ratios using Statistica for Windows (StatSoft, Inc.) version 7 to estimate the significant difference of a protein. Only those spots found to have significant changes, based on fluorescence, between the HD and the control were selected for tandem MS analysis. Peak lists of MS/MS spectra acquired on the HCT-Ultra ion trap instrument (Bruker Daltonics) were generated using the software tool DataAnalysis 3.4.179 (Bruker Daltonics) with default parameters. The built-in algorithm version 2.0 was used, and neither smoothing nor any signal-to-noise filter was applied for compound detection. A maximum charge state of 3 was considered for deconvolution. Data were then converted to mzXML files using CompassXport (version 1.2.3). Peptides were identified using decoy database methods with an approach described by Elias and Gygi (34). After proteins were identified for selected spots using false discovery rate (FDR) and ProteinProphet (see details in the supplemental text, Part B), quantitation of -fold changes among different disease statuses was processed based on the following rules. 1) If a spot was quantitated by both DeCyder 6.5 and SameSpots methods, -fold changes by DeCyder 6.5 were selected because the differential expression resulting from this method is more significant on average. On the other hand, if only one quantitation method was used, results from that method were selected. 2) When multiple spots have the same protein identification, -fold changes were averaged for that protein. 3) When a spot resulted in several protein identifications, the same -fold changes were assigned to all proteins.
Lab 2—
Lab 2 performed label-free analysis. Peak lists were generated using MassLynx (4.0) based on signals obtained by Q-ToF Micro spectrometer from Waters Micromass. The MS duty cycle was set at 1,1,4 as described in detail in the supplemental text, Part A. These data were converted to mzXML files using MassWolf 1.02 with the Waters Datafile Access Component (DAC) library. Following the database search using the common criteria, quantitation was performed using a spectral count approach (35), which sums the number of total spectra assigned to the protein group in that sample. Only peptide spectra with PeptideProphet probability greater than 0.75 or an error rate of 5% were counted for each IPI entry identified. Because individual level variation can be determined for this design, we used a straightforward procedure that tracks all proteins that are members of a single ProteinProphet group within the experiment to associate groups across multiple experiments and to flag groups that are not directly comparable (supplemental text, Part B). After summing up to master protein groups across samples, the average spectral count of all IPI entries within a "master group" was assigned to that group. For each master group, total spectral counts from 10 HD-mid samples, 10 HD-early samples, and 10 control samples were summed up and used to calculate HD-mid/control, HD-early/control, and HD-mid/HD-early ratios. Intensity-dependent ratio plots (MA plots) and histograms of light/heavy ratios were examined to ensure the quality of data for the labeled experiments. M is the y-axis and A is the x-axis, where in this paper, M = Log2 (Heavy) – Log2 (Light); A = ½ (Log2 (Heavy) + Log2 (Light)).
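The M and A coordinates defined above can be computed per peptide as in the following minimal sketch. The function name `ma_values` and the example intensities are illustrative assumptions and do not come from the study's software.

```python
import math

def ma_values(light, heavy):
    """Return (M, A) for one peptide from its light and heavy intensities.

    Hypothetical helper: M is the log2 heavy/light ratio (y-axis) and
    A is the mean log2 intensity (x-axis), as defined in the text.
    """
    log_light = math.log2(light)
    log_heavy = math.log2(heavy)
    m = log_heavy - log_light
    a = 0.5 * (log_heavy + log_light)
    return m, a

# Illustrative intensities only (not measured data).
m, a = ma_values(light=1000.0, heavy=4000.0)
print(f"M = {m:.3f}, A = {a:.3f}")
```

In an MA plot of a well-behaved labeled experiment, the M values would be expected to scatter around 0 across the range of A.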
Lab 3—
Data from Lab 3 were acquired from the LTQ-FT instrument (Thermo Finnigan). Peak lists were generated by Xcalibur (version 1.1) and converted to mzXML files using ReAdW 1.1 with XRawfile library. Default parameters were used. Database search was carried out using the common parameters and the designated specific modifications (supplemental text, Part B). Protein groups with probability score >0.9 (corresponding to an overall error rate of 0.01) were considered confident proteins for downstream analysis. The Q3 algorithm (27), developed to accommodate a 3-dalton mass shift in heavy and light peptides, was used to compute the ratios between the light and heavy isotopic pairs using peak areas. More specifically, only confidently identified peptides (PeptideProphet probability >0.75 and mass error <20 ppm) were selected for further quantitation at the protein level. In three pairwise comparisons, the internal standard (IS) containing equal amounts of 30 samples was labeled with light acetyl group, and each of the three disease statuses (A, B, or C) was labeled with heavy acetyl group (for details, see supplemental text, Part A). Preliminary analysis found that light/heavy ratios were skewed for both IS versus HD-early and IS versus HD-mid experiments. Because the same amount of protein was loaded into the MS instrument and, in theory, most proteins in the disease and control should remain unchanged, we normalized these ratios. Normalization at the peptide level was performed by median centering the log ratios. Experiments were then aligned to infer the protein changes of different disease status comparisons. Protein inference was performed using the ProteinProphet analysis tool at the lab level, and light/heavy ratios of these protein groups were inferred from ratios at the experiment level by comparing IPI numbers. 
HD-mid versus control, HD-early versus control, and HD-mid versus HD-early ratios for each protein group were calculated using IS/control, IS/HD-mid, and IS/HD-early ratios from the three experiments.
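The two arithmetic steps described for Lab 3, median centering of peptide log ratios and derivation of disease/control ratios through the shared internal standard, can be sketched as follows. This is an illustration under stated assumptions, not the Q3 or ProteinProphet code: the function names and example values are hypothetical, and each ratio is assumed to be expressed as IS/status.

```python
from statistics import median

def median_center(log_ratios):
    """Shift peptide log ratios so that their median is 0."""
    med = median(log_ratios)
    return [r - med for r in log_ratios]

def disease_over_control(is_over_control, is_over_disease):
    """Derive disease/control from two ratios sharing the internal standard.

    (IS/control) / (IS/disease) = disease/control, because the IS cancels.
    """
    return is_over_control / is_over_disease

# Illustrative values only (not measured data).
centered = median_center([0.5, 1.0, 1.5])
ratio = disease_over_control(is_over_control=1.0, is_over_disease=2.0)
print(centered, ratio)
```

A protein whose IS/HD-mid ratio is twice its IS/control ratio is therefore half as abundant in the HD-mid pool as in the control pool.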
Lab 4—
An LTQ OrbiTrap XL mass spectrometer from Thermo Finnigan was used, and Xcalibur (version 2.2) was applied to generate peak lists. Data were converted to mzXML files using ReAdW 1.1 with XRawfile library. Default parameters were used. Data were then searched with the common criteria plus specific modifications. Protein groups with probability score >0.9 (corresponding to an overall error rate of 0.01) were considered confident proteins for downstream analysis. For the labeled analysis, as with Lab 3, the Q3 algorithm (27) was used to compute ratios between light and heavy isotopic pairs. As with Lab 3, peptides with PeptideProphet scores greater than 0.75 and a mass error of less than 20 ppm were selected for the protein level quantitation. Histograms of light/heavy ratios based on the number of cysteines in the peptides revealed that peptides with one cysteine have better normal distributions than peptides with more than one cysteine. Because 80% of peptides contain only one cysteine, only peptides with one cysteine were selected for protein level analysis. To ensure a high confidence of quantitation at the protein level, only those proteins with at least three quantitated peptides were used for further analysis. For the label-free analysis, accurate mass and time (AMT) methods were used to identify peptides in LC-MS data using a single AMT database containing all high quality (PeptideProphet probability ≥0.95) peptide identifications from all labeled (fractionated) and unlabeled (unfractionated) data from Lab 4. The LC-MS peptide features from each unlabeled sample were matched against the combined AMT database to provide peptide assignments. Each match was assigned a probability value based on mass error and normalized retention time error between the MS1 feature and the AMT peptide entry, and only matches with probability ≥0.95 or a false assignment rate ≤0.05 were kept. LC-MS peptide intensity values for the peptides were normalized across runs, and peptide ratios were calculated. The AMT database and matching were performed using the msInspect/AMT software platform (36, 37). Protein inference was performed using ProteinProphet, and protein ratios were calculated as well. As with the Lab 3 analysis, experiments were aligned using the ProteinProphet analysis tool at the lab level to infer the protein changes of different disease status comparisons for the labeled and unlabeled methods. Ratios of these protein groups were inferred from ratios at the experiment level by IPI numbers.
Gene Name and Group Assignments
Proteins, identified by their IPI sequence, were assigned to gene symbols by IPI protein cross-reference. Because the peptide level evidence cannot uniquely identify all IPI sequences, proteins were assembled by ProteinProphet into groups (26). Some of these protein groups contain unique protein sequences and gene symbols, and some contain multiple sequences that may result from the same gene symbol or from multiple genes within the same family (e.g. a protein group is assigned with HBG2, HBE1, HBG1, HBB, and HBD, all of which belong to the hemoglobin gene family) or multiple incompatible genes.
Deriving a List of Consensus Proteins
To facilitate comparison across labs, a comprehensive list of protein groups consisting of all proteomics data reported in this study was generated by running ProteinProphet on all data. A minimum probability of 0.9 was used to generate the confident protein group list, resulting in an overall error rate of 0.01. A total of 1574 protein groups were identified, corresponding to 2012 gene symbols (supplemental Table S3). Only 34 of the protein groups were based on single peptides. Scores, sequences, and spectra for these single peptide-based proteins are provided in supplemental Table S4. HD-early/control, HD-mid/control, and HD-mid/HD-early ratios for each protein group were inferred from lab level analysis by comparing IPI numbers. For a few cases, multiple ratios were found for a protein group in HD-early versus control, HD-mid versus control, or HD-mid versus HD-early comparisons, and in those cases we took the geometric mean of these ratios. Detailed results of this search are available upon request. We have also deposited the comprehensive list of proteins at the PRoteomics IDentifications (PRIDE) database (accession number 3701).
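For the cases in which several ratios were found for one protein group, the geometric mean can be computed as in this minimal sketch. The helper name and the example ratios are illustrative assumptions.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed on the log scale."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Illustrative ratios only: a 2-fold decrease and a 2-fold increase
# average to no change, which an arithmetic mean (1.25) would not give.
print(geometric_mean([0.5, 2.0]))
```

The geometric mean treats -fold increases and decreases symmetrically, which is why it suits ratio data better than an arithmetic mean.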
Deriving the Rank of Protein Changes by Combining Ratios across Laboratories
Because various methods report protein changes on different scales, to make an effective and meaningful comparison (38) we integrated protein ratios across laboratories using the following meta-analysis procedures that combine the scale-free effect size measurements (z-scores). We first transformed the ratios to logarithm scale so that all the data were in a similar range, and then we standardized the scores by centering and scaling log ratios in each lab to have mean 0 and variance 1. The resulting z-score represents the number of standard deviations above or below the mean ratios within each lab. To combine z-scores, we chose to sum them across laboratories. Although one might have considered averaging the z-scores because not all proteins are quantified in the same number of experiments, we chose to sum them so that a protein quantified by only one laboratory must have a higher ratio to achieve the same rank as proteins observed across all laboratories with consistent and modest changes. For the purpose of finding proteins that classify HD, we combined the two HD groups and summed together the sum of z-scores of HD-mid/control and HD-early/control (sums of z-scores). Proteins that are most altered in HD are those with the highest and lowest sums of z-scores.
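The standardize-then-sum procedure can be sketched as below. This is a minimal illustration, not the study's code: the function names, protein labels, and ratios are hypothetical, the log base (2) is an assumption, and the population standard deviation is used because the text does not say whether a sample or population estimate was applied.

```python
import math
from statistics import mean, pstdev

def lab_z_scores(ratios):
    """Standardize one lab's log ratios to mean 0 and variance 1.

    ratios: dict mapping protein -> disease/control ratio for one lab.
    Assumes at least two proteins with differing ratios (nonzero spread).
    """
    logs = {protein: math.log2(r) for protein, r in ratios.items()}
    values = list(logs.values())
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    return {protein: (v - mu) / sd for protein, v in logs.items()}

def sum_z_scores(labs):
    """Sum z-scores across labs; a protein missing from a lab adds nothing."""
    totals = {}
    for ratios in labs:
        for protein, z in lab_z_scores(ratios).items():
            totals[protein] = totals.get(protein, 0.0) + z
    return totals

# Illustrative ratios only (hypothetical proteins P1..P3, two labs).
labs = [
    {"P1": 2.0, "P2": 1.0, "P3": 0.5},
    {"P1": 4.0, "P2": 1.0, "P3": 0.25},
]
totals = sum_z_scores(labs)
ranked = sorted(totals, key=totals.get, reverse=True)
print(ranked, totals)
```

Although the second hypothetical lab reports ratios on a wider scale, both labs contribute equally after standardization. In the study, the same calculation would be applied to the HD-mid/control and HD-early/control comparisons, and the two sums added.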
Annotating Proteins for Tissue Specificity and for Changes in HD Brain
We next annotated each protein in the synthesized protein list based on its behavior found in human transcriptional profiles of HD and normal brain (2) and other tissues (29). Specifically we annotated proteins based on 1) tissue specificity score: the relative expression of transcript in human brain, liver, and 25 other normal human tissues; and 2) changes in HD brain: the ratio of transcript abundance between HD and non-HD brains. All annotations were made by comparing gene symbols. IPI numbers identified in our proteomics study were associated with gene symbols by reference to the data (ipi.HUMAN.v3.20.dat) provided by the International Protein Index managed by the European Bioinformatics Institute.
Annotation of Protein Changes in CSF with Changes in the HD Brain by Microarray Analysis—
We compared protein changes from this proteomics study with the log2 -fold changes of mRNA in HD versus normal brain in caudate nucleus, cerebellum, and motor cortex (Brodmann area 4 (BA4)) based on a previously published study by Hodges et al. (2). Probe sets with significant changes (p values <0.001) were selected and collapsed to gene level based on gene symbols. When a gene had multiple probe sets, the median log2 -fold change across those probe sets was used as the estimate of the mRNA change for that gene. As a result, 6183 genes in caudate nucleus, 1143 genes in BA4 cortex, and 440 genes in cerebellum showed significant mRNA changes in HD relative to normal brain.
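The filtering and collapsing of probe sets to gene level can be sketched as follows. The function name, the tuple layout, and the gene labels are illustrative assumptions; only the p value cutoff and the use of the median come from the text.

```python
from collections import defaultdict
from statistics import median

def collapse_to_genes(probe_sets, p_cutoff=0.001):
    """Keep significant probe sets and take the median log2 change per gene.

    probe_sets: iterable of (gene_symbol, log2_fold_change, p_value).
    """
    by_gene = defaultdict(list)
    for gene, log2_fc, p_value in probe_sets:
        if p_value < p_cutoff:
            by_gene[gene].append(log2_fc)
    return {gene: median(changes) for gene, changes in by_gene.items()}

# Illustrative records only (hypothetical genes and values).
probe_sets = [
    ("GENE_A", -1.0, 0.0001),
    ("GENE_A", -2.0, 0.0005),
    ("GENE_A", -0.1, 0.2),     # not significant, dropped
    ("GENE_B", 0.8, 0.0002),
    ("GENE_C", 0.3, 0.05),     # not significant, gene omitted
]
print(collapse_to_genes(probe_sets))
```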
Signal Processing of Tissue Transcriptional Expression Data—
Because a significant fraction of plasma proteins is derived from various tissues, as are the CSF proteins that are filtered from plasma, annotating proteins by source tissue requires a transcriptional data set that includes the major human organs. The normal human tissue expression data set was acquired from published data provided by Ge et al. (29) that includes a total of 36 types of normal human tissue, covering the complexity of human tissues. Because all CSF samples are from adult humans, the annotation focuses on the adult tissue expression pattern. Therefore, data from three fetal tissues were removed. Ge et al. (29) examined a "whole brain" and six subregions of the brain (amygdala, corpus callosum, caudate nucleus, cerebellum, hippocampus, and thalamus). However, the cortex region that is noted to have severe cell loss in Huntington disease patients was not included. We removed the "whole-brain" data because it is not obvious to us how this RNA sample was prepared. We added the cortex signals and substituted cerebellum and caudate nucleus data with data derived from an extensive study that was carried out on four parts of human normal brain (caudate nucleus, cerebellum, and BA4 and BA9 cortex) using the same microarray chips (2). After these brain data were obtained from Gene Expression Omnibus (GEO) DataSets, signals detected using Affymetrix microarray suite version 5 software (MAS5) for each probe were averaged over 21 caudate nucleus, 21 cerebellum, and 24 cortex (12 BA4 and 12 BA9) arrays. We plotted log2 MAS5 signal of the caudate nucleus and cerebellum from Ge et al. (29) versus those from Hodges et al. (2) and found that the correlations are both 0.90 (supplemental Fig. S1). These high correlations suggest that data from the two studies may be combined. At this point, we had data for seven subregions of the brain.
Because we want to annotate a gene as being active in the brain if it is active in any part of the brain, we summarized the brain expression data by taking the maximum across all seven subregions. As a result, transcriptional data of "brain tissues" and 26 other types of human tissues were included in our analysis. These tissues are from brain, heart, thymus, spleen, ovary, kidney, muscle, pancreas, prostate, intestine, colon, placenta, bladder, breast, uterus, thyroid, skin, salivary, trachea, adrenal, bone marrow, pituitary, spinal cord, testis, liver, stomach, and lung (supplemental Table S5). Finally many genes have multiple probes; one can choose to use the average signal or the maximum signal for each gene. In our analysis, the one with maximum signal among all tissues was selected because we considered that the maximum signal is the highest above the noise level. As a result, we observed 1941 tissue markers based on human array data (supplemental Table S6). The definition of tissue-specific genes/proteins may vary with tissues included in the study and when the thresholds change.
Defining Tissue Specificity of a Gene—
Tissue specificity was derived from the human transcriptional data set profiling seven brain tissues plus 26 other normal tissues based on publicly available resources after processing as described above. The tissue specificity score for a gene is determined by the relative intensity of its probe on the array across tissues. We defined a gene as tissue-specific if the maximum intensity of its probe was highest in that tissue and the maximum intensity in every other tissue was at least 2.5 times lower. These tissue-specific genes were used to annotate the CSF proteome.
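The tissue specificity rule, together with the collapsing of brain subregions by maximum signal described above, can be sketched as follows. The function names, tissue labels, and signal values are illustrative assumptions; only the 2.5-fold threshold and the maximum-over-subregions rule come from the text.

```python
def collapse_brain(signals, brain_regions):
    """Replace brain subregion signals with a single 'brain' maximum."""
    collapsed = {t: s for t, s in signals.items() if t not in brain_regions}
    collapsed["brain"] = max(signals[r] for r in brain_regions)
    return collapsed

def specific_tissue(signals, fold=2.5):
    """Return the tissue a gene is specific to, or None.

    A gene is called tissue-specific when its highest signal is at least
    `fold` times the signal in every other tissue, which is equivalent to
    comparing the top signal against the runner-up.
    """
    ranked = sorted(signals.items(), key=lambda kv: kv[1], reverse=True)
    (top_tissue, top_signal), (_, runner_up) = ranked[0], ranked[1]
    return top_tissue if top_signal >= fold * runner_up else None

# Illustrative signals only (hypothetical gene, arbitrary units).
raw = {"caudate": 900.0, "cerebellum": 1200.0, "liver": 300.0, "heart": 250.0}
per_tissue = collapse_brain(raw, brain_regions={"caudate", "cerebellum"})
print(specific_tissue(per_tissue))                       # brain
print(specific_tissue({"brain": 500.0, "liver": 300.0}))  # None
```

Taking the maximum over subregions means a gene counts as brain-expressed if it is active anywhere in the brain, as stated in the text.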
Verification of Protein Tissue-specific Annotation—
The definition of tissue-specific genes/proteins may vary with tissues included in the study, with detailed processing methods, and when thresholds are changed. To validate our definition of tissue-specific genes, we checked the description and functions of some genes chosen at random. For instance, muscle-specific genes include myosin, actinin, troponin, creatine kinase, and calcium channels, and testis-specific genes are associated with terms such as spermatogenic, sperm-specific, or male-enhanced antigen. In addition, normal mouse gene expression data from Zapala et al. (39) covering 20 subregions of the brain plus 14 tissues from other parts of the adult mouse were used to confirm the human array-based analysis. As in the human tissue data processing, we combined data from 20 subregions of the brain (including striatum, cortex, and cerebellum) into one data set called "brain" to simplify the analysis, using the maximum MAS5 signal of the 20 brain subregions computed for each probe set. As a result, 15 mouse tissues were considered for further analysis: adrenal, pituitary, testis, thymus, spinal cord, choroid plexus, retina, brown adipose tissue, white adipose tissue, kidney, liver, heart, muscle, spleen, and brain (supplemental Table S7). Corresponding human orthologous genes were inferred from human-mouse ortholog data provided by the Human Genome Organisation Gene Nomenclature Committee. Using the same tissue specificity-defining methods, 1333 tissue markers with human orthologous genes were observed (supplemental Table S8).
Summary of Statistical Analysis Procedures Used to Interrogate the Protein List
Statistical procedures were used to interrogate the synthesized protein list for three hypotheses that the experiments were designed to address: 1) that CSF is enriched for brain-specific proteins compared with plasma, 2) that brain-specific proteins change in CSF with HD development, and 3) that brain-specific protein changes in CSF are concordant with transcriptional changes in the brain.
To evaluate hypothesis 1, we performed Pearson's χ² tests on an over-representation analysis to compare the fraction of proteins annotated as being brain-specific in CSF and the Human Proteome Organization Plasma Proteome Project plasma data (results summarized in Table IV). To evaluate hypothesis 2, we used GSEA, which uses a non-parametric Wilcoxon test (40, 41) to compare the distribution of ratios between brain-specific and non-brain-specific proteins (see "Results"). For hypothesis 3, we coded all brain-derived proteins as up or down based on the sign of their sums of z-scores and all transcripts as up or down based on the sign of their log2 -fold changes of mRNA from the array and then applied Pearson's χ² tests to evaluate the association (results are summarized in Table V).
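A Pearson's chi-square test on a 2 × 2 table, the form used for hypotheses 1 and 3, can be sketched in plain Python as below. The counts are hypothetical and are not those of Tables IV or V, the function name is an assumption, and no continuity correction is applied because the text does not state whether one was used.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson's chi-square statistic and p value for the table [[a, b], [c, d]].

    Uses the closed form for 2 x 2 tables without continuity correction
    (an assumption). With 1 degree of freedom, the upper tail probability
    equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts only: rows are two fluids, columns are
# brain-specific versus other tissue-specific proteins.
chi2, p_value = chi_square_2x2(30, 70, 10, 90)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

For hypothesis 3, the same function would take a table whose rows are the direction of the protein change (up or down) and whose columns are the direction of the mRNA change.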
| RESULTS |
|---|
Analysis of Proteomics Data—
Search engine performance and PeptideProphet details (25) were inspected (supplemental Fig. S2A) to assure that sensitivity and error distributions were sufficient to determine correct and incorrect identifications from Labs 2, 3, and 4. The quality of data quantitation was determined by examining MA plots, histograms of light labeled/heavy labeled ratios at the peptide level, and histograms of the HD-early/control, HD-mid/control, and HD-mid/HD-early ratios at the protein level (supplemental Fig. S2, B–D). We observed that the distributions of logarithms of ratios are around 0 before normalization for all experiments except two interrogations from Lab 3. The log ratios for these two data sets were normalized to have a median of 0.
Lab 1 used a DIGE method (see "Experimental Procedures") and quantitated a large number of fluorescent spots. Only those spots found to have significantly different changes based on fluorescence between the HD and control CSF were selected for tandem MS analysis. As a result, a total of 19 unique proteins for 42 spots were identified based on the MS/MS data (Table II), each of which is a putative biomarker candidate. As a verification of these protein identifications ("Experimental Procedures"), we compared them with the results provided by Lab 1 using the Mascot search engine (supplemental Table S9) and found that there is a high consistency between the two results. In addition, proteins from more spots have been identified in this study. Lab 2 is the only lab that performed individual (non-pooled) interrogation. A total of 335 confident protein groups (with ProteinProphet probability >0.9) were found, and of these, 319 were quantitated using spectral counting (35) of highly confident peptide spectra ("Experimental Procedures"). Lab 3 pooled samples by disease status and analyzed them by d0/d3 acetylation of the N terminus of the peptides. A total of 263 confident protein groups were found after aligning the three experiments using ProteinProphet. After protein ratios were inferred from individual experiments, 161 were confidently quantitated. Lab 4 pooled samples by disease status and gender. In three pairwise comparisons, proteins in pooled CSF of two disease statuses were differentially labeled with light and heavy acrylamide on cysteine residues. Because a more extensive prefractionation strategy was used, these data resulted in identification of the majority of proteins (1179 groups) reported in this study, of which 377 were confidently quantitated by the Q3 algorithm (27) ("Experimental Procedures"). In addition, Lab 4's label-free analysis using AMT methods (37) identified and quantitated 277 protein groups (Table II), 100 of which are not quantitated by the labeled approach (Table III).
The integrated analysis of all proteomics data was performed by a comprehensive search combining Lab 2, Lab 3, and Lab 4 labeled and unlabeled data. This resulted in a total of 12,430 peptides and 1574 high confidence protein groups identified (supplemental Table S3). The greatest overlaps in both proteins identified and proteins quantitated are between Lab 2 and Lab 4 labeled methods (Table III). From the across-lab comparison, 577 protein groups (corresponding to 762 genes) were quantitated by at least one experimental method, and 301 protein groups (419 genes) were quantitated by more than one method (Fig. 1). We synthesized protein ratios across laboratories by methods described under "Experimental Procedures." The resulting score (sum of z-scores) estimates the relative protein change in HD versus normal CSF. Specifically, a negative sum of z-scores indicates that the protein decreases in HD CSF compared with normal, and a positive sum of z-scores indicates that the protein increases. This score for each identified protein is shown in supplemental Table S3.
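As an illustration of how a sum of z-scores could be formed, the following Python sketch standardizes the log2 ratios within each experiment and adds the resulting z-scores for each protein. It is an assumption made for illustration only: the procedure actually used is the one described under "Experimental Procedures," and the function name, the per-experiment standardization, and the toy protein identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def sum_of_z_scores(experiments):
    """Sum per-experiment z-scores of log2(HD/control) ratios for each protein.

    `experiments` is a list of dicts mapping protein id -> log2 ratio.
    A protein contributes only from experiments in which it was quantitated.
    """
    totals = defaultdict(float)
    for log_ratios in experiments:
        values = list(log_ratios.values())
        if len(values) < 2:
            continue  # cannot estimate a spread from a single ratio
        center, spread = mean(values), stdev(values)
        if spread == 0:
            continue  # all ratios identical: no information about direction
        for protein, value in log_ratios.items():
            totals[protein] += (value - center) / spread
    return dict(totals)


# Toy log2 ratios from two hypothetical experiments (not data from this study).
experiment_a = {"P1": 1.0, "P2": 0.0, "P3": -1.0}
experiment_b = {"P1": 2.0, "P3": -2.0, "P4": 0.0}

scores = sum_of_z_scores([experiment_a, experiment_b])
for protein, score in sorted(scores.items()):
    trend = "increases" if score > 0 else "decreases" if score < 0 else "unchanged"
    print(protein, round(score, 2), trend)
```

In this toy case a protein seen as increased in both experiments accumulates a positive score, and one seen as decreased in both accumulates a negative score, which mirrors the sign convention stated above.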
Several human brain gene expression profiling data sets have been published in recent years (29, 43). We selected an expression data set generated by Ge et al. (29) from a total of 36 types of normal human tissues. Data on four regions of brain (amygdala, corpus callosum, hippocampus, and thalamus) and other adult tissues were selected and combined with data from a comprehensive study carried out on four parts of normal human brain (caudate, cerebellum, and BA4 and BA9 cortex) (2) ("Experimental Procedures"). Based on the algorithm described under "Experimental Procedures," 1941 tissue-specific proteins were identified, of which 445 are brain-specific and 225 are liver-specific (supplemental Table S6). Integration with the CSF proteomics data showed that 298 of the 1574 proteins/genes are tissue markers, among which the two major species are brain-specific (30%) and liver-specific (33%) (Table IV). One intriguing question is how this representation compares with that in plasma. Starting from a list of 3020 proteins identified with two or more peptides provided by the Human Proteome Organization Plasma Proteome Project, we aligned the 1941 tissue markers and found that, of 414 proteins annotated as tissue-specific, 17% are brain-specific and 53% are liver-specific (Table IV). Therefore, brain-specific proteins are 1.8-fold enriched in CSF over plasma, whereas liver-specific proteins are about half as represented in CSF as in plasma. Pearson's χ² test shows that brain-specific proteins significantly predominate in CSF compared with plasma (Table IV). This observation was also confirmed by performing the same analysis using the normal mouse gene expression data from Zapala et al. (39), which covered 20 neural tissues from the adult mouse central nervous system plus 14 tissues from other parts of the body (data not shown).
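The enrichment figures quoted above follow from dividing the share of each tissue among tissue markers in CSF by its share in plasma. The short Python sketch below repeats that arithmetic with the rounded percentages given in the text; the function and variable names are hypothetical, and the exact counts in Table IV may give slightly different values.

```python
def fold_enrichment(share_in_csf, share_in_plasma):
    """Ratio of a tissue's share among tissue markers in CSF to its share in plasma."""
    return share_in_csf / share_in_plasma


# Rounded shares quoted in the text (fractions of tissue-marker proteins).
csf_share = {"brain": 0.30, "liver": 0.33}
plasma_share = {"brain": 0.17, "liver": 0.53}

folds = {t: fold_enrichment(csf_share[t], plasma_share[t]) for t in csf_share}
for tissue, fold in folds.items():
    print(f"{tissue}: {fold:.2f}-fold in CSF relative to plasma")
```

With these rounded inputs the brain ratio is close to 1.8 and the liver ratio is close to 0.6, which the text summarizes as about half.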
Next we examined whether the brain-specific proteins are specific to any regions of the brain. Seven regions of the brain were included in this human tissue array data: amygdala, corpus callosum, hippocampus, thalamus, caudate, cerebellum, and cortex. Among 88 CSF proteins considered to be brain-specific, 29 are cerebellum-specific, 26 are cortex-specific, 12 are amygdala-specific, eight are caudate-specific, and 13 belong to the other regions. This suggests that more than 60% of these brain-specific proteins are specifically expressed in cerebellum and frontal cortex.
To have an overview of the relative abundance of these tissue-specific proteins/genes in CSF, we used the overall spectral count for each protein as a surrogate for the concentration. When the spectral counts were sorted in descending order, most of the liver-specific proteins (colored in green in Fig. 2A) show up at the top of the list. The brain-specific proteins (red) are distributed from upper middle to the bottom. The observation that liver-specific proteins are abundant makes sense because most CSF proteins come from plasma in which liver-derived proteins are highly represented.
Comparison of Protein Changes with Human HD Brain Transcriptional Profiling Data—
Gene expression changes in four brain regions of Huntington disease patients have been studied by Hodges et al. (2). Their results revealed that 21, 1, and 3% of probe sets were significantly differentially expressed in HD caudate, cerebellum, and BA4, respectively, and that no significant changes were found for BA9. An immediate question is how concordant the changes in HD patients are between the proteomics and microarray studies. Among the genes that are significantly differentially expressed in HD caudate, cerebellum, and BA4 cortex, 665, 57, and 165 of their products are identified in CSF, respectively. Because the most significant mRNA changes occur in HD caudate and the expression profile of HD BA4 is strikingly similar to that of HD caudate (2), we compared all CSF protein changes with HD caudate data and additionally looked at cerebellum- and cortex-specific proteins when mRNA expression data were available.
To examine the concordance of changes, we used a sign test that compares the signs of the sums of z-scores from our proteomics study with those of the log2-fold changes from the microarray study, because in both data sets positive values indicate an increasing trend of proteins/genes in HD and negative values indicate a decreasing trend. Proteins/genes for which the two values have the same sign were therefore considered concordant. Overall, about half (111 of 227) of the protein groups that have both sums of z-scores and significant mRNA changes are concordant (Table V). Among the 227 proteins, 47 have tissue annotations, and 16 are brain-specific. We found that 13 of 16 (81%) brain-specific proteins are concordant. However, only 42 and 47% of the proteins are concordant for non-brain tissue-specific proteins and proteins with unknown tissue origin, respectively. The χ² tests on the numbers of concordant and discordant proteins for 1) brain-specific versus other tissue-specific proteins and 2) brain-specific versus those that are not tissue markers both gave a p value
0.024, indicating that the consistency of expression changes in HD measured by the proteomics and genomics approaches is significantly greater for brain-specific proteins than for other proteins (Table V). This concordance suggests that these proteins might be derived from neurons or glial cells in the brain. Moreover, 11 of the 13 brain-specific genes that have concordant mRNA and protein changes decline in HD, consistent with the above observation that brain-specific proteins tend to decline in HD samples.
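The sign comparison and the 2 x 2 χ² tests described above can be sketched in Python as follows. Several details are assumptions made for illustration: the counts for the two comparison groups are reconstructed from the percentages quoted in the text (42% of 31 and 47% of 180) rather than taken from Table V, the Yates continuity correction is assumed, and the function names are hypothetical.

```python
import math


def is_concordant(z_sum, log2_fold_change):
    """Sign test for one protein: both values nonzero and of the same sign."""
    return z_sum * log2_fold_change > 0


def chi2_2x2(a, b, c, d, yates=True):
    """Chi-square test (1 degree of freedom) on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)  # continuity correction (assumed here)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail for 1 degree of freedom
    return chi2, p_value


# Each tuple is (concordant, discordant).
brain = (13, 3)           # 13 of 16 brain-specific proteins, as stated in the text
other_tissue = (13, 18)   # assumed: about 42% of the 31 other tissue-specific proteins
no_annotation = (85, 95)  # assumed: about 47% of the 180 proteins without tissue annotation

chi2_a, p_a = chi2_2x2(*brain, *other_tissue)
chi2_b, p_b = chi2_2x2(*brain, *no_annotation)
print(f"brain vs other tissue-specific: chi2={chi2_a:.2f}, p={p_a:.3f}")
print(f"brain vs not tissue markers:    chi2={chi2_b:.2f}, p={p_b:.3f}")
```

With these assumed counts the sketch gives p values of roughly 0.024 and 0.019. Without the continuity correction (`yates=False`) both p values are smaller, so the choice of correction matters when comparing against a reported threshold.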
The Most Significantly Changed Proteins in HD CSF Based on Proteomics Data—
Using the sums of z-scores, which estimate protein changes between disease states across labs (Table VI), we selected the 20 most increased and the 20 most decreased proteins in HD CSF (relative to controls). This selection is naturally biased toward proteins that are observed by many labs and that have consistent trends of change in HD CSF.
Integration of these most altered proteins in HD CSF with the tissue expression data shows that, although most of the increased proteins are liver-specific, only three of the decreased proteins are brain-specific (Table VII). This result is expected given that the method we used to generate this list of 40 proteins is biased toward more abundant proteins and given the above result suggesting that most liver-specific proteins are abundant and increased in HD CSF, whereas most brain-specific proteins are decreased but not all of them are abundant enough to be selected. However, some of the decreased proteins may come from other substructures of the brain. For example, CHGB has the highest mRNA level in the mouse. TTR, ENPP2, and GGH are choroid plexus-specific genes according to the mouse array data and Allen Brain Atlas data. Moreover, although not exclusively expressed in the brain, MEGF8, ALDOC, ENPP2, ENDOD1, and PRNP have their highest mRNA expression levels in the brain. In addition, TTR, CHGB, and PAM have high mRNA expression in the brain when compared with the median expression level. It is possible that a majority of these proteins found in CSF were derived from the brain.
Assessing Cross-lab Comparability—
As shown above, the labeled methods of Lab 2 and Lab 4 have the greatest overlap in both protein identification and quantitation (Table III), and 301 protein groups have been quantitated by more than one method (Fig. 1). Questions we addressed include the concordance of protein ratios and the concordance of each laboratory with the overall trends of brain- and liver-specific protein changes identified by the integrated analysis.
Scatter plots of relative protein abundance ratios between two disease states across different proteomics methods are shown in Fig. 3. The apparently low proteome-wide correlation of protein ratios should be expected given the nature of the data we were interrogating. Specifically, whenever high dimensional analyses such as proteomics (or genomics) are used for comparisons, as in our example, one expects that most proteins do not change between the two conditions (HD versus control). For these proteins the sources of variation are random (from experiments) and so will not correlate across laboratories. Only proteins that changed systematically as a result of differences between cases and controls are expected to be concordant. Thus, instead of inspecting the correlation of all data points in Fig. 3, one should focus on the ratios of higher magnitude and determine the concordance among them, as these should be enriched for proteins with a systematic change. The trend of changes is rather consistent among laboratories for the 40 most altered proteins, especially for the increased proteins, which are more abundant and were quantified by more laboratories (Table VI). Concordance was strongest among laboratories identifying the largest number of proteins.
| DISCUSSION |
|---|
From a practical standpoint, the protein changes identified in CSF of Huntington disease patients are candidate biomarkers that may be useful for tracking HD progression or as surrogate end points in clinical trials. Integrating results between laboratories provided confirmation of both protein identification and quantitation. Establishing the generality of our findings will require further validation with additional CSF samples using targeted technologies such as multiple reaction monitoring and ELISA rather than shotgun proteomics.
Before a candidate protein can serve as a biomarker, it is important to understand its role in the pathophysiology of the disease process. In the case of Huntington disease this is not yet possible because the exact sequence of pathological events downstream from expression of mutant huntingtin protein remains elusive. However, the predominant view is that the most clinically important signs and symptoms of HD relate to neurodegeneration and dysfunction in the brain. The hallmark neuropathology of HD is degeneration of medium spiny neurons in the striatum accompanied by extensive astrocytosis (47) and microgliosis (12). Gene expression profiling of postmortem human HD brain has been performed using striatal tissue as well as cerebellum and two cortical areas (2, 6). The known pathology and gene expression changes provide a perspective from which to view the proteomic changes.
To identify the most probable origin of the proteins detected in CSF, we queried a published microarray survey of gene expression in human tissues. Although the assumptions and methods used were somewhat crude and only constitute circumstantial evidence for the source tissue, this created a very biologically plausible list of "tissue markers." Integrating the expression data tissue markers with the CSF proteomics results revealed that many of the most abundant CSF proteins were probably derived from the liver. This is consistent with the known origin of CSF, which is a complex filtrate of blood produced by the choroid plexus. Importantly many proteins likely to have been derived from the brain were also detected in CSF. This was not seen in a re-examination of a protein component list for blood plasma. In plasma, liver-specific proteins are also highly over-represented, whereas very few brain-specific proteins are detected. The over-representation of brain-specific proteins in CSF supports the hypothesis that it is feasible to monitor some aspects of the health of the brain using CSF. This has important implications for biomarker discovery in neurological disease. However, proving that these proteins are derived from brain tissue is difficult and requires some type of labeling experiment.
The overall trends between changes in concentrations of brain-specific proteins and their corresponding mRNAs in HD brain tissues were consistent. This provides further support for the hypothesis that the concentration of some proteins in HD CSF may provide a window into the health of the brain and that CSF may be a fruitful source of HD biomarkers. However, to the extent that the candidate brain-specific proteins can be traced to a particular region of the brain, most appear to be cerebellum- and cortex-specific rather than specific to the striatum. This may be because the greater mass and surface area of the cortex and cerebellum provide more exposure to the CSF.
Two trends in the data were noticeable: brain-specific proteins tended to decrease in concentration in HD CSF, whereas most proteins detected at higher concentrations in HD CSF were functionally associated with the immune system. The latter may relate to astrocytosis, microgliosis, and neuroinflammation. Neuroinflammation is a common component of many neurodegenerative diseases, and these changes are unlikely to be specific to Huntington disease. Thus, although we did not detect a large number of protein changes that can be directly linked to striatal degeneration in HD, we did detect a general trend for brain-related proteins to decrease in HD CSF, and we detected increases in proteins that may reflect neuroinflammatory processes. Both trends are consistent with the known neuropathology of HD and bolster the biological relevance of our findings.
However, an interesting alternative mechanism may cause or contribute to the increase of blood-derived proteins and decrease of brain-derived proteins in HD CSF. Disruption of the blood-brain barrier is widely accepted in inflammatory conditions such as neurosystemic lupus erythematosus (48, 49) and multiple sclerosis (50, 51) and increasingly in conditions traditionally seen as degenerative with secondary neuroinflammation, such as Alzheimer disease (52). Interestingly, microglial activation and the presence of inflammatory cytokines could alter the properties of brain microvascular endothelial cells and the tight junctions that link them (53, 54), raising the possibility that the blood-brain barrier is also disturbed in Huntington disease patients. This hypothesis is consistent with the observed differences of brain- and blood-derived proteins in HD CSF. The integrity of the blood-brain barrier in HD could be examined by detecting changes of brain proteins in HD versus normal plasma or more directly using magnetic resonance imaging.
Because of the large variation among multiple laboratories, no general consensus can be reached with regard to the specific ranking of proteins or their magnitude of change. In our experiments, statistical significance was not established for the top ranked individual proteins by traditional FDR (55) but rather by interrogating the rankings of protein sets using GSEA (28) methods, where the gene sets were derived from external transcriptional data. We also found that the results in this study are most strongly supported by the laboratories that obtained the greatest depth of protein coverage. Our results suggest that additional biomarker studies should focus on designs that obtain the greatest depth of coverage and that interrogate the data with externally derived hypotheses, perhaps from data integration. Our study could not have led to a positive finding without these advantages.
In summary, we provide a comprehensive profiling of the human HD CSF proteome. The integration of the proteomics data with various genomics data supports the idea of CSF as a rich source of biomarkers for neurological diseases. For Huntington disease in particular we derived a list of proteins that are altered in HD CSF and that have the potential to be used as a specific signature of HD progression.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, November 4, 2008, DOI 10.1074/mcp.M800231-MCP200
The comprehensive list of proteins has been deposited in the PRoteomics IDentifications (PRIDE) database under accession number 3701.
1 The abbreviations used are: HD, Huntington disease; CSF, cerebrospinal fluid; IPI, International Protein Index; AMT, accurate mass and time; LTQ-FT, hybrid linear ion trap-Fourier transform ICR mass spectrometer designed by Thermo Finnigan; LTQ OrbiTrap XL, mass spectrometer by Thermo Finnigan that is based on LTQ XL™ linear ion trap and the patented Orbitrap™ technology; HCT-Ultra ion trap, High-Capacity Trap (HCT™) mass spectrometer system by Bruker Daltonics; mzXML, XML (extensible markup language)-based common file format for proteomics mass spectrometric data; FDR, false discovery rate; MAS5, Affymetrix microarray suite version 5 software; GSEA, gene set enrichment analysis; IS, internal standard; BA, Brodmann area; MA plots, intensity-dependent ratio plots.
* This work was supported by the High Q Foundation, Inc. The Canary Foundation provided financial support for publishing and disseminating data.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
To whom correspondence should be addressed: Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., M2-B230, Seattle, WA 98109. Tel.: 206-667-4612; Fax: 206-667-7264; E-mail: mmcintos{at}fhcrc.org
| REFERENCES |
|---|