Molecular & Cellular Proteomics 7:981-994, 2008.
| ABSTRACT |
|---|
30% of proteins that were identified by 2D PAGE-MALDI-TOF/TOF. Roughly 10% of all detected proteins were derived from hypothetical or predicted gene models or were entirely unannotated. Comparison of protein expression by 2D DIGE revealed that proteins involved in energy production and transcription/translation were relatively more abundant at 72 hpf, consistent with faster synthesis of cellular proteins during organismal growth at this time compared with 120 hpf. The data are accessible in a database that links protein identifications to existing resources including the Zebrafish Information Network database. This new resource should facilitate the selection of candidate proteins for targeted quantitation and refine systematic genetic network analysis in vertebrate development and biology.
Zebrafish genetics has been facilitated substantially by the increasing availability of genome sequence information (Sanger Institute's D. rerio sequencing project) as assembly of the 1.7-Gb genome sequence nears completion. The utility of the genome sequence increases with the quality of annotation of protein-encoding genes. Genome annotations are supported by alignments of experimentally documented transcript or protein sequences specific for the zebrafish genome, by alignments of homologous transcript or protein sequences, and ab initio by computational gene prediction (10). While computational predictions tend to be rather imprecise, many putative zebrafish genes have human orthologs, and large regions of synteny exist between human and zebrafish chromosomes, underlining the relevance of the zebrafish to the analysis of higher vertebrates (11). However, conservation across species is not limited to protein-coding regions (12); thus, these annotations are not compelling. Currently 17,300 genes have been annotated based on the highest quality of evidence (species-specific transcript data such as expressed sequence tags), 2500 are unknown or predicted on the basis of evidence from closely related species, and 1500 genes are computationally predicted (Ensembl Assembly Zv7, April 2007 (13)). Direct detection of translated peptides by tandem mass spectrometry methods has proved valuable in the annotation of genomes as they allow prediction or refinement of gene structures and can resolve difficulties in annotating from cDNA alternative splice forms and overlapping gene sequences (14–18).
Genetic screens for perturbed phenotypes have generated numerous mutants with defects in pathways affecting development, physiology, and behavior (6). Many mutant phenotypes are reminiscent of human disease and have proved invaluable in deciphering signaling cascades implicated in cardiovascular (3, 19), renal (20), and gastrointestinal biology (21) and in cancer (22). Initial forward genetic screens have phenotyped embryos based on morphology as optical clarity during external zebrafish embryogenesis facilitates visual analysis during their rapid development (6). More recently, assays visualizing biological functions in vivo such as the metabolic processing of fluorescent marker lipids (23) or the regenerative response to injury (24) and assays detecting dysregulated gene expression (25) have discovered mutant phenotypes not primarily amenable to visual inspection. Metabolic screens based on the accurate quantitation of biomarkers (e.g. using highly sensitive MS) that have been performed in chemically mutated mice (26, 27) have yet to be applied to zebrafish (28). Studies into the genetics of protein expression are largely constrained by the availability of specific antibodies. Mass spectrometry-based proteomics methods have the potential to overcome these hurdles; however, this requires first an accurate characterization of proteins accessible to targeted quantitative analysis (29).
We applied mass spectrometric proteomics methodology and statistical analysis (30) to create profiles of proteins expressed during zebrafish embryonic development. A major problem in two-dimensional chromatography-tandem mass spectrometry ("shotgun") proteomics approaches is that many protein identifications have low reproducibility if the sensitivity of detection is not carefully balanced against rates of false identification error by using multiple replicates and rigorous statistics (31). Thus, we applied a novel empirical Bayesian algorithm to integrate data sets from multiple search programs and multiple biological replicates (30). The data are accessible in a fully searchable database (see supplemental data for the URL database link under Instructions for Downloading) that links protein identifications to existing resources including the Zebrafish Information Network (ZFIN) database (32).
| EXPERIMENTAL PROCEDURES |
|---|
Two-dimensional Chromatography and Mass Spectrometry—
Off-line fractionation and LC-MS/MS were performed as described previously (30). The sample was loaded onto a PolySulfoethyl A column (100 mm × 4.6 mm, 5 µm, 300 Å; PolyLC, Columbia, MD) at a flow rate of 0.2 ml/min with mobile phase A (10 mM ammonium formate, 25% acetonitrile, pH 3). A linear gradient for 80 min was run to 100% mobile phase B (500 mM ammonium formate, 25% acetonitrile, pH 6.8). Sixty 1.6-min fractions were collected with a Foxy Jr. (Dionex, Sunnyvale, CA) automated fraction collector. Fractions with low peptide concentration were combined to yield a total of 40 fractions, which were lyophilized and stored at –80 °C. Lyophilized peptides were reconstituted with 0.1% formic acid, 5% acetonitrile for reversed phase liquid chromatography onto a Thermo Finnigan linear ion trap mass spectrometer (Thermo Finnigan, San Jose, CA) using electrospray ionization. Each fraction was injected onto a Vydac C18 column (Everest 150 mm × 1 mm inner diameter, 300 Å, 5 µm; Bodmann, Aston, PA) with mobile phase A (0.1% formic acid, 0.01% trichloroacetic acid in water). A gradient (30 µl/min) was run over 180 min from 3 to 70% mobile phase B (0.1% formic acid, 0.01% trichloroacetic acid in acetonitrile). Nitrogen was used as the sheath (75 p.s.i.) and auxiliary (10 units) gas with the heated capillary at 180 °C. The mass spectrometer was operated in a data-dependent MS/MS mode (m/z 300–2000) in which the top seven ions were subjected to fragmentation at 27% normalized CID energy. Dynamic mass exclusion was enabled with a repeat count of 2 every 45 s for a list size of 250.
Analysis of LC-MS/MS Data—
The data sets of each sample were searched against the International Protein Index (IPI) D. rerio protein sequence database (version 3.07; number of protein sequences, 45,388; number of amino acid residues, 23,104,717) for peptide sequences using two independent algorithms, SEQUEST 3.1 (ThermoFinnigan, San Jose, CA) and MASCOT 2.1.04 (Matrix Sciences, Boston, MA). Raw mass spectra were converted to DTA peak lists using BioWorks Browser 3.2 (ThermoFinnigan) with the following parameter settings: peptide mass range, 300–5000 Da; threshold, 10; precursor mass, ±1.4 Da; group scan, 1; minimum group count, 1; minimum ion count, 15. Searches specified that peptides should have a maximum of two internal tryptic cleavage sites with methionine oxidation and cysteine carbamidomethylation as possible modifications. SEQUEST searches specified that peptides should possess at least one tryptic terminus and used a peptide mass tolerance of ±1.4 Da and a fragment ion tolerance of 0. MASCOT searches specified tryptic digestion and used a peptide mass tolerance of ±1.5 Da and a fragment ion tolerance of ±0.1 Da. The search results were converted into pepXML format. Peptide identification probabilities for both SEQUEST and MASCOT searches were calculated by executing PeptideProphet as implemented in the Trans-Proteomic Pipeline version 2.8 (Institute for Systems Biology, Seattle, WA) (34). SEQUEST results were processed using the "-Ol" tag, which uses ΔCn* values unchanged.
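The search constraint of at most two internal tryptic cleavage sites can be illustrated with a small in-silico digest. This is a minimal sketch, not the logic of SEQUEST or MASCOT: the function names are hypothetical, and the rule that trypsin does not cleave before proline is a common convention assumed here rather than something stated in the text.

```python
def cleavage_pieces(sequence):
    """Split a protein sequence C-terminal to K or R.

    Assumption: no cleavage when the next residue is P (a common
    convention for trypsin; the search engines may differ).
    """
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal remainder
    return pieces


def tryptic_peptides(sequence, max_missed=2):
    """List fully tryptic peptides with up to max_missed internal sites."""
    pieces = cleavage_pieces(sequence)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            # Joining j - i + 1 adjacent pieces leaves j - i uncleaved sites.
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


if __name__ == "__main__":
    for peptide in tryptic_peptides("AKRPGKCD", max_missed=2):
        print(peptide)
```

Raising `max_missed` enlarges the search space, which is one reason the number of candidate peptides, and with it the chance of random matches, grows with more permissive search settings.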
Results from both searches (SEQUEST and MASCOT) and all biological replicates were combined in a single statistical analysis of protein expression per developmental stage using Empirical Bayes Protein Identifier (EBP) 1.0 as described previously (30). Briefly, EBP estimates both sensitivity and false identification rate and has been validated empirically for analysis of zebrafish linear ion trap mass spectrometry data using a reversed/forward sequence database search approach (30). EBP combines the probabilities of correct peptide identification across multiple peptide searches using a function that returns the maximum probability from consensus identifications and penalizes non-consensual identifications. Different charge states of the same peptide are treated as a single identification. The statistical model parameters include protein length, estimated protein abundance, the size of the search database, and the number of peptide sequence identifications in the data set. For each protein in the database, an expression probability is estimated using an "expectation-maximization" algorithm. Replicates are integrated by simultaneously estimating multiple sets of model parameters. Peptides whose sequence matches multiple proteins are integrated in the analysis using "Occam's razor," a principle by which the smallest set of probable proteins is chosen that is sufficient to explain the peptide sequence identifications. When proteins cannot be reliably distinguished by unique peptides, they are reported as a protein group. EBP analyses were run on combined SEQUEST and MASCOT results and biological replicates for each developmental stage using its default settings except for the calculation of the number of trypsin digests per protein, which specified peptides with at least one tryptic terminus. The default settings specify that only peptide identifications with probabilities greater than 0.5 are used in the calculation of protein identification probabilities. This is equivalent to using only those spectra for which SEQUEST and MASCOT reached consensus for the most likely peptide identification. Only proteins with expression probabilities corresponding to a false identification rate of less than 0.01 (1%) were reported. This was equivalent to expression probabilities of p > 0.77 and p > 0.87 in the data sets for 72 and 120 hpf, respectively (Fig. 1). Spectra of protein identifications that met these criteria but were based on a single unique peptide were manually inspected. Thirty-eight such protein identifications were excluded after inspection. All single unique peptide identifications that remained in the data set are summarized in supplemental Table 3 as specified by the MCP data submission guidelines. Graphical representations of the corresponding annotated MS/MS spectra were extracted from the Trans-Proteomic Pipeline (Institute for Systems Biology, Seattle, WA) and the EBP plug-in using the Ruby 1.8.5 scripting language. Thus, result files were parsed for the peptide sequence with the highest PeptideProphet probability score for each charge state. Up to three such spectra were extracted if multiple spectra shared the highest probability value. HTML pages were created to display the resulting spectra images, and hyperlinks that open these pages were included in supplemental Table 3. These pages and the table are included in the compressed (zipped) file (Lucitt_Supplemental_Table_3.zip) that installs the table and correct subfolder structure for viewing the linked MS/MS spectra when unpacked.
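Two ideas in this paragraph lend themselves to a short illustration: choosing the smallest protein set that explains the peptides, and translating a false identification rate bound into a probability cutoff. The sketch below is not the EBP implementation. It assumes a greedy set-cover heuristic for the parsimony step and assumes that the expected false identification rate of an accepted list is the mean of (1 - p) over its members; all names are hypothetical.

```python
def parsimonious_proteins(peptide_to_proteins):
    """Greedy approximation of the smallest protein set explaining all peptides."""
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)
    if not protein_to_peptides:
        return []
    unexplained = set(peptide_to_proteins)
    chosen = []
    while unexplained:
        # Take the protein covering the most unexplained peptides;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # remaining peptides map to no protein
        chosen.append(best)
        unexplained -= covered
    return chosen


def fdr_cutoff(probabilities, max_fdr=0.01):
    """Lowest probability cutoff with expected FDR <= max_fdr, or None.

    Assumption: expected FDR of an accepted list = mean of (1 - p).
    """
    cutoff, error_sum = None, 0.0
    for n, p in enumerate(sorted(probabilities, reverse=True), start=1):
        error_sum += 1.0 - p
        if error_sum / n <= max_fdr:
            cutoff = p
    return cutoff


if __name__ == "__main__":
    evidence = {"pep1": ["A", "B"], "pep2": ["A"], "pep3": ["C"]}
    print(parsimonious_proteins(evidence))
    print(fdr_cutoff([1.0, 0.99, 0.99, 0.5]))
```

In the toy input, protein B is dropped because protein A already explains its only peptide, which mirrors how shared peptides are attributed to the smaller set of proteins.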
Image Analysis—
Images were analyzed with Progenesis PG200 software (Nonlinear Dynamics, Newcastle, UK) according to the manufacturer's instructions applying a "cross-stain analysis" on the DIGE gels. Thus, two multiplex groups (groups of images derived from the same gel) were defined as follows: multiplex group one: gel 1, 72-hpf Cy3, 120-hpf Cy5, and internal control Cy2; multiplex group two: gel 2, 72-hpf Cy5, 120-hpf Cy3, and internal control Cy2. Three replicate groups were defined as replicate 1 (72-hpf Cy3 from gel 1 and 72-hpf Cy5 from gel 2), replicate 2 (120-hpf Cy5 from gel 1 and 120-hpf Cy3 from gel 2), and replicate 3 (internal standard labeled with Cy2 from gels 1 and 2). A reference gel, based on an internal standard image, was automatically selected by the software using the default settings. The maximum number of gels in which a spot was allowed to be absent within a replicate group was set to 0. Spot detection, background subtraction, warping, matching, and normalization were all set at the default settings of the software. Where possible, unmatched spots were edited on each multiplex group based on a three-dimensional view of the spot; afterward, normalization was restored, and the reference gels were updated. Differences in average normalized volume between 72 and 120 hpf of 3-fold or more were considered for protein identification.
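The dye-swap design and the 3-fold selection rule can be sketched as follows. This is an illustration only: it assumes that a spot's normalized volume is its volume divided by the Cy2 internal standard volume on the same gel, which is a simplification and not necessarily how Progenesis PG200 normalizes. The function names and the example volumes are hypothetical.

```python
from statistics import mean


def stage_ratio(gel1, gel2):
    """Return the 72-hpf / 120-hpf ratio of average normalized spot volume.

    Each gel is a dict mapping dye name to raw spot volume.
    Dye assignment follows the text: gel 1 carries 72 hpf on Cy3 and
    120 hpf on Cy5; gel 2 swaps the dyes; Cy2 is the internal standard.
    """
    volume_72 = mean([gel1["Cy3"] / gel1["Cy2"], gel2["Cy5"] / gel2["Cy2"]])
    volume_120 = mean([gel1["Cy5"] / gel1["Cy2"], gel2["Cy3"] / gel2["Cy2"]])
    return volume_72 / volume_120


def is_differential(ratio, threshold=3.0):
    """True when the change is at least threshold-fold in either direction."""
    return ratio >= threshold or ratio <= 1.0 / threshold


if __name__ == "__main__":
    spot_gel1 = {"Cy3": 900.0, "Cy5": 300.0, "Cy2": 600.0}  # made-up volumes
    spot_gel2 = {"Cy5": 800.0, "Cy3": 200.0, "Cy2": 400.0}  # made-up volumes
    ratio = stage_ratio(spot_gel1, spot_gel2)
    print(ratio, is_differential(ratio))  # 3.5 True
```

Averaging across the swapped dyes is what lets a dye-specific labeling bias cancel out rather than appear as a stage difference.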
Protein Identification from 2D Gels—
A total of 164 spots considered differentially regulated based on the above criteria and 379 spots that were considered not to be differentially regulated were excised for identification by MALDI-TOF/TOF. Spots of interest were robotically excised into 96-well plates using an Ettan Spot Picker (GE Healthcare). Gel plugs were washed with 100 µl of Milli-Q water for 15 min and three times with 100 µl of 25 mM ammonium bicarbonate, 50% ACN for 30 min while Vortex mixing. Plugs were then dehydrated in 100% ACN for 10 min and allowed to air dry. This was followed by reduction with 10 mM DTT in 50 mM ammonium bicarbonate at 60 °C for 30 min followed by alkylation with 100 mM iodoacetamide in 50 mM ammonium bicarbonate for 45 min at room temperature in the dark. Wash steps as mentioned above were repeated, and gel plugs were dehydrated with 100% ACN. Twenty micrograms of sequencing grade modified trypsin (Promega) was solubilized in 40 mM ammonium bicarbonate, 5% ACN to a concentration of 20 ng/µl. Ten microliters of the trypsin solution was added to each plug and allowed to rehydrate the gel plugs on ice for 30 min and then incubated at 37 °C overnight. Digestion buffer was removed to a new 96-well plate, and 50 µl of 1% TFA in 50% ACN was added to the gel plugs and sonicated for 30 min. This extract was removed and combined with the digestion buffer and dried in a SpeedVac concentrator (Jouan, RC1022, Thermo Savant, Milford, MA) for 45 min. Peptides were then resuspended in 15 µl of 0.5% TFA in Milli-Q water. Peptides were solid phase-extracted (Millipore reverse phase ZipTipC18) according to the manufacturer's instructions. Samples were eluted into a 96-well plate with 4 µl of a 0.1% TFA, 50% ACN solution.
One microliter of the eluate was premixed with 2 µl of α-cyano-4-hydroxycinnamic acid matrix (3 mg/ml in 10 mM ammonium phosphate, 50% acetonitrile, 0.1% TFA) and spotted in duplicate on a MALDI target plate (Opti-TOF® 192-well insert, Applied Biosystems, Foster City, CA). MALDI-TOF MS and tandem TOF/TOF MS were performed on a Voyager 4700 instrument (Applied Biosystems). Thus, two peptide mass fingerprint (PMF) spectra per gel spot were generated from separate MALDI plate wells. The spectra were acquired in the reflector mode by averaging 3000 laser shots per spectrum (mass range, 800–4000 Da; focus mass, 2000 Da). Spectra were smoothed (Gaussian filter width, 9; target resolution at 1300 m/z, 20,000) for internal calibration to trypsin autolytic peptides (m/z 842.510, 1045.564, 1940.935, 2211.105, 2239.136, 2299.179, and 2807.300), and only peaks that exceeded a signal-to-noise ratio of 100 (local noise window, 200 m/z) and a half-maximal width of 2.9 bins were considered. A minimum of two monoisotopic trypsin peaks were required to calibrate each spectrum to a mass accuracy within 20 ppm. Failure to meet these criteria resulted in the application of the external plate calibration that was performed prior to each run and required matching of six standard peptide ion masses (m/z 904.468, 1296.685, 1570.677, 2093.087, 2465.199, and 3657.929) from six calibration spots (4700 Mass Standard kit, catalog number 4333604, Applied Biosystems). The laser power for PMF acquisition was adjusted to produce an average intensity of 7000 for the m/z 2093.087 standard ion (ACTH-(1–17)) across the six calibration spots prior to each run. Data-dependent MS/MS analyses, using PSD on one replicate PMF spectra set and CID on the other replicate, were performed on the 15 most abundant peptide ions (excluding trypsin autolysis ions) to generate amino acid sequence information. MS/MS spectra were integrated over 3000 laser shots in the 1-kV positive ion mode with the metastable suppressor turned on. Air at the medium gas pressure setting (1.25 × 10⁻⁶ torr) was used as the collision gas in the CID on mode. An internal calibration of MS/MS spectra was attempted on at least two ions of the immonium ion series and the y1 ions of arginine, lysine, and histidine (m/z: Arg immonium, 70.066, 87.081, 100.088, and 112.088; Arg y1, 175.119; Lys immonium, 84.081, 101.108, and 129.103; Lys y1, 147.113; His immonium, 110.072 and 138.067) or reverted to the external calibration, which was performed prior to each PSD or CID run on four fragmentation ions of Glu1-fibrinopeptide B (m/z: precursor, 1570.677; y1, 175.120; y4, 480.257; y6, 684.347; y9, 1056.475). The laser intensity for the MS/MS spectra acquisition was adjusted to an intensity of 4000 of the y9 ion (m/z 1056.475) prior to each run.
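The internal calibration rule (at least two trypsin autolysis peaks, 20 ppm accuracy, otherwise fall back to external calibration) can be sketched in a few lines. This is not the instrument software's algorithm. It assumes a 200 ppm search window for locating calibrant peaks and a linear least-squares mass correction, and all function names are hypothetical; only the reference m/z values come from the text.

```python
TRYPSIN_AUTOLYSIS_MZ = [842.510, 1045.564, 1940.935, 2211.105,
                        2239.136, 2299.179, 2807.300]


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_calibrants(peaks, references, window_ppm=200.0):
    """Pair each reference mass with the nearest peak inside the window."""
    pairs = []
    if not peaks:
        return pairs
    for ref in references:
        nearest = min(peaks, key=lambda mz: abs(mz - ref))
        if abs(ppm_error(nearest, ref)) <= window_ppm:
            pairs.append((nearest, ref))
    return pairs


def internal_calibration(peaks, references=TRYPSIN_AUTOLYSIS_MZ,
                         min_calibrants=2, max_ppm=20.0):
    """Return recalibrated peaks, or None to signal external calibration."""
    pairs = match_calibrants(peaks, references)
    if len(pairs) < min_calibrants:
        return None
    n = len(pairs)
    mean_obs = sum(o for o, _ in pairs) / n
    mean_ref = sum(r for _, r in pairs) / n
    # Least-squares line mapping observed m/z onto reference m/z.
    sxx = sum((o - mean_obs) ** 2 for o, _ in pairs)
    sxy = sum((o - mean_obs) * (r - mean_ref) for o, r in pairs)
    slope = sxy / sxx
    intercept = mean_ref - slope * mean_obs
    residuals = [abs(ppm_error(slope * o + intercept, r)) for o, r in pairs]
    if max(residuals) > max_ppm:
        return None
    return [slope * mz + intercept for mz in peaks]


if __name__ == "__main__":
    shift = 1 + 50e-6  # simulate a uniform +50 ppm miscalibration
    spectrum = [842.510 * shift, 1500.000 * shift, 2211.105 * shift]
    print(internal_calibration(spectrum))
```

Returning `None` instead of an uncorrected spectrum makes the fallback explicit, so the caller has to decide to apply the external plate calibration.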
The Global Proteome Server (GPS) Explorer 3.5 build 321 software (Applied Biosystems) was used to extract peaks from raw spectra using the following settings: MS peak filtering: mass range, 800–4000 Da; minimum signal-to-noise ratio, 10; peak density filter, 50 peaks/200 Da; maximum number of peaks, 65; MS/MS peak filtering: mass range, 60–20 Da below precursor mass; minimum signal-to-noise ratio, 10; peak density filter, 50 peaks/200 Da; maximum number of peaks, 65. A combined MS peptide fingerprint and MS/MS peptide sequencing search was performed against the IPI D. rerio version 3.07 database (number of protein sequences, 45,388; number of amino acid residues, 23,104,717) using the MASCOT 2.1.04 search algorithm. These searches specified trypsin as the digestion enzyme and allowed for carbamidomethylation of cysteine, partial oxidation of methionine residues (all variable modifications), and one missed trypsin cleavage. The monoisotopic precursor ion tolerance was set to 50 ppm, and the MS/MS ion tolerance was set to 0.05 Da. The output was limited to the 10 best hits. MS/MS peptide spectra with a minimum ion score confidence interval of 95% were accepted; this was equivalent to a median ion score cutoff of 27 in the data set. Protein identifications were accepted with a statistically significant MASCOT protein search score (threshold, 65) that corresponded to an error probability of p < 0.01 in our data set. All possible protein identifications from replicate analyses that met the above criteria were reported for each gel spot. However, the protein identification with the highest score was selected in the case of redundant protein identifications.
The raw mass spectra were exported to mzXML using the PzMsXML script (Nathan Edwards, University of Maryland Center for Bioinformatics and Computational Biology, College Park, MD). Annotated PMF spectra were produced by combining the spectra file formats for raw and processed peaks, mzXML and Mascot generic format, respectively. Peak annotations and modification information for identified peptides were extracted from the result summary table (supplemental Table 4). The Ruby scripting language was used to parse these files, send the spectra and annotation information to the R statistical tool (The R Project for Statistical Computing) for plotting, and create the HTML result pages. Annotated spectra for the tandem mass spectrometry experiments were obtained by transforming the dynamic MASCOT Web pages into static content using Ruby and saving them locally. Hyperlinks to both PMF and MS/MS pages are included in supplemental Table 4. These pages and the table are included in a compressed (zipped) file (Lucitt_Supplemental_Table_4.zip) that installs the table and correct subfolder structure for viewing the linked spectra when unpacked.
Protein Classification—
Proteins were classified using the Gene Ontology (GO) functional annotations for cellular component, molecular function, and biological process (35). Annotation categories were taken from level three in the GO trees. GO enrichment analysis was conducted by calculating for each category the probability that the number of annotations in the protein list could have arisen by chance, assuming an underlying hypergeometric distribution (36). Pathway analysis (Ingenuity Systems, Redwood City, CA) was used to search for enrichment of proteins in canonical and metabolic signaling pathways. IPI protein sequences were BLAST searched against the RefSeq human and mouse protein sequence databases, and BLAST results were used for mapping in the Ingenuity Systems pathway database.
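The enrichment probability under a hypergeometric model can be written out directly. The sketch below computes the upper-tail probability of seeing at least the observed number of annotated proteins in a list drawn from a background. The function and parameter names are hypothetical, and the choice of background (for example, all annotated IPI entries) is an assumption that the text does not specify.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: number of proteins in the background
    annotated:  background proteins carrying the GO category
    sample:     number of proteins in the identified list
    hits:       identified proteins carrying the GO category
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 background proteins, 4 annotated, 3 drawn, 2 annotated hits.
    print(enrichment_p_value(10, 4, 3, 2))  # 40/120, about 0.333
```

Because many GO categories are tested at once, a correction for multiple testing would normally be applied to such p values; whether and how that was done here is not stated in the text.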
Zebrafish Proteomics Database—
A database was constructed by parsing IPI records, GenBank™ records, the NCBI taxonomy, and GO ontology into a BioSQL relational database schema. The schema was extended to include the experimental result data and key word searching capabilities and was optimized for the Web application. The web site itself was constructed using the Ruby on Rails web application framework. The zebrafish proteomics database can be accessed on line (see supplemental data for the URL database link under Instructions for Downloading).
| RESULTS |
|---|
This approach identified 1112 unique proteins at 72 hpf and 867 unique proteins at 120 hpf with false identification rates of less than 1% and sensitivities of 91.7 and 88.2% (Fig. 1 and supplemental Tables 1, 2, and 3). An additional 45 proteins at 72 hpf and 31 proteins at 120 hpf were as likely to be expressed but were indistinguishable from homologous proteins based on the peptide evidence. Eighty-six percent of the identified proteins at 72 hpf and 82% at 120 hpf were based on gene models derived from transcript or protein sequences specific for the zebrafish genome (Ensembl Assembly Zv7, April 2007 (13)). Hypothetical proteins or proteins predicted by comparison with other genomes constituted 13% of the detected proteins at 72 hpf and 17% at 120 hpf.
The separation of proteins at the peptide level by 2D LC-MS/MS may preclude discrimination of homologous proteins, such as distinct isoforms or modified forms of a protein. Gel-based proteomics techniques allow more readily the distinction of similar proteins based on their migration pattern in the electrical field. Thus, we ran protein samples from both developmental stages on 2D gels. In total, 348 unique proteins at 72 hpf and 317 unique proteins at 120 hpf were identified from 2D gels using MALDI-TOF/TOF tandem mass spectrometry with an error probability of less than 0.01 (Fig. 2 and supplemental Table 4). Approximately 85% of the detected proteins were annotated at the highest level of quality, and 15% were hypothetical or predicted proteins with similarity to sequences of other species (Ensembl Assembly Zv7, April 2007 (13)).
N2 and βB1). Proteins regulating developmental processes included proteins such as β-catenins 1 and 2, staufen homolog 2, and kelch-like 1.
About 50% of all identified proteins were detected at both embryonic stages (Fig. 3). Interestingly, a large fraction of proteins were exclusively identified by 2D PAGE (248 and 231 at 72 and 120 hpf, respectively) but not by 2D LC-MS/MS. Only a total of 97 proteins at 72 hpf and 86 at 120 hpf were detected by both 2D PAGE and 2D LC-MS/MS at either stage (Fig. 3). Thus, approximately 70% of the proteins identified on gels were not detected by 2D LC-MS/MS. A potential explanation for this discrepancy is that the gel separation method may favor proteins that are not well digested in solution.
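The fraction of gel-identified proteins missed by 2D LC-MS/MS follows from the counts reported above. The short calculation below uses only those counts (348 and 317 gel identifications; 248 and 231 gel-only identifications); the function name is hypothetical.

```python
def gel_only_fraction(gel_total, gel_only):
    """Fraction of gel-identified proteins not seen by 2D LC-MS/MS."""
    return gel_only / gel_total


if __name__ == "__main__":
    reported = {"72 hpf": (348, 248), "120 hpf": (317, 231)}
    for stage, (total, only) in reported.items():
        print(f"{stage}: {100 * gel_only_fraction(total, only):.1f}% gel-only")
    # 72 hpf: 71.3% gel-only
    # 120 hpf: 72.9% gel-only
```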
Proteins involved in energy production and metabolism (muscle-specific creatine kinase, IPI00507087; L-lactate dehydrogenase B chain, IPI00495855; aldolase c fructose-bisphosphate, IPI00490850; creatine kinase mitochondrial 2, IPI00485952; and enolase 3 protein, IPI00490877) were between 5- and 50-fold less abundant at 120 hpf than at 72 hpf. Transcription/translation proteins (ribosomal protein SA, IPI00508284; RNA binding motif protein 4, IPI00615024; eukaryotic translation initiation factor 3, IPI00496845; and heterogeneous nuclear ribonucleoprotein that binds to nascent RNA polymerase II transcripts and plays a role in both transcript-specific packaging and alternative splicing of pre-mRNAs, IPI00491050) were 4–18-fold more abundant at 72 hpf than at 120 hpf, consistent with the faster synthesis of cellular proteins during organismal growth at the earlier developmental stage. Similarly, prohibitin (IPI00480889), chaperonin containing TCP1 subunit 5 (IPI00498630), and heat shock protein Hspd1 (IPI00508003), which are all involved in cell cycle control, were more abundant (5–10-fold) at 72 hpf than at 120 hpf. All four lens proteins (crystallin N2, IPI00495773; β-crystallin B1, IPI00502990; β-crystallin A4, IPI00490966; and β-crystallin A1–2, IPI00504818) were more prominent (6–19-fold) at the earlier stage relative to total protein, consistent with the relatively larger volume of the eyes in comparison with the whole organism at this stage. Three embryonic proteins, novel β-type globin (IPI00513361), novel protein similar to embryonic 1 (IPI00502256), and novel protein similar to vertebrate apurinic/apyrimidinic endonuclease (APEX) (IPI00498781), were also decreased at 120 hpf (3–9-fold).
Apart from the structural proteins, several proteins that were more abundant at 120 hpf than at 72 hpf fell into the hypothetical/predicted or unknown categories. Other up-regulated proteins were ubiquitin C (IPI00619743; 9-fold) and ribosomal protein S27a (IPI00510181; 9-fold), which are involved in targeting cellular proteins for degradation (Table I).
Pathway Membership of Detected Proteins—
We sought to categorize protein identifications using a pathway enrichment analysis based on a database that describes signaling pathways and network relationships (Ingenuity Systems). As this resource does not include D. rerio sequences, we performed a BLASTP search of all protein identifications against the RefSeq human and mouse databases. This resulted in 731 RefSeq cross-references for 120 hpf (83% of total identifications from IPI) and 883 for 72 hpf (80% of total identifications from IPI). The network and pathway database contained functional annotations for 461 proteins at 120 hpf (55% of original protein identifications) and 561 proteins at 72 hpf (51% of original identifications). These were analyzed for their membership in collated canonical and metabolic signaling pathways. A total of 163 proteins for each time point, 120 and 72 hpf, were mapped to a canonical signaling pathway. Metabolic pathway information existed for 366 proteins for 120 hpf and 397 proteins for 72 hpf (supplemental Tables 5 and 6).
The distribution of proteins mapped to the canonical pathways is illustrated in Fig. 4. Pathways with the most contributing proteins were related to calcium, integrin, extracellular signal-regulated kinase (ERK)/mitogen-activated protein kinase, and vascular endothelial growth factor signaling. Proteins associated with morphogenesis such as the WNT/β-catenin pathway were less prominent but present at both 120 and 72 hpf (not shown). Indeed the developmental stages were relatively similar in their functional associations with the notable exception of the calcium signaling pathway, which was detected at 120 hpf (23 proteins) but was absent at 72 hpf.
BLASTP analysis of protein sequences identified from 2D PAGE resulted in 235 RefSeq cross-references for 120 hpf (73%) and 262 cross-references for 72 hpf (75%). A total of 86 proteins for 120 hpf (26%) and 96 proteins at 72 hpf (27%) were annotated with canonical and signaling pathway information in the Ingenuity Systems pathway database. Fewer pathways were detected compared with 2D LC-MS/MS identifications. However, the pathway profile was again similar between the stages (Fig. 4 and supplemental Fig. 1).
A second approach to the functional analysis of the embryonic zebrafish proteins was based on GO annotations. Annotated proteins were categorized into the broad GO classes biological process, molecular function, and cellular component. A graphical representation of these categories for each embryonic stage is shown in Fig. 5 and supplemental Fig. 2 for 72 hpf and supplemental Fig. 3 for 120 hpf. The protein identifications at both 120 and 72 hpf were categorized similarly using GO. Approximately 30% of proteins had GO annotation to cellular metabolism, 13% had GO annotation to transport, 4% had GO annotation to cell organization and biogenesis, and 2% had GO annotation to translation/transcription and signal transduction. About 1% of proteins were annotated with functions relating to morphogenesis, cell differentiation, and development. Structural molecule activity, a category that is often over-represented in proteomics analyses, was associated with 8% of the proteins at 120 hpf and 6% at 72 hpf. Enzyme inhibitor activity, signal transducer activity, and motor activity all had 1% or less associated proteins at both 120 and 72 hpf. Cellular component information was unavailable for 60% of proteins. Association to organelles was 20%, association to intracellular localization was 10%, and association to membrane localization was 8% at both stages.
Zebrafish Proteomics Database—
A relational database was constructed by combination of IPI, GenBank, the NCBI taxonomy, GO assignments, and the experimental data. A Web application interface was developed for user-friendly ad hoc queries of the sequence annotation as well as perusal of the experimental and data mining results. The zebrafish proteomics database is available on line. The source code and associated database are also available for download at the site (see supplemental data for the URL database link under Instructions for Downloading).
| DISCUSSION |
|---|
We used a bimodal strategy, shotgun proteomics and 2D PAGE, to analyze expressed zebrafish proteins during two advanced stages of development. The approach was informed by a number of considerations including (i) the selection of developmental stages that will likely be screened for dysregulated protein expression in large scale mutagenesis or chemical screens, (ii) the simplicity of sample preparation and analytical methodology (mindful of their possible adaptation to high throughput screening), (iii) the eminent quality of mass spectrometric protein identifications, (iv) the accessibility of the protein data in a fully searchable database and their integration with existing genomics and genetics resources provided to the zebrafish research community through the ZFIN database (32), and (v) the comprehensive categorization of proteins by functional classes to facilitate the selection of candidate proteins, for example for the design of targeted quantitative mass spectrometry assays (29). Our approach yielded 1384 unique proteins at 72 or 120 hpf by 2D LC-ESI-MS/MS and 477 unique proteins at 72 or 120 hpf by 2D PAGE-MALDI-TOF/TOF, which showed an overlap of about 30%. More unique proteins were identified by LC-MS/MS at 72 hpf (1112 proteins) than at 120 hpf (867 proteins), although equal amounts of protein were analyzed. While the precise cause for this disparity remains unknown, this may reflect developmental differences in the complexity of the protein samples favoring a larger number of high confidence identifications in the less developed embryo.
Shotgun proteomics studies produce hundreds of thousands of mass spectra derived from fragmented peptide ions that include the amino acid sequence information. These sequences are typically inferred automatically by matching the fragmentation ion spectra to theoretical or empirical spectra in peptide sequence databases, a process that may generate large numbers of false identifications (31) even if the error rate is small. Abundant proteins are generally detected with a high degree of confidence, whereas many lower abundance protein identifications have low reproducibility (31). Combining multiple biological and/or technical replicates markedly improves the sensitivity of protein identifications from MS/MS data, and both sensitivity and specificity are enhanced further by a combined spectra analysis with complementary database search algorithms, such as SEQUEST plus MASCOT (30, 47). These algorithms assess the plausibility of the inferred peptide sequences in distinct ways: SEQUEST uses heuristic metrics of the fit between the measured and theoretical spectra, whereas MASCOT estimates the probability that the fragment ions in the spectrum could be generated by chance from sequences in the database. Thus, we integrated data from both biological replicates and such "orthogonal" dual database searches to maximize the confidence in the protein identifications. We used an algorithm, EBP, that accurately estimates sensitivity and the false identification rate for complex protein samples and uses a function to combine data from replicate samples and multiple database search algorithms (30). From 327,906 potential peptide sequence calls, this strategy identified proteins at both developmental stages with false identification rates of less than 1%.
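The point that a small error rate still yields many false identifications can be made concrete with a short calculation. The sketch below is not the EBP algorithm; it simply multiplies a number of calls by an assumed per-call error rate. The function name is hypothetical and the error rates are illustrative assumptions, while the call count is the figure quoted above.

```python
def expected_false_calls(n_calls: int, error_rate: float) -> float:
    """Expected number of false calls if each call is wrong with probability error_rate."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return n_calls * error_rate


n_calls = 327_906  # potential peptide sequence calls quoted in the text
for rate in (0.001, 0.01, 0.05):  # assumed per-call error rates, for illustration
    print(f"error rate {rate:.1%}: about {expected_false_calls(n_calls, rate):,.0f} false calls")
```

Under these assumptions, even a 1% per-call error rate would correspond to several thousand false calls, which is why error control at the protein level is essential.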
Roughly 10% of the identified proteins were hypothetical, predicted, or unannotated proteins (based on Ensembl Assembly Zv7, April 2007 (13)). Thus, our analysis provides evidence in support of high quality annotations for numerous additional zebrafish genes. Comprehensive annotation of protein-coding genes remains challenging (48). Indeed, most annotation pipelines, including the automated Ensembl pipeline (13) and the manual Vertebrate Genome Annotation (VEGA) pipeline (49) used for zebrafish, require confirmation of computationally predicted genes by independent evidence and/or manual validation for highest quality annotation. The additional evidence can take the form of experimentally documented transcription within the species (such as expressed sequence tags) or conservation across distant organisms. Indeed, computational gene finding increasingly incorporates cross-species homology between closely related genomes to produce improved gene models (50). However, this evidence may not be sufficient as conservation across species is not limited to protein-coding regions (12). Similarly, alternative splicing and overlapping genes present particularly complex annotation problems, and some estimates suggest that the majority of genes undergo alternative splicing (51, 52). Direct mass spectrometric identification of peptide sequences can resolve such ambiguities (40) as it generates an independent line of evidence at the translational level with error sources distinct from nucleotide-based approaches. Here we provided rigorous peptide level evidence for approximately 1500 genes of the zebrafish genome. Given an estimated total number of zebrafish genes of about 22,000–23,000, our absolute coverage is in the range of 7%. As such peptide level information adds an important element to the repertoire of available annotation evidence (18), a larger collaborative effort, similar to the human and mouse proteome organizations, to enhance markedly the coverage of the zebrafish proteome seems warranted. Indeed, one might expect that proteomics studies will routinely accompany vertebrate genome annotation projects in the future, just as microbial genome annotation projects have been accelerated by proteomics investigations (14, 15).
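The coverage estimate follows from simple arithmetic on the figures quoted above, as the following sketch shows. The function name is hypothetical, and the gene counts are the rounded values from the text rather than exact totals.

```python
def coverage_percent(n_identified: int, n_total: int) -> float:
    """Percentage of genes with peptide-level evidence."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_identified / n_total


identified = 1500  # rounded number of genes with peptide-level evidence (from the text)
for total in (22_000, 23_000):  # estimated range of zebrafish gene counts (from the text)
    print(f"{identified} of {total:,} genes: {coverage_percent(identified, total):.1f}%")
```

Both ends of the estimated range give a value between 6.5% and 7%, consistent with the stated coverage of about 7%.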
The applicability of protein mass spectrometry in zebrafish has been explored in 2D PAGE-based investigations, which have detected small sets of zebrafish proteins mass spectrometrically with variable measures of error control (53–57). A more comprehensive shotgun proteomics investigation focused on the identification of proteins in livers of adult fish as a resource for toxicological studies (58). Here we provide comprehensive 2D PAGE and shotgun proteomics identifications obtained from whole zebrafish embryos. A likely application of these data is the design of targeted quantitative mass spectrometry assays that might be used in mutagenesis or chemical screens. Targeted quantitative proteomics approaches are based on stable isotope dilution LC/multiple reaction monitoring-MS methods (59–64) and allow quantitation of compounds with high specificity and precision (29, 63). Although the sensitivity of immunoassays for protein quantitation still exceeds that of most mass spectrometry assays, stable isotope analogs normalize for selective losses of analytes and act as carriers for trace amounts of analytes subjected to complex isolation procedures (29). The development of such assays, however, requires detailed prior knowledge of (i) which proteins are expressed in the sample and are reliably detectable, (ii) which peptides are uniquely diagnostic for the targeted proteins, and (iii) under which experimental conditions they can be detected (i.e. in which strong cation exchange chromatography fractions they elute). Our data set provides this information.
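Requirement (ii), finding peptides that are uniquely diagnostic for a protein, can be illustrated with a minimal in silico digest. The sketch below is not the procedure used in this study. It assumes a simplified trypsin rule (cleavage after K or R unless followed by P), no missed cleavages, and no modifications, and it judges uniqueness only against the sequences supplied. The function names and the two toy sequences are hypothetical.

```python
import re
from collections import defaultdict


def tryptic_peptides(sequence: str, min_length: int = 6) -> set:
    """Digest a sequence after K or R, except before P, with no missed cleavages."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {piece for piece in pieces if len(piece) >= min_length}


def unique_peptides(proteins: dict, min_length: int = 6) -> dict:
    """Map each protein ID to the peptides that occur in no other supplied protein."""
    owners = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, min_length):
            owners[peptide].add(protein_id)
    result = {protein_id: set() for protein_id in proteins}
    for peptide, ids in owners.items():
        if len(ids) == 1:
            result[next(iter(ids))].add(peptide)
    return result


# Toy sequences invented for this example; they are not zebrafish proteins.
proteins = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "protB": "MRTAYIAKGGSPEDLKAAVNQWR",
}
for protein_id, peptides in unique_peptides(proteins).items():
    print(protein_id, sorted(peptides))
```

In this toy case the peptide `TAYIAK` occurs in both sequences and is therefore excluded, whereas the remaining peptides of sufficient length are reported as diagnostic for a single protein. A practical assay design would also have to weigh detectability, which this sketch does not model.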
Analysis of known or predicted protein functions within the data set revealed a similar representation of protein classes relevant for cell function at both developmental stages, including proteins related to structure, transcription/translation, cell cycle, nucleotide metabolism, ion transport, and carbohydrate, energy, and lipid metabolism. Proteins associated with organ systems such as the central nervous system, heart, and skeletal muscle were represented at both stages. Analysis of relative expression changes revealed that proteins involved in energy production, transcription/translation, and cell cycle control were relatively more abundant at 72 hpf, consistent with the faster synthesis of cellular proteins during organismal growth at this time compared with 120 hpf. A large fraction of the identified proteins, greater than 50% for both data sets, lacked functional information such as Gene Ontology classifications. More than 40% and 60% had no information relating to "molecular and biological function" and "cellular processes," respectively. Therefore, all protein assignments at both stages were aligned with sequences in the human or mouse RefSeq protein databases. This revealed alignment of 83% at 120 hpf and 80% at 72 hpf. However, these homologous sequences also had poor annotation in Ingenuity Pathway analysis. Thus, many of the identified proteins may represent candidates for the exploration of their protein functions.
Our large scale proteome analysis of embryonic zebrafish tissue revealed expression of previously uncharacterized proteins and detected developmentally regulated functional protein classes. The data are accessible on line in a fully searchable database that links protein identifications to existing resources including the ZFIN database (32). This new resource should allow the selection of candidate proteins for targeted quantitation (29) in mutagenesis and chemical screens and may refine systematic genetic network analysis in vertebrate development and biology.
| FOOTNOTES |
|---|
Published, MCP Papers in Press, January 22, 2008, DOI 10.1074/mcp.M700382-MCP200
1 The abbreviations used are: ZFIN, Zebrafish Information Network; EBP, Empirical Bayes Protein Identifier; GO, Gene Ontology; hpf, hours postfertilization; HSP, heat shock protein; IPI, International Protein Index; 2D, two-dimensional; PMF, peptide mass fingerprint; ACTH, adrenocorticotropic hormone; BLAST, Basic Local Alignment Search Tool.
* This work was supported, in whole or in part, by National Institutes of Health Grant HL 62250 (to G. A. F.). This work was also supported by the American Heart Association (National Scientist Development Grant 0430148N to T. G.) and the Higher Education Authority of Ireland (to M. B. L.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Raw mass spectral data have been submitted to the Proteomics Identifications (PRIDE) database (1) under accession numbers 2030–2045, 2047–2128, 2154–2353, 2383–2413.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
** The A. N. Richards Professor of Pharmacology.
The McNeil Professor in Translational Medicine and Therapeutics. To whom correspondence may be addressed: Inst. for Translational Medicine and Therapeutics, University of Pennsylvania, 153 Johnson Pavilion, 3620 Hamilton Walk, Philadelphia, PA 19104-6084. Tel.: 215-898-1184; E-mail: garret{at}spirit.gcrc.upenn.edu
To whom correspondence may be addressed: Inst. for Translational Medicine and Therapeutics, University of Pennsylvania, 153 Johnson Pavilion, 3620 Hamilton Walk, Philadelphia, PA 19104-6084. Tel.: 215-898-1184; E-mail: tilo{at}itmat.upenn.edu
| REFERENCES |
|---|
αA-crystallin expression prevents γ-crystallin insolubility and cataract formation in the zebrafish cloche mutant lens. Development 133, 2585–2593