Molecular & Cellular Proteomics 2:1297-1305, 2003.
© 2003 by The American Society for Biochemistry and Molecular Biology, Inc.

From Cellzome AG, Meyerhofstrasse 1, 69117 Heidelberg, Germany
| ABSTRACT |
|---|
1:2000 can be envisaged for this approach. Identified proteins range from 4 to 553 kDa in size, cover the pI range between 3.4 and 12.8, and include 255 proteins with predicted transmembrane domains. Repeated analysis of peptides derived from the same gel band showed that the reproducibility of nanocapillary liquid chromatography-MS/MS of such complex mixtures is about 60–70%, suggesting that a particular analytical experiment would need to be repeated about three times to arrive at a representative estimate of the set of highly abundant proteins in a given proteome. Given its technical simplicity, sensitivity, and wealth of generated information, we have adopted this experimental approach to characterize every cell line and tissue that is the subject of experimentation in our laboratory. The combined dataset for the six cell lines consists of 2341 non-redundant human proteins and thus constitutes one of the largest collections of human proteomic data published to date.
The large scale analysis of gene expression patterns required for the establishment of such a database can in theory be performed at the level of mRNA or proteins; however, mRNA-based approaches such as high density oligonucleotide arrays (1) and serial analysis of gene expression (2) are generally not set up to provide a measure for the absolute number of transcripts of a particular gene. In addition, it has been shown in yeast that the correlation between mRNA and protein levels is insufficient to predict protein expression levels from quantitative mRNA data (3). As a result, direct protein profiling is required to approach a description of the protein composition of a particular cell under a given set of physiological conditions.
Nowadays, large scale protein profiling experiments are dominated by the use of mass spectrometry (MS)1 techniques for protein identification such as peptide mass fingerprinting (4) or peptide sequencing using tandem mass spectrometry and database searching (5–7). To reduce sample complexity, a number of separation steps on the protein or peptide level are usually employed prior to protein identification. Traditionally, two-dimensional polyacrylamide gel electrophoresis (2D PAGE) has been the method of choice in expression proteomics studies because it features very high protein separation capacity and because the visualized protein spot pattern provides additional semiquantitative information for comparative studies. Many 2D reference gels of bacterial, fungal, and mammalian proteomes are publicly available (www.expasy.org/ch2d/), but even though high quality 2D gels display thousands of protein spots, these typically comprise multiple isoforms and technical artifacts of no more than 200–300 highly abundant different gene products identified by microsequencing or mass spectrometry. Conference reports suggest that this figure might increase to 1000 genes if subcellular fractionation is used prior to 2D PAGE, but such data has not yet appeared in the public domain. Further limitations of 2D PAGE for general protein profiling of cell lines and tissues include the fact that the technique is still quite technically demanding and the observation that proteins with extreme biochemical properties (size, isoelectric point) as well as certain protein classes (e.g. transmembrane proteins) have been notoriously underrepresented in datasets from standard profiling experiments (8).
More recently, omitting a protein separation step altogether and instead combining two orthogonal chromatography steps and tandem mass spectrometry (LC/LC-MS/MS) to resolve and analyze peptides following in-solution digestion of proteins (9) has become increasingly popular. Controlling this technology is not trivial, but recent reports indicate great potential. For example, 1484 yeast proteins (9), 2363 rice proteins (10), 2415 plasmodium proteins (11), and 1610 rat proteins (12) have been identified using this approach, suggesting that 2D LC-MS/MS is much more powerful for the global characterization of proteins from cells and tissue compared with 2D PAGE. However, such high proteome coverage does come at a price, notably sensitivity. Typically, shotgun identification experiments require 0.5–5 mg of total protein. Although this will often not present a real limitation when working with cell lines or tissue, such quantities might not be available from needle biopsy material, collected by laser capture microdissection or from biochemical fractionation of cellular contents. From an analytical point of view, abandoning separation on the protein level and instead employing 2D LC separations of complex peptide mixtures might suffer from the fact that peptides from highly abundant proteins take up a large proportion of the available analytical space in both chromatographic dimensions, which should limit the dynamic range available for protein identification and thus might compromise the identification of proteins with low expression levels. This problem has been addressed by reducing sample complexity using isotope-coded affinity tag (ICAT) technology (13). Although the selective modification and isolation of cysteine-containing peptides in combination with LC/LC-MS/MS analysis allowed the identification of 986 yeast proteins (14), this three-dimensional chromatographic approach did not really constitute a benefit over the 2D approach reported by Peng et al. (15), who managed to identify 1504 yeast proteins without the use of the isotope-coded affinity tag step. Possibly, this is the result of restricting the accessible proteome space to proteins that contain a cysteine residue within a tryptic fragment in the mass range typically used for protein identification (800–2500 Da). About 14% of all human proteins do not have any cysteine-containing peptides in that mass range, and a further 19% only possess one such peptide. In addition, the amount of starting material for an isotope-coded affinity tag experiment is also in the mg range, which restricts the application of the approach to experiments in which relatively large amounts of proteins are available.
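The cysteine restriction discussed above can be explored with a simple in silico digest. The following is a minimal sketch, not the procedure used to derive the percentages quoted here: it assumes the conventional trypsin rule (cleavage after K or R, but not before P), no missed cleavages, unmodified cysteine, and monoisotopic masses. All function names are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def cysteine_peptides_in_range(sequence, low=800.0, high=2500.0):
    """Tryptic peptides that contain cysteine and fall in the mass window."""
    return [
        p for p in tryptic_peptides(sequence)
        if "C" in p and low <= peptide_mass(p) <= high
    ]
```

Applying `cysteine_peptides_in_range` to every entry of a sequence database and counting the entries that return zero or one peptide would give figures comparable in spirit to those quoted above, although the exact values depend on the digestion and modification assumptions.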
An approach that constitutes an interesting compromise between the advantages and shortcomings of the methods mentioned above is a combination of 1D PAGE protein separation and nanocapillary LC-MS/MS analysis (GeLC-MS/MS) of in-gel-generated peptides for protein identification. It is technically simple in nature and combines decent protein separation capability that also captures those proteins typically not accessible via 2D PAGE (notably large proteins and those with transmembrane domains) and the well established excellent sensitivity of gel-based protein identification using mass spectrometry for samples of low complexity (16). Recent examples from the literature indicate that this approach might also be viably applied for the analysis of complex protein mixtures as shown by the identification of 1289 plasmodium proteins (17) and of 271 proteins from the nucleolus (18). In this paper, we report on the application of GeLC-MS/MS to the analysis of core proteomes of six human immortalized cell lines as well as on the effect of subcellular fractionation on the total yield of protein identifications. The dataset currently includes cytoplasmic core proteomes of HEK293 (embryonic kidney cells), SKNBE2 (neuroblastoma), SW480 (colon carcinoma), HeLa (cervical adenocarcinoma), HeLaS3 (a clonal derivative of HeLa), and HepG2 (hepatocellular carcinoma) cells as well as a nuclear preparation from HEK293 cells. We identified between 268 and 1111 non-redundant cytoplasmic proteins for each cell line utilizing only 50 µg of total protein. The resulting information has proven very valuable in our laboratory for discriminating between specific protein signals and background in various types of affinity-based protein purification/identification experiments.
| EXPERIMENTAL PROCEDURES |
|---|
Mass Spectrometry and Data Analysis
One-third of the total tryptic digest sample was subjected to 60 min of data-dependent nanocapillary reversed-phase LC-MS/MS analysis using self-packed 75-µm inner diameter columns (Reprosil, Maisch) on nanoLC systems (CapLC, Waters; Ultimate, LC Packings) coupled to quadrupole time-of-flight (QTOF) instruments (QTOF Ultima, QTOF Micro, QTOF II, Waters). Data-dependent acquisition was performed using three MS/MS channels and no exclusion time. Where possible, measurements were repeated up to two times using the remaining amount of sample.
Proteins were identified by automated database searching (Mascot Daemon, Matrix Science) against an in-house curated version of the monthly updated International Protein Index protein sequence database (IPI, versions 2.5–2.18, European Bioinformatics Institute, www.ebi.ac.uk/IPI/). This minimally redundant yet maximally complete compilation of entries from Swiss-Prot, TrEMBL, RefSeq, and Ensembl was complemented with frequently observed non-human contaminants and viral proteins expressed by the examined immortalized cell lines (e.g. expression of E1B protein from human adenovirus type 5 (Swiss-Prot accession number P03243) in HEK293 cells). Search parameters were as follows: MS and MS/MS tolerance of 0.4 Da, tryptic specificity allowing for up to 3 missed cleavages and K/R-P cleavages, fixed modification of carbamidomethylation of cysteine, and variable modification of oxidation of methionine. Results were read into an Oracle database for further data analysis and comparison of identified protein sets using standard database querying tools. To ensure the highest possible quality of identification, protein hits with Mascot scores between 30 and 80 were evaluated by visual examination of the corresponding MS/MS spectra. In the case of identifications of multiple protein database entries based on the same set of peptides, only a single entry (highest molecular weight) was considered. Identifications based on peptides comprising a subset of a larger set of peptides used for identification of another database entry were not included. Comparative analysis of protein identification sets was performed on the level of database accession numbers as well as on clusters of 97% sequence identity to minimize effects caused by the ongoing revision of protein database content. Prediction of transmembrane domains was done using TMHMM 2.0 (20), and isoelectric points were calculated using EMBOSS (21).
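The two redundancy rules described above (one entry per identical peptide set, and exclusion of entries whose peptides are a subset of another entry's peptides) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the database queries actually used: the input layout and the function name `filter_identifications` are hypothetical.

```python
def filter_identifications(hits):
    """Reduce redundant protein hits.

    hits: dict mapping accession -> (set of peptide sequences,
          molecular weight in kDa). This layout is an assumption.
    Returns the sorted list of accessions that are kept.
    """
    # Rule 1: entries identified by the same peptide set collapse to
    # the single entry with the highest molecular weight.
    best = {}
    for accession, (peptides, mol_weight) in hits.items():
        key = frozenset(peptides)
        if key not in best or mol_weight > best[key][1]:
            best[key] = (accession, mol_weight)

    # Rule 2: drop entries whose peptide set is a proper subset of the
    # peptide set of another remaining entry.
    kept = [
        accession
        for peptides, (accession, _mw) in best.items()
        if not any(peptides < other for other in best)
    ]
    return sorted(kept)
```

For example, with entries `P1` and `P3` sharing peptides `{a, b, c}`, entry `P2` identified by `{a, b}`, and `P4` by `{d}`, only the heavier of `P1`/`P3` and `P4` would be retained.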
| RESULTS AND DISCUSSION |
|---|
The identified core proteomes cover a wide range of biochemically diverse proteins in terms of size (4.4–553.0 kDa), isoelectric point (3.41–12.76), and presence of up to 16 transmembrane domains (TMDs, Tables I and S1–S7). Fig. 1 shows a comparison between the distribution of protein size in the protein sequence database and that obtained for all proteins identified here. The two distributions match quite closely for proteins larger than 20 kDa indicating that GeLC-MS/MS does provide a fair representation of the protein content of a cell. However, very small proteins tend to be underrepresented, and that trend becomes more severe the smaller the protein is. This can of course partly be attributed to the fact that small proteins may run off the bottom of the gel. However, our analysis shows that proteins as small as 4 kDa were identified, and the same effect is indeed observed in shotgun protein identification studies that do not involve the use of gels (Ref. 9 and data not shown). An alternative explanation is of course that small proteins yield few peptides for identification, but a third factor should also be considered, a potential overrepresentation of small proteins in the sequence database. This can arise from the fact that the protein sequence database contains a large number of entries that are predicted from genomic sequences. Small genes are difficult to assign with good confidence from genomic sequences (22), and as a result, many small coding sequences in the protein database might be overpredictions of the software tools used for gene finding. When analyzing our dataset for the presence of transmembrane proteins, we find that 11% of all proteins (255) contain at least one transmembrane domain and that 101 proteins contain more than one such domain. 
Although one might have expected a higher incidence of transmembrane proteins (20% of all proteins in IPI contain at least one predicted TMD), it is noteworthy that a relatively large number of transmembrane proteins could be identified here without any specific enrichment for this class of proteins.
40 x 38 = 1520 proteins in a single LC-MS/MS analysis of all gel slices. In practice, we have generally not identified more than 800 proteins in one such analysis. There are two primary reasons why there is a discrepancy between the theoretically possible number of protein identifications and the one that was actually achieved. First, there is no even distribution of the number of proteins across the gel (see Fig. 1). Second, not all proteins are equally abundant. Both aspects are illustrated in Fig. 2. The number of identified proteins/gel band closely follows the protein size distribution shown in Fig. 1. Therefore, comparatively few proteins are identified in the high mass range of the gel, and many proteins are identified in the 20–50-kDa range. However, the number of peptides with which a particular protein is identified remains relatively constant across the gel, which means that the mass spectrometer is probably undersampling in the high mass region of the gel because few peptides make it above the detection threshold, whereas the opposite is the case in the lower parts of the gel, where many more peptide ion signals are competing for measurement time than the instrument can handle. This corresponds well with the observation that the number of acquired peptide spectra per LC-MS/MS run tends to peak in the 50-kDa region. However, the ratio of spectra that contributed to verified protein identification events and the number of all acquired spectra do not show a clear trend over the molecular weight range (data not shown). We, like many investigators before us, find that a large proportion of MS/MS spectra do not match a peptide sequence in the database with any reasonable confidence. The average ratio of MS/MS spectra used for protein identification from all different cell lines was 41% with a standard deviation of 19%. There are many possible explanations for this observation. 
For example, if the intensity thresholds for switching between MS and MS/MS are low, the mass spectrometer will select for precursor ions of which the intensity will never be sufficient to produce an informative MS/MS spectrum. Another important point to consider is the separation performance of the liquid chromatography system. If the chromatographic peak width is small, some ions that are detected in an MS survey spectrum might have an intensity that is too low by the time it is their turn to be fragmented in a particular data acquisition scheme. Hence, the number of precursor ions that are subjected to collision-induced dissociation between survey spectra should be adjusted to the separation characteristics of the LC system. Overall, these factors might introduce a slight bias against smaller yet more abundant proteins, which is probably not so much the case in shotgun identification approaches because the bias introduced by the gel does not apply. One way to deal with this potential bias is to repeat the measurement of the same sample several times to arrive at a fairer representation of the proteins present.
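The spectrum utilization figure quoted above is a simple per-run ratio summarized over runs. A minimal sketch of that bookkeeping follows; the input layout and the function name are hypothetical, and the use of the sample standard deviation is an assumption because the text does not state which estimator was used.

```python
import statistics


def spectrum_utilization(runs):
    """Summarize the fraction of MS/MS spectra that led to identifications.

    runs: list of (acquired_spectra, spectra_used_for_identification)
          tuples, one per LC-MS/MS run (hypothetical layout).
    Returns (mean ratio, sample standard deviation of the ratio).
    """
    ratios = [used / acquired for acquired, used in runs if acquired > 0]
    mean = statistics.mean(ratios)
    # The sample standard deviation needs at least two runs.
    spread = statistics.stdev(ratios) if len(ratios) > 1 else 0.0
    return mean, spread
```

Tracking this ratio per gel slice would also make it easy to check whether utilization drifts with molecular weight, which the data discussed here suggest it does not.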
We observed a somewhat unexpected but significant source of variation in reproducibility that is related to changes in the sequence database against which MS data is searched. Although two rounds of searching of identical primary LC-MS/MS data (here, single analysis of HEK293 cytoplasm) against different versions of the IPI sequence database (versions 2.5 and 2.12) produced a relatively constant number of non-redundant identifications (447 versus 431 IPI entries, 440 versus 424 clusters of 97% identity), the overlap was only 63% on the level of accession numbers and 83% on the cluster level. This is not a feature of this particular sequence database, but rather all sequence databases are constantly changing in size and content as the result of the increasing amounts of available genomic sequence information and more (or less) refined annotation of such sequences. As a result, protein identifications might "disappear" because of the removal of the sequence against which the spectrum was originally matched. Re-searching of primary mass spectrometry data against updated protein databases is not practical when it comes to large and rapidly growing datasets. In a lucky case, this would remove false positive identifications, but there are numerous examples from our laboratory where true protein sequences have been identified with one database version and have disappeared in the next release. Freezing a certain database version for extended periods of time is also not a viable solution because of the risk of missing newly identified gene products. One way out of that problem is to keep an in-house curated master database version in which novel entries of a new database release are added without removing entries that have been eliminated from the new database release. This procedure is, however, not practical for every laboratory because it requires the presence of an appropriate IT infrastructure. 
In addition, small but biochemically and functionally irrelevant changes in protein sequences might lead to an ever increasing database. An alternative is to cluster database entries based on sequence homology (here, 97% identity) to confine sequences for a particular gene product to a small number, which in turn helps to recognize proteins that are functionally identical. The downside of this approach is that homology clustering often cannot cope with subtle differences in amino acid sequence that are easily differentiated by MS, and the approach may thus lead to loss of information on e.g. isoforms or splice variants of gene products. We are currently evaluating clustering algorithms that reflect the information content of primary MS data more accurately and should provide a means to alleviate some of the problems mentioned above.
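Both remedies discussed above, a cumulative master database and comparison on the level of sequence clusters, can be illustrated with a short sketch. The overlap measure used here (shared entries divided by the union of both sets) is an assumption, as the text does not define how the overlap percentages were calculated; the data structures and function names are likewise hypothetical.

```python
def merge_release(master, release):
    """Add the entries of a new database release to the master database.

    master, release: dicts mapping accession -> sequence.
    Entries absent from the new release are deliberately retained.
    """
    merged = dict(master)
    merged.update(release)
    return merged


def overlap(first, second):
    """Shared fraction of two identification sets (intersection / union)."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def to_clusters(accessions, cluster_of):
    """Map accessions to cluster identifiers (e.g. 97% identity clusters).

    Accessions without a cluster assignment form their own cluster.
    """
    return {cluster_of.get(acc, acc) for acc in accessions}
```

Comparing `overlap(a, b)` with `overlap(to_clusters(a, c), to_clusters(b, c))` for two searches of the same raw data shows how much of an apparent disagreement is caused by renamed or replaced accessions rather than by different proteins.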
Compilation of Proteins into Abundance Lists
All proteins identified in the cell lines analyzed are compiled in Tables S1–S7 and sorted by the sum of all Mascot protein scores with which a particular protein was found in the experiments. Mass spectrometry of proteins and peptides is not a quantitative method as such. Therefore, it is difficult to assess the abundance of a particular protein from MS data per se. Nevertheless, there are several empiric indications that can help to estimate the relative abundance of a protein in a mixture. In general, the higher the amount of protein, the higher the MS signal intensity, number of sequenced peptides/sequence coverage, and Mascot score. Any of these measures work fairly well as long as proteins in the mixture are of similar size. If that is not the case, signal intensities are probably a more suitable measure for abundance (24). However, it is not very practical to compute these values for all peptides in very large datasets and acquired on different types of instrumentation. Therefore, we have chosen to rely on the sum of all Mascot protein scores with which a particular protein was found in a 1D PAGE gel lane. We are aware that this tends to overestimate the abundance of large proteins because these generate more peptides that contribute to the total Mascot protein score. However, we believe that this is not a fundamental problem because the protein lists of Tables S1–S7 are compiled from GeLC-MS/MS data for use with GeLC-MS/MS data; thus, the abundance assessment will always be made relative to where a protein was found on a gel. For example, clathrin heavy chain 1 (IPI00024067.1, 192 kDa) appears with a total Mascot score of 2161, whereas Ras GTPase-activating-like protein IQGAP1 (IPI00009342.1), a protein of almost identical size (189 kDa), is listed with a score of 250 (Table S1). Hence it is fair to assume that the former protein is much more abundant than the latter. 
We have also found that the bias toward larger proteins introduced by using total Mascot score for abundance classification is not overly large. When plotting total Mascot score versus protein size for all proteins of a cell line (data not shown), the maximum of the score distribution is found at about 40 kDa, which is not that far off the maximum of the size distribution of all identified proteins of 2530 kDa (Fig. 1). In addition the "usual suspects" for abundant proteins, like cytoskeletal proteins, chaperones, and metabolic enzymes, populate the top positions of all lists.
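The ranking by summed Mascot score amounts to a simple aggregation over the gel slices of one lane. A minimal sketch is given below; the tuple layout and the function name `abundance_ranking` are hypothetical, and ties are broken alphabetically by accession purely for determinism.

```python
from collections import defaultdict


def abundance_ranking(slice_hits):
    """Rank proteins by the sum of their Mascot protein scores.

    slice_hits: iterable of (gel_slice, accession, mascot_score) tuples,
                one per protein hit in one LC-MS/MS run of a gel slice.
    Returns a list of (accession, total_score), highest total first.
    """
    totals = defaultdict(float)
    for _gel_slice, accession, score in slice_hits:
        totals[accession] += score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Because a protein may smear across neighbouring slices, summing over the whole lane rather than taking the best single-slice score gives credit for all of its signal.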
Reported total Mascot scores in Tables S1–S7 span values between 30 and 10,000. The respective range for individual gel slices (LC-MS/MS runs) is 30–3000. Clearly, all proteins in Tables S1–S7 are abundant cellular constituents, but the range of scores suggests that the absolute abundance range might be in the order of 1:100. Again, it should be stressed that the purpose of the sorting exercise is to reflect an overall trend in the dataset rather than an abundance criterion for individual proteins. Nevertheless, proteins that are closer to the top of Table S1 can be expected to be more prone to contributing to unspecific protein background in affinity purification experiments than those at the bottom of the list. It should also be kept in mind that even when identifying over a thousand proteins from a cell line or tissue using GeLC-MS/MS (or LC/LC-MS/MS for that matter), the depth of proteome coverage is still very limited considering that it is estimated that the expression levels of proteins in cells span six orders of magnitude.
Comparative Analysis of Datasets from Different Cell Lines
It is a well known fact that different cell lines and tissues express different sets of proteins, which at least in part is a reflection of their specialized physiological roles. However, one would expect that a considerable fraction of that part of the proteome that is primarily occupied with housekeeping functions ought to be abundantly expressed in all cell types. When comparing our data (protein clusters of 97% sequence identity), we were surprised by the fact that only 104 of the total number of 1543 non-redundant cytoplasmic protein clusters were found to be shared between all cell lines analyzed. The overall overlaps between cell lines (Table II) are largely independent of whether proteins are among the top 10, 20, 50, or 100 proteins on the list in Tables S1–S7 (data not shown). 50% of the proteins are shared among HEK293, SKNBE2, and SW480 cells. For SKNBE2 and SW480 cells, the respective overlap is 30%. Cell line-specific cytoplasmic clusters (i.e. proteins that were exclusively found in a single cell line) account for 36% of all identified proteins in HEK293, 23% in SKNBE2, 22% in HeLaS3, 21% in HepG2, 13% in SW480, and 6% in HeLa cells. Obviously, the fraction of unique identifications increases with the size of the datasets for a particular cell line, and it should be borne in mind that the absence of a protein in a particular dataset may just be because the expression level of that protein was below the detection limit of the system. Nevertheless, there are numerous examples of highly abundant proteins that are found in a cell line-specific manner. Examples include the dopamine monooxygenase precursor (IPI00012890.1) and the neuron-specific calcium-binding protein hippocalcin (IPI00145135.1), which were both found exclusively in the neuroblastoma cell line SKNBE2. Given the substantial differences in relative expression levels of proteins in core proteomes of different cell lines, we would recommend analyzing every cell line or tissue used for affinity purification approaches in the way described here for the purpose of defining abundance-related protein background that is meaningful for a particular source of protein.
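The comparison described above reduces to set operations on the cluster identifiers found in each cell line. The sketch below shows one way to compute the clusters shared by all cell lines and the cell line-specific fraction; the dictionary layout and the function names are hypothetical, and at least one dataset is assumed to be present.

```python
def shared_clusters(datasets):
    """Clusters found in every cell line.

    datasets: dict mapping cell line name -> set of cluster identifiers.
    """
    return set.intersection(*(set(ids) for ids in datasets.values()))


def unique_fraction(datasets):
    """Fraction of each cell line's clusters found in no other cell line."""
    result = {}
    for name, ids in datasets.items():
        own = set(ids)
        # Union of the clusters of all other cell lines.
        others = set().union(
            *(set(v) for other, v in datasets.items() if other != name)
        )
        result[name] = len(own - others) / len(own) if own else 0.0
    return result
```

As noted in the text, the unique fraction grows with dataset size, so these numbers are best read alongside the number of clusters identified per cell line.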
On a cautionary note, one should be aware of the fact that protein abundance is only one, albeit important, factor contributing to the phenomenon of protein background in affinity purification and other experiments. It by no means excludes the possibility that a very abundant protein might serve a very specific function. For example, HSP90 appears repeatedly among the highest scoring proteins in our datasets, but at the same time, it has been shown to be required in conjunction with cdc37 for activation of the IκB kinase complex in tumor necrosis factor-α signaling (26).
We have adopted the use of GeLC-MS/MS to characterize every cell line and tissue that is subject to biochemical experimentation in our laboratory. The resulting information has been integrated into our in-house protein sequence database and has become an important tool for the assessment of experimental results of biochemical experiments. Although the data of this study is available as supplementary information and can be used for the same purpose, we would also encourage our colleagues to go through this worthwhile and limited effort for their favorite source of protein and to share the data with the scientific community.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, October 6, 2003, DOI 10.1074/mcp.M300087-MCP200
1 The abbreviations used are: MS, mass spectrometry; MS/MS, tandem mass spectrometry; 1D, one-dimensional; 2D, two-dimensional; LC, liquid chromatography; LC/LC, two orthogonal liquid chromatography steps; GeLC-MS/MS, 1D PAGE protein separation followed by nanocapillary LC-MS/MS analysis; IPI, International Protein Index; TMD, transmembrane domain; bis-Tris, 2-[bis(2-hydroxyethyl)amino]-2-(hydroxymethyl)propane-1,3-diol.
2 M. Schirle, M.-A. Heurtier, and B. Kuster, manuscript in preparation.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains Tables S1–S7.
To whom correspondence should be addressed. E-mail: bernhard.kuster{at}cellzome.com
| REFERENCES |
|---|