Precise and Parallel Characterization of Coding Polymorphisms, Alternative Splicing, and Modifications in Human Proteins by Mass Spectrometry

The human proteome is a highly complex extension of the genome wherein a single gene often produces distinct protein forms due to alternative splicing, RNA editing, polymorphisms, and posttranslational modifications. Such biological variation compounded by the high sequence identity within gene families currently overwhelms the complete and routine characterization of mammalian proteins by MS. A new data base of human proteins (and their possible variants) was created and searched using tandem mass spectrometric data from intact proteins. This first application of top down MS/MS to wild-type human proteins demonstrates both gene-specific identification and the unambiguous characterization of multifaceted mass shifts (Δm values). Such Δm values found from the precise identification of 45 protein forms from HeLa cells reveal 34 coding single nucleotide polymorphisms, two protein forms from alternative splicing, and 12 diverse modifications (not including simple N-terminal processing), including a previously unknown phosphorylation at 10% occupancy. Automated protein identification was achieved with a median expectation value of 10⁻¹³ and often occurred simultaneously with dissection of diverse sources of protein variability as they occur in combination. Top down MS therefore has a bright future for enabling precise annotation of gene products expressed from the human genome by non-mass spectrometrists.

The human proteome is highly complex, often encoding multiple protein forms for a given gene (1). This biological complexity poses a significant analytical and bioinformatic challenge to the detailed analysis of mammalian proteomes by MS and is exacerbated by the presence of gene families sharing high sequence identity (2,3). Protein modifications are often indicative of changes in cellular or tissue dynamics and therefore play central roles in regulation of the cell cycle or development of disease. Whether for new diagnostics or understanding molecular mechanisms in cell biology, protein identification using tryptic peptides has revolutionized the analysis of complex mixtures by mass spectrometry (1,4).
High throughput platforms based on MALDI (5) and ESI use MS/MS engines capable of spectral acquisition at a rate of >10⁴/week (6,7). Recent studies indicate significant inefficiencies associated with such large scale "bottom up" analyses in mammalian systems, including imperfect enzymatic cleavage (8,9) and some MS/MS spectra requiring manual interpretation/validation for identification. Despite these lingering difficulties, peptide analysis provides the best and most general method for large scale protein identification today, although information on nonsynonymous coding single nucleotide polymorphisms (cSNPs), alternative splicing (10), and PTMs remains challenging to obtain (2).
Recent developments by MacCoss et al. (11), Wu et al. (12), and Zhu et al. (13) use three proteases and multidimensional protein identification technology ("MudPIT") or isoelectric focusing, reversed-phase chromatography, and three mass spectrometers (13), respectively, to obtain mass information on ~70-99% of the primary protein structure. Combining intact protein measurement with near exhaustive peptide analysis of five proteins from human cells allowed detection of N-terminal modifications and one alternatively spliced transcript (13). Although cSNP analysis of abundant blood proteins is possible (14), a general informatic strategy has yet to systematically integrate DNA and RNA level data with the MS-based interrogation of the human proteome. This is accomplished here using a data base of human proteins tailored for the "top down" MS approach by combinatorial consideration of protein variability during a search (i.e. "shotgun annotation") (15). Although nucleic acid-based approaches represent the highest throughput and best overall methods for capturing information about SNPs, proteomics-based approaches allow cSNP genotyping concurrent with modification and splice variant identification.
The direct fragmentation of intact protein ions using FTMS now provides expectation values (Pscores) that are orders of magnitude better than those from searches based on tryptic peptides (16-18), a far more efficient and robust reconstruction process for the primary structure of the mature protein, and detection of more diverse mass discrepancies (Δm values) than targeted analysis approaches (e.g. for phosphopeptides). Major limitations of top down MS are that proteins >50 kDa are difficult to handle routinely, that low percent occupancy and multivalent PTMs (such as glycosylations) are difficult to detect, and that only medium scale projects (<200 proteins) from microorganisms have been achieved (19). The top down MS/MS approach using standard fragmentation methods or electron capture dissociation (ECD) has provided 100% coverage with localization of basic PTMs for proteins in Bacteria (17,20), Archaea (16,17), yeast (19,21,22), and a plant (23).
Here we demonstrate unparalleled characterization of human (nuclear) proteins revealing seven different types of modifications in regulation and maturation including a novel phosphoprotein. This was achieved by extending the data base concept of shotgun annotation from a single human histone (15) to a proteomic scale and required the integration of diverse DNA, RNA, and protein level information. This work establishes the basis for routine application of top down MS to capture coding haplotypes within a gene and allele-specific splicing and modification patterns on a far greater number of human proteins.

EXPERIMENTAL PROCEDURES
Cell Culture and Lysate Fractionation-Human HeLa-S3 cells were grown to a density of 0.6 × 10⁶ cells/ml using Joklik's modified minimum essential medium and supplemented with 5% newborn calf serum. Cells were harvested using centrifugation at 2500 × g and two washes in cold PBS. The nuclei were precipitated and isolated using detergent washes, and the cytosol was extracted (24). The isolated nuclei were then resuspended, and for a portion of the extract, the chromatin (including DNA-binding proteins) was precipitated by adding 0.5 N NaCl and 5 mM MgCl₂. The proteins in solution were then loaded onto a prep cell (Bio-Rad) with a 12% T gel using an acid-labile surfactant (21). Proteins from the prep cell fractions were precipitated, treated at pH 2 for 1 h, and then separated using a Symmetry C₄ reversed-phase (RP) LC column (Waters, Milford, MA). For ~25% of identified proteins including barrier-to-autointegration factor (BAF) (see Fig. 4), the PF two-dimensional (2D) system (Beckman Coulter) was used for separation of proteins by pI, and then RPLC was carried out as outlined in the PF 2D manual.
ESI/Q-FTMS-Fractionated protein mixtures were suspended in ESI solution (49.5% MeOH, 49.5% H₂O, and 1% formic acid) and centrifuged at 14,000 rpm for 10 min. Sample solutions were then loaded into a 96-well plate and automatically introduced to the mass spectrometer using the NanoMate 100 (Advion BioSciences, Ithaca, NY). Approximately 10 µl of solution from each well were infused by automated nanospray into the heated metal capillary source. Typical samples enabled more than 40 min of stable nanospray, providing sufficient time to acquire high quality broadband MS, threshold MS/MS, and ECD MS/MS scans for two to three intact proteins per sample. In cases of insufficient fragmentation for precise localization of PTMs, excess sample was used in a more targeted fashion, and in some cases a greater number of scans were summed for collisionally activated dissociation (CAD), infrared multiphoton dissociation (IRMPD), or ECD.
The instrument used in this study was a custom 8.5-tesla Q-FTMS of the Marshall design (25). In the case of CAD external to the magnet bore, ions were selected using the quadrupole and fragmented using electrostatic acceleration (10-45 V) into an octopole pressurized to ~10 millitorr with nitrogen gas. In the case of IRMPD or ECD, a SWIFT window 7 m/z wide was used. The isolated charge state was then dissociated using infrared laser radiation for 0.25-0.45 s (with a beam expander mounted in front of the laser, 40 watts, 75% power). After threshold dissociation, the quad-enhanced and SWIFT-isolated species was dissociated using ECD. Electrons were introduced to the cell for 100-200 ms using a dispenser cathode 35 inches from the center of the magnet. The kinetic energy of the electrons was controlled by placing a 1-2-V bias potential on the filament of the dispenser cathode.
Automated Data Acquisition-A custom TCL automation script first acquired 5-10 broadband scans followed by a quadrupole marching experiment, and upon completion a modified THRASH algorithm (26) automatically determined Mr values, resulting in a peak list that was then used to select proteins for MS/MS analysis. The most abundant charge state of each protein was selectively accumulated using a notch-filtering quadrupole window 10 m/z wide, automatically acquiring 5-10 scans. For targeted proteins, 25 or 50 scans of axial CAD or IRMPD were recorded to yield protein identifications. Automatically acquired ECD spectra were the sum of 100 scans.
Construction of the Custom Human Data Base-A highly annotated data base of human protein forms was created within ProSight Warehouse (27) using conflict sequences, splicing data, PTMs from UniProt (28), SNP information from dbSNP, and a variety of manually entered data, such as new PTMs found in the primary literature. UniProt data bases were transformed from Swiss-Prot format by a custom data base loader created using Perl scripts and BioPerl libraries. To populate the data base with SNP information, dbSNP was queried for nonsynonymous, coding polymorphisms with an available corresponding protein accession number. The resultant information was populated to a local data base. Using a portion of dbSNP running locally, protein sequence information and function/description were obtained. Using custom Perl scripts, the results were converted to the necessary ProSight Warehouse format. A data base loader application then extracted the protein information and populated ProSight Warehouse with all possible protein forms based on combinations of known variations for each gene product (15). The current number of protein forms in the human data base is 2,823,267, yielding a structured query language data base of 3.5 gigabytes with 17,333 proteins containing 1-10 cSNPs for subsequent searching using ProSight Retriever (29).
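As a rough illustration of the combinatorial expansion described above, the following Python sketch enumerates protein forms for a toy sequence. It is not the ProSight Warehouse loader: the function names, the example sequence, and the cSNP list are hypothetical, and the residue masses are standard monoisotopic values rather than values taken from this study.

```python
from itertools import product

# Standard monoisotopic residue masses in Da (assumed reference values,
# not taken from the paper).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per intact chain
ACETYL = 42.010565  # mass added by N-terminal acetylation


def neutral_mass(sequence):
    """Neutral monoisotopic mass of an unmodified chain."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def enumerate_forms(sequence, csnps):
    """List (sequence, acetylated, mass) for every combination of
    cSNP alleles, Met on/off, and N-terminal acetyl on/off.

    csnps is a list of (1-based position, alternative residue) pairs;
    this input layout is hypothetical.
    """
    forms = []
    for alleles in product((False, True), repeat=len(csnps)):
        residues = list(sequence)
        for use_alt, (position, alt) in zip(alleles, csnps):
            if use_alt:
                residues[position - 1] = alt
        variant = "".join(residues)
        for met_on, acetylated in product((True, False), repeat=2):
            if not met_on and not variant.startswith("M"):
                continue  # Met removal only applies to an initiator Met
            mature = variant if met_on else variant[1:]
            mass = neutral_mass(mature) + (ACETYL if acetylated else 0.0)
            forms.append((mature, acetylated, mass))
    return forms


# Toy example: two biallelic cSNP sites on a made-up 8-residue sequence.
toy_forms = enumerate_forms("MSEIKTPR", [(4, "V"), (3, "K")])
```

In this sketch, n biallelic cSNP sites combined with four N-terminal states give 4 × 2ⁿ forms per sequence, which suggests how a data base of this kind can grow to millions of entries once known PTMs and splice variants are also combined.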
Data Analysis and Data Base Searching-Intact protein MS and MS/MS data were analyzed by THRASH (26), resulting in a protein list and fragment ion list that were uploaded onto the ProSight PTM (27) web server for data base searching (prosightptm.scs.uiuc.edu). The criteria for data base searching were generally a ±2000 Mr window and 5-20-ppm tolerance for fragment ions with default search options selected as follows: Met, on/off; acetyl, on/off; and SNPs, on. Pscores reported in this study were calculated as reported previously (16), and those <10⁻³ required no manual validation of the identification result. Unless noted otherwise, Mr and fragment ion mass values reported are for neutral, monoisotopic peaks (using external calibration), and protein identification numbers are UniProt primary accession numbers.
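The two search criteria above (an intact mass window and a ppm tolerance on fragment ions) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names and the toy data base are hypothetical, the window is taken to be in Da, and the Pscore calculation of ProSight PTM is not reproduced.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def candidates_in_window(intact_mass, database, window_da=2000.0):
    """Names of data base forms whose mass lies within +/- window_da
    of the observed intact mass (database maps name -> neutral mass)."""
    return [name for name, mass in database.items()
            if abs(mass - intact_mass) <= window_da]


def count_matching_fragments(observed, theoretical, tolerance_ppm=10.0):
    """Number of observed fragment masses that match at least one
    theoretical fragment mass within the ppm tolerance."""
    matched = 0
    for obs in observed:
        if any(abs(ppm_error(obs, theo)) <= tolerance_ppm
               for theo in theoretical):
            matched += 1
    return matched


# Hypothetical toy data base and fragment lists for illustration only.
toy_database = {"form_A": 6657.71, "form_B": 11644.8, "form_C": 25000.0}
toy_theoretical = [1000.0, 2000.0, 3000.0]
toy_observed = [1000.005, 2000.05, 2999.99]
```

With these toy values, a 10-ppm tolerance accepts the first and third observed fragments (about 5 and 3 ppm off) and rejects the second (25 ppm off), whereas a looser tolerance would accept all three.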

RESULTS AND DISCUSSION
Genotyping by Top Down MS-With one SNP present every ~1 kb in the human genome and 50,973 cSNPs currently known in dbSNP alone, well over half of human genes contain cSNPs, and top down MS/MS should enable robust genotyping even in the presence of PTMs. Fractions generated from a previously reported 2D separation of intact proteins (21) typically contain multiple proteins of varying abundance as in the ESI/Q-FTMS spectrum of Fig. 1a. Of the seven components, proteins of 6657.71 Da and 11,644.8 Da were selectively accumulated and fragmented by CAD and separately using ECD (spectra not shown). The CAD fragmentation data of Fig. 1b identified the 6.7-kDa component as a mitochondrial proteolipid (Pscore, 4 × 10⁻⁷) containing a known cSNP encoding an I9V residue change (Δm = 14.02 Da). Only the Ile-9 allele was observed with an intact mass error of 18 ppm. The 11.6-kDa component was identified from the Fig. 1c MS/MS data to be calgizzarin S100C (Pscore, 1 × 10⁻¹²). The calgizzarin gene contains a cSNP translating to a 1-Da variability (E36K), readily resolved for the Glu-36 allele observed in the background of N-terminal methionine loss/acetylation (overall 0.6-ppm error). This illustrates the efficiency of intact protein MS/MS for genotyping cSNPs, a feat not often possible using digestion-based approaches. Determination of minihaplotypes in coding regions (i.e. the co-occurrence of multiple alleles in a coding sequence) should also be possible using endogenous material itself instead of in vitro produced/artificial peptides from PCR products (30).
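The mass differences behind these two genotyping examples can be checked with a short calculation. The Python sketch below assumes standard monoisotopic residue masses (reference values, not taken from this study), and its function names are hypothetical; it compares the expected Δm of each substitution with the absolute mass error implied by the reported ppm values.

```python
# Standard monoisotopic residue masses in Da for the residues involved
# (assumed reference values, not taken from the paper).
RESIDUE_MASS = {
    "I": 113.08406, "V": 99.06841,
    "E": 129.04259, "K": 128.09496,
}


def substitution_delta(reference, variant):
    """Signed mass change (Da) when `reference` is replaced by `variant`."""
    return RESIDUE_MASS[variant] - RESIDUE_MASS[reference]


def ppm_to_da(mass, ppm):
    """Absolute mass error (Da) corresponding to a ppm error at `mass`."""
    return mass * ppm / 1e6


def is_resolved(delta, mass, ppm):
    """True when the substitution shift exceeds the measurement error."""
    return abs(delta) > ppm_to_da(mass, ppm)


ile_to_val = substitution_delta("I", "V")  # about -14.02 Da
glu_to_lys = substitution_delta("E", "K")  # about -0.95 Da
```

Under these assumptions, an 18-ppm error on 6657.71 Da corresponds to about 0.12 Da and a 0.6-ppm error on 11,644.8 Da to about 0.007 Da, so both errors are far smaller than the substitution shifts of roughly 14.02 and 0.95 Da.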
Gene-specific Identification and Genotyping of a Modified Protein-Two-dimensional fractionation of a nuclear protein [...] (Fig. 2d). This Δm is most likely acetylation of the N terminus, although this same modification at Lys-5 is formally possible. Thus, an automated data flow can now differentiate between posttranslationally modified and cSNP-containing isoforms even in highly conserved gene families.

FIG. 1. Complete characterization of multiple cSNP-containing proteins from one fraction. a, partial ESI/Q-FT mass spectrum (10 scans) of an acid-labile surfactant PAGE/RPLC sample from human cells. b, tandem mass spectrum (50 scans) from collisional dissociation of a 6.7-kDa protein selectively accumulated and fragmented using the quadrupole enhancement to FTMS. c, tandem mass spectrum (50 scans, axial CAD) from dissociation of the 11.6-kDa species at 905 m/z. d and e, graphical fragment maps generated upon data base retrieval using the MS/MS spectra of proteins highlighted in a (insets). Tall and short markers represent fragment ions produced from CAD (b/y-type) and ECD (c/z•-type), respectively. The circle in e indicates N-terminal acetylation; the shaded residues occur at known cSNP sites. theo, theoretical; exp, experimental.
Identification and Semiquantitative Analysis of Alternative Splice Variants-In a separate sample, Q-FTMS/MS analysis automatically identified an 11,977.9-Da protein as prothymosin α (ProTα; Fig. 3d). ProTα is encoded by six family members with high sequence homology (31,32). The family member observed contains four introns and from EST data is known to be alternatively spliced due to a rare GAGGAG motif that creates adjacent AG acceptor sites at the intron 2/exon 3 boundary (Fig. 3e) (33). In most tissues, ~10% of this mRNA contains an extra GAG codon (encoding an extra Glu) versus 90% of ProTα transcripts where the more 5′ acceptor site is used, producing a form with one less residue (33). Upon examination of the broadband spectrum, both species were observed in a ~10:1 ratio of light versus heavy protein (Fig. 3a). The minor species was subsequently fragmented (Fig. 3c), and the extra Glu residue was precisely localized (Fig. 3d, right).
The presence of the GAGGAG motif was recognized as a possible acceptor site by only NetGene2 (www.cbs.dtu.dk/services/NetGene2), one of five intron/exon prediction programs tested. Using BLAST to search human EST libraries (www.ncbi.nlm.nih.gov/dbEST), more than 1300 dbEST entries were attributed to ProTα with only ~150 matching the longer form, consistent with an earlier finding that the ~9:1 ratio of short:long is not tissue-specific (33). Also using BLAST, the GAGGAG motif at this locus was found only in primates. Neither rat nor mouse has the extra splice acceptor site; both have evolved only the long form of the protein, which is actually the less favorable form in humans.
Identification of a Novel Phosphoprotein-As a last illustration of new advantages provided by the top down MS approach, the 10,191.1-Da BAF protein was identified in a nuclear extract and exhibited a +79.95 ± 0.05-Da satellite peak at ~10% occupancy consistent with phosphorylation (Fig. 4a). The data from automated MS/MS localized the phosphorylation to the 11 N-terminal residues. Manual MS/MS using electrons further confirmed a Met off/acetylated N terminus and narrowed the region of phosphorylation to Thr-2 or Ser-3 (Fig. 4b). This well studied protein directly binds to chromatin, is thought to be involved in attachment of chromatin to the inner nuclear membrane (34), and is not known to be modified. No other forms of this protein have been observed in adjacent fractions, and the pI change caused by phosphorylation is small enough to allow coelution of both forms in identical fractions during chromatofocusing and RPLC. With the two-dimensional fractionation behavior of this modified protein now known, detection of this protein from nuclear extracts was reproduced twice more. This now allows targeted studies on this protein from synchronized HeLa cells in a straightforward manner. Such a platform for biochemical interrogation of targeted proteins after RNA interference, chemical perturbation, or cell synchronization will be highly valuable for capturing a more detailed picture of functional regulation mechanisms involving PTM dynamics.
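The reasoning from a measured Δm to a candidate modification, and from peak abundances to percent occupancy, can be sketched as follows. This Python example is illustrative only: the candidate list is deliberately short and not exhaustive (other modifications of similar mass would need to be considered in practice), the monoisotopic shifts are standard reference values rather than values from this study, and the function names and intensities are hypothetical.

```python
# A short, non-exhaustive list of monoisotopic mass shifts in Da
# (assumed reference values, not taken from the paper).
CANDIDATE_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "oxidation": 15.99491,
    "methylation": 14.01565,
}


def consistent_modifications(delta, uncertainty):
    """Names of candidate shifts lying within delta +/- uncertainty."""
    return sorted(name for name, shift in CANDIDATE_SHIFTS.items()
                  if abs(shift - delta) <= uncertainty)


def occupancy(unmodified_intensity, modified_intensity):
    """Fraction of the protein population carrying the modification,
    assuming both forms ionize and are detected equally well."""
    return modified_intensity / (unmodified_intensity + modified_intensity)


# Observed satellite peak from the text: +79.95 +/- 0.05 Da.
matches = consistent_modifications(79.95, 0.05)

# Hypothetical relative abundances for the two forms.
fraction_modified = occupancy(90.0, 10.0)
```

Within this limited candidate list, only phosphorylation falls inside the ±0.05-Da interval around 79.95 Da. The occupancy estimate relies on the stated assumption of equal response for the modified and unmodified forms.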
Summary of Findings and Outlook-Using a dual ion fragmentation approach to automatically analyze two to three small human proteins per fraction by top down MS/MS, 45 proteins were identified with a median probability score of 10⁻¹³ (Table I). A main advantage of the top down strategy is that information on the entire primary structure of the mature protein is obtained, allowing reliable dissection and abundance measurements of highly related gene products from genetic or transcriptional variation and enzymatic modification. Due to the complementary nature of ion fragmentation using electrons and collisions with gas, precise localization of PTMs, polymorphisms, and amino acids at splice junctions is indeed possible. For the identified proteins, 45% were found in forms not present in UniProt's Human Proteome Initiative, and ~40% contained SNPs for which only single alleles were observed. Over 85% of the identifications required no manual validation of the data base retrieval result; Δm localization sometimes improved upon inspection of the raw data. Characterization of closely related protein forms (e.g. different PTM isomers or SNP forms) sometimes required manual scrutiny of the output from ProSight PTM, with the correct form yielding the highest score in the retrieval list in ~90% of cases.
The ability to automatically genotype cSNPs and characterize PTMs with gene-specific identifications is enabled by the new informatic strategy of shotgun annotation (15), the combinatorial consideration of diverse sources of Δm values. This strategy represents a major shift in curation philosophy for protein data bases (35), is well suited for a top down approach using FTMS, and recognizes that detailed information on SNPs, mutations (36), splice variants (37), and PTMs (38) will be increasingly known and even somewhat predictable (36). By embedding such variability tightly within a MS retrieval engine, the current study drastically improves identification metrics, enables known biological events to be characterized as they occur in combination, and allows unknown biology to be uncovered more efficiently. Shotgun annotation actually increases the quality of most retrievals by allowing more absolute mass values of fragment ions observed in a top down MS/MS experiment to match those values generated from protein forms housed in a data base. The examples highlighted here illustrate an overall process that can simply be called "proteotyping." The term proteotyping is akin to genotyping at the DNA level but captures all the variability of proteins as they occur in populations and change over time. Fragmentation of intact proteins represents an emergent method for "reverse annotation" of the human genome, and top down MS can now be embraced by organizations such as the Human Proteome Organization.

istry, University of Illinois Urbana-Champaign, 39 RAL, 600 S. Matthews, Urbana, IL 61801. E-mail: Kelleher@scs.uiuc.edu.