Exponentially Modified Protein Abundance Index (emPAI) for Estimation of Absolute Protein Amount in Proteomics by the Number of Sequenced Peptides per Protein*S

To estimate absolute protein contents in complex mixtures, we previously defined a protein abundance index (PAI) as the number of observed peptides divided by the number of observable peptides per protein (Rappsilber, J., Ryder, U., Lamond, A. I., and Mann, M. (2002) Large-scale proteomic analysis of the human spliceosome. Genome Res. 12, 1231–1245). Here we report that PAI values obtained at different concentrations of serum albumin show a linear relationship with the logarithm of protein concentration in LC-MS/MS experiments. This was also the case for 46 proteins in a mouse whole cell lysate. For absolute quantitation, PAI was converted to exponentially modified PAI (emPAI), equal to 10^PAI minus one, which is proportional to protein content in a protein mixture. For the 46 proteins in the whole lysate, the deviation percentages of the emPAI-based abundances from the actual values were within 63% on average, similar to or better than determination of abundance by protein staining. emPAI was applied to comprehensive protein expression analysis and to a comparison study between gene and protein expression in a human cancer cell line, HCT116. The values of emPAI are easily calculated and add important quantitation information to proteomic experiments; therefore we suggest that they should be reported in large scale proteomic identification projects.


Molecular & Cellular Proteomics 4:1265-1272, 2005.
Proteomic LC-MS approaches combined with genome-annotated databases currently allow identification of thousands of proteins from complex mixtures (1). Approaches have also been developed for relative quantitation using stable isotope labeling (2-4). Recently not only comprehensive quantitation studies between two states (5, 6) but also protein-protein (7, 8), protein-peptide (9), and protein-drug (10) interaction analyses have been reported. So far, however, a comprehensive approach for determining protein concentrations in one sample has not been established. Protein concentrations are among the most basic and important parameters in quantitative proteomics because the kinetics and dynamics of the cellular proteome are described in terms of changes in the concentrations of proteins in particular compartments. Biological experiments often require at least some information on protein abundance for correct interpretation. In the past, crude quantitative information could be drawn from the intensity of gel staining in comparison to a known amount of marker protein.
However, in complex mixture analysis, individual proteins cannot be stained individually, and usually all information about protein abundance is lost. So far, isotope-labeled synthetic peptides have been used as internal standards for absolute quantitation of particular proteins of interest (11,12). This approach is in principle applicable to comprehensive analysis but is hampered by the high cost of isotope-labeled peptides as well as the difficulty of quantitative digestion of proteins in-gel (13).
Even a single nano-LC-MS/MS analysis can easily generate a long list of identified proteins with the help of database searching, and additional information can be extracted, such as the hit rank in identification, the probability score, the number of identified peptides per protein, ion counts of identified peptides, LC retention times, and so on. Qualitatively some parameters, such as the hit rank, the score, and the number of peptides per protein (14), can be considered as indicators for protein abundance in the analyzed sample. Among them, the integrated ion counts of the peptides identifying each protein would be the most direct parameter to describe the abundance and have been used to compare protein expression in different states (15). However, a mass spectrometer is not as versatile as an absorbance detector because of limited linearity and possibly because of background and ionization suppression effects (16). Therefore, it is necessary to normalize these parameters to obtain at least approximate quantitative information. The first approach to achieve this, to our knowledge, was to use the number of peptides per protein normalized by the theoretical number of peptides (the so-called protein abundance index (PAI)¹), and this was applied to human spliceosome complex analysis (17). PAI is superior to the number of identified peptides because it takes account of the fact that, for the same number of molecules, larger proteins and proteins with many peptides in the preferred mass range for mass spectrometry will generate more observed peptides. Independently Sanders et al. (18) developed a similar index. The number of peptides, spectra counts, or the total of the peptide probability scores in LC/LC-MS/MS analysis can also be used for relative quantitation (19-21). Here we further develop the PAI strategy to determine protein abundance from nano-LC-MS/MS experiments and present a modified form, emPAI, the exponential form of PAI minus one.
In experiments with labeled complex mixtures, into which we spiked in synthetic peptides, we show emPAI to be roughly proportional to protein abundance.

MATERIALS AND METHODS
Preparation of Cell Lysate-RPMI 1640 medium (Invitrogen) containing [13C6]Leu (Cambridge Isotope Laboratories, Andover, MA) was prepared according to the SILAC protocol of Ong et al. (4). Mouse neuroblastoma neuro2a cells were cultured in this medium for [13C6]Leu labeling. Whole cells were lysed using ultrasonication in the presence of a protease inhibitor mixture (Roche Diagnostics). HCT116-C9 cells were grown in a normal RPMI 1640 culture medium as described previously (10). Whole proteins were extracted with 5 ml of M-PER (Pierce) containing protease inhibitor mixture and 5 mM dithiothreitol.
Preparation of Peptide Mixtures for LC-MS/MS-Proteins from cell lysates were dried and resuspended in 50 mM Tris-HCl buffer (pH 9.0) containing 8 M urea. These mixtures were subsequently reduced, alkylated, and digested with Lys-C (Wako, Osaka, Japan) and trypsin (Promega, Madison, WI) as described previously (6). Digested solutions were acidified with TFA and desalted and concentrated using C18 StageTips (22), which were prepared by a fully automated instrument (Nikkyo Technos, Tokyo, Japan) with Empore C18 disks (3M, St. Paul, MN). Peptide fractionation by strong cation exchange chromatography (SCX) was performed using SCX-StageTip with 0-500 mM five-step ammonium acetate salt elution (23), and the resultant fractions were desalted using C18 StageTips prior to LC-MS/MS analysis.
Candidates for peptide synthesis containing at least one leucine and one tyrosine were selected, considering the sequences of tryptic peptides from proteins expressed in neuro2a cells. Peptides containing methionine and tryptophan were removed to avoid oxidation problems during sample preparation. In addition, peptides with double basic residues were removed, considering the frequency of missed cleavage by trypsin. The selected 54 peptides were synthesized using a Shimadzu PSSM-8 (Kyoto, Japan) with Fmoc (N-(9-fluorenyl)methoxycarbonyl) chemistry and were purified by preparative HPLC. Amino acid analysis, peptide mass measurement, and HPLC-UV were carried out for purity and structure elucidation. A solution containing equal amounts of each peptide was spiked into the peptide mixtures from neuro2a cells. Three different amounts were spiked so that peak intensity ratios of unlabeled peptides to labeled peptides were between 0.2 and 5.
Nano-LC-MS/MS Analysis-All samples were analyzed by nano-LC-MS/MS using a QSTAR Pulsar i (AB/MDS-Sciex, Toronto, Canada), a Finnigan LCQ Advantage (Thermoelectron, San Jose, CA), or a Finnigan LTQ (Thermoelectron) system equipped with a Shimadzu LC10A gradient pump and an HTC-PAL autosampler (CTC Analytics AG, Zwingen, Switzerland) equipped with Valco C2 valves with 150-μm ports. ReproSil C18 materials (3 μm, Dr. Maisch, Ammerbuch, Germany) were packed into a self-pulled needle (100-μm inner diameter, 6-μm opening, 150-mm length) with a nitrogen-pressurized column loader cell (Nikkyo) to prepare an analytical column needle with "stone-arch" frit (24). A Teflon-coated column holder (Nikkyo) was mounted on an x-y-z nanospray interface (Proxeon, Odense, Denmark), and a Valco metal connector with a magnet was used to hold the column needle and to set the appropriate spray position. The injection volume was 2.5 μl, and the flow rate was 250 nl/min after a tee splitter. The mobile phases consisted of A (0.5% acetic acid) and B (0.5% acetic acid and 80% acetonitrile). The three-step linear gradient of 5-10% B in 5 min, 10-30% in 60 min, 30-100% in 5 min, and 100% in 10 min was used throughout this study. A spray voltage of 2400 V was applied via the metal connector as described previously (24). For QSTAR experiments with the faster scan mode, MS scans were performed for 1 s to select three intense peaks, and subsequently three MS/MS scans were performed for 0.55 s each. An information-dependent acquisition function was active for 3 min to exclude the previously scanned parent ions.
Data Analysis-A Mascot version 1.9 database search engine (Matrix Science, London, UK) was used for protein identification against the Swiss-Prot protein database. The allowed number of missed cleavages was set to 1, and peptide scores to indicate identity were used for peptide identification without manual inspection of MS/MS spectra. MSQuant version 1.4a was customized for [13C6]Leu SILAC to determine the ion counts in chromatograms for absolute concentrations of proteins using known amounts of synthetic peptides. MSQuant is open source software developed by us and available at sourceforge.net.
Protein Abundance Determination-To calculate the number of observable peptides per protein, proteins were digested in silico, and the obtained peptide masses were compared with the scan range of the mass spectrometer. In addition, the expected retention times under our nano-LC conditions were calculated according to the procedure of Meek (25) and Sakamoto et al. (26) with our own coefficients based on ~1500 peptides. Peptides that were too hydrophilic or hydrophobic were eliminated. An in-house program was written in PHP to calculate the peptide number and was used to export all data to Microsoft Excel. The program is freely accessible at xome.hydra.mki.co.jp:8080/bitt/common/Menu. Regarding the number of observed peptides per protein, three methods of counting were used, i.e. 1) counting unique parent ions, 2) counting unique sequences, and 3) counting unique sequences without partial modification and the overlap caused by missed tryptic cleavage. These numbers were exported from Mascot html files to Excel spreadsheets using the "Export All Peptides" function of MSQuant software.
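The counting of observable peptides described above can be sketched in Python as follows. This is a minimal illustration and not the authors' PHP program: it assumes cleavage after K or R except before P with no missed cleavages, filters on neutral monoisotopic peptide mass rather than on the instrument's m/z scan range, and omits the retention-time filter because the coefficients used by the authors are not given here. The function names and the default mass window (800 to 2400 Da) are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the summed residue masses


def tryptic_peptides(sequence):
    """Split a protein sequence after K or R unless the next residue is P.

    Assumption: complete digestion (no missed cleavages).
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return [p for p in peptides if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_observable(sequence, min_mass=800.0, max_mass=2400.0):
    """Count tryptic peptides whose mass falls inside a mass window.

    The window is a hypothetical stand-in for the scan range; the
    retention-time filter described in the text is not implemented.
    """
    return sum(
        1
        for peptide in tryptic_peptides(sequence)
        if min_mass <= peptide_mass(peptide) <= max_mass
    )
```

The value returned by `count_observable` would play the role of the number of observable peptides per protein; a fuller version would also need the hydrophobicity-based retention-time cut.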
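The PAI is defined as

$$\mathrm{PAI} = \frac{N_{\mathrm{obsd}}}{N_{\mathrm{obsbl}}} \qquad \text{(Eq. 1)}$$

where $N_{\mathrm{obsd}}$ and $N_{\mathrm{obsbl}}$ are the number of observed peptides per protein and the number of observable peptides per protein, respectively (17). The emPAI is defined as follows.

$$\mathrm{emPAI} = 10^{\mathrm{PAI}} - 1 \qquad \text{(Eq. 2)}$$

Thus, the protein contents in molar and weight fraction percentages are described as

$$\text{Protein content (mol \%)} = \frac{\mathrm{emPAI}}{\sum(\mathrm{emPAI})} \times 100 \qquad \text{(Eq. 3)}$$

$$\text{Protein content (weight \%)} = \frac{\mathrm{emPAI} \times M_r}{\sum(\mathrm{emPAI} \times M_r)} \times 100 \qquad \text{(Eq. 4)}$$

where $M_r$ is the molecular weight of the protein, and $\sum(\mathrm{emPAI})$ is the summation of emPAI values for all identified proteins. The entire procedure for emPAI calculation is shown in Supplemental Sheet 1.

¹ The abbreviations used are: PAI, protein abundance index; emPAI, exponentially modified protein abundance index; SILAC, stable isotope labeling with amino acids in cell culture; SCX, strong cation exchange chromatography; HSA, human serum albumin.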
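As a rough illustration of the calculation, the sketch below computes PAI as observed over observable peptides and emPAI as 10^PAI minus one, as stated in the abstract. It then normalizes emPAI over all identified proteins for the molar percentage and emPAI multiplied by molecular weight for the weight percentage; that normalization is our reading of the definitions in this section. The function names and the input layout (a dictionary of tuples) are hypothetical and are not taken from Supplemental Sheet 1.

```python
def pai(n_observed, n_observable):
    """Protein abundance index: observed / observable peptides."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return n_observed / n_observable


def empai(n_observed, n_observable):
    """Exponentially modified PAI: 10**PAI - 1."""
    return 10 ** pai(n_observed, n_observable) - 1


def protein_contents(proteins):
    """Return {name: (mol_percent, weight_percent)}.

    proteins maps a name to (n_observed, n_observable, molecular_weight).
    Both percentages are normalized over every protein in the input,
    so the input should list all identified proteins.
    """
    em = {name: empai(obs, obsbl) for name, (obs, obsbl, _) in proteins.items()}
    total_em = sum(em.values())
    total_em_mr = sum(em[name] * proteins[name][2] for name in proteins)
    return {
        name: (
            100.0 * em[name] / total_em,
            100.0 * em[name] * proteins[name][2] / total_em_mr,
        )
        for name in proteins
    }


# Invented two-protein example: equal emPAI, different molecular weights.
example = {"protein_A": (4, 4, 20000.0), "protein_B": (2, 2, 60000.0)}
contents = protein_contents(example)
```

In this invented example both proteins have PAI = 1 and therefore emPAI = 9, so they share the molar fraction equally while the heavier protein takes three quarters of the weight fraction.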
To evaluate the accuracy of the parameters, a deviation factor was defined as

$$\text{Deviation factor} = \frac{\text{Value}_{\text{measured}}}{\text{Value}_{\text{estimated}}} \qquad \text{(Eq. 5)}$$

where measured values are larger than estimated values or

$$\text{Deviation factor} = \frac{\text{Value}_{\text{estimated}}}{\text{Value}_{\text{measured}}} \qquad \text{(Eq. 6)}$$

where estimated values are larger than measured values.
DNA Microarray Analysis-HCT116-C9 cells were plated at 5.0 × 10^6 cells/dish in 10-cm-diameter dishes with 10 ml of the culture medium. After 24-h preincubation, the cells were treated for 12 h with 0.015% DMSO. Duplicate experiments were performed using Affymetrix HuGene FL arrays according to established protocols. Affymetrix GeneChip software was used to extract gene signal intensities, and two sets of data were grouped and averaged based on gene symbols.

larger amounts of HSA close to 1 pmol. However, even in the region where the peak area is linear, the number of peptides does not have a linear relationship to the protein amount.
Interestingly the number of peptides shows a linear relationship to the logarithm of the injected amount from 3 to 500 fmol (Fig. 1B). A similar result was obtained on an LCQ with the slower scan cycle (see "Materials and Methods"). This finding indicates that each peak was well separated in time and that the influence of "random sampling" caused by the slower scan could be neglected under this condition. In this case, three ways were used to count peptides: 1) all parent ions including different charge states from the same peptide sequences, 2) all peptides excluding different charge states and partial modifications such as methionine oxidation, and 3) peptides with unique sequences excluding peptides overlapped by missed tryptic cleavage sites. Fig. 1B shows that the number of peptides based on unique parent ions (Fig. 1B, squares; method 1 above) shows the best correlation with the logarithm of protein abundance. We believe that these results are not due to the particular conditions used but are a more general phenomenon. Recently two groups independently presented similar curves relating the number of peptides to the concentration of proteins (19, 27). Although neither of them analyzed the logarithmic relationship, it appears to us that their data are also consistent with a linear relationship between the logarithm of protein concentration and the number of peptides. At present, it is not clear why the logarithm of protein concentration correlates with the number of observed peptides; this relationship is likely to be due to a combination of processes and probably holds only approximately. In any case, it is a common experience that the mass spectrometric peptide signals from the digestion of a protein are vastly different. For example, substantially increasing the sequence coverage of a protein often requires protein amounts that are orders of magnitude larger, and conversely dilution by small factors often does not decrease sequence coverage very much.
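The log-linear relationship described here can be checked on any dilution series with an ordinary least-squares fit of peptide count against log10 of the injected amount. The sketch below is one way to do this in plain Python; the function name is hypothetical and the three data points are invented for illustration only, not taken from Fig. 1B.

```python
import math


def fit_log_linear(amounts_fmol, peptide_counts):
    """Fit count = slope * log10(amount) + intercept by least squares.

    Returns (slope, intercept, r), where r is the Pearson correlation
    between log10(amount) and the peptide count.
    """
    xs = [math.log10(amount) for amount in amounts_fmol]
    ys = [float(count) for count in peptide_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Invented dilution series: each tenfold increase adds six peptides.
amounts = [5.0, 50.0, 500.0]
counts = [4, 10, 16]
slope, intercept, r = fit_log_linear(amounts, counts)
```

With real data the slope gives the number of additional peptides observed per tenfold increase in amount, and r indicates how well the log-linear description holds.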
PAI of 46 Proteins in Complex Mixture Solutions-To test performance of the PAI index in complex mixtures, we next investigated known amounts of 54 proteins in a whole cell lysate. Tryptic peptides from mouse neuroblastoma neuro2a cells SILAC-labeled with [13C6]Leu (4) were measured by a single LC-MS/MS run with the QSTAR instrument, and 336 proteins were identified based on 1462 peptides. For accurate absolute quantitation, we spiked 54 synthetic peptides containing [12C6]Leu into this sample solution, one for each protein, and quantified the corresponding tryptic peptides containing [13C6]Leu. Eight peptides were not quantified because they resulted in overlapping peaks in the extracted ion current chromatograms. Together 46 proteins ranging in molecular mass from 13 to 193 kDa were quantified in the range from 30 fmol to 1.8 pmol/μl in the sample solution as listed in Table I. In complex protein mixtures, two additional factors should be considered. One is the influence of protein size on the number of peptides. Generally larger proteins generate more detectable peptides. Therefore, observable peptides were used for normalization as done previously except that we used the predicted peptide retention time as an additional filter. The other factor is the mixture complexity. A very large number of peptides exist in total cell lysate, and the number of observed peptides could to some extent be influenced by the random selection for MS/MS events, ion suppression effects, and saturation of the MS analyzer and/or the detector. Nevertheless Fig. 1C shows that there is still a linear relationship between log[protein] and the number of observed peptides normalized by the number of observable peptides per protein even when different proteins were plotted into one graph. Compared with other parameters, PAI correlated most highly with the logarithm of protein amount (Fig. 1C, r = 0.89, deviation factor (average ± S.D.) = 1.6 ± 0.5) followed by number of peptides divided by protein Mr (Fig. 1D, r = 0.84, deviation factor = 1.8 ± 0.8), a measure similar to PAI except that it ignores how well the protein sequence can generate tryptic peptides in the correct mass range for mass spectrometry. Commonly used proxies for protein abundance such as Mascot score and the number of peptides correlate much worse with protein abundance (r = 0.72, deviation factor = 2.7 ± 2.4 and r = 0.71, deviation factor = 2.7 ± 2.6, respectively).
Absolute Quantitation Using emPAI-Although PAI can estimate the abundance relationship between proteins, it cannot express the molar fraction directly. Therefore, we derived a new parameter, emPAI, that is the exponential form of PAI minus 1 (Equation 2) and that is directly proportional to protein content as shown in Fig. 1E. To calculate the absolute concentrations, total protein amounts were measured as weight by BCA assay, and the weight fractions of 46 proteins among 336 neuro2a proteins were calculated using Equation 4. As shown in Fig. 1F, the emPAI-based concentrations were highly consistent with the actual values (y = 0.973x, r = 0.93), and the deviation factors ranged from 1.03 to 4.98 with an average of 1.74 ± 0.79. The outlier at (x, y) = (10.6, 2.13) is clathrin heavy chain (CLH_RAT). Mouse clathrin is not in the current Swiss-Prot but in TrEMBL (Q68FD5_MOUSE), which was not used for protein identification. Q68FD5_MOUSE is not identical in sequence to CLH_RAT. It is possible that the number of observed peptides would increase using Q68FD5_MOUSE or other sequences instead of CLH_RAT, although Q68FD5_MOUSE needs more validation for Swiss-Prot entry. Note that these measures of confidence compare favorably with protein abundance by comparative gel staining and indeed with the Bradford assay itself used here to measure total protein amount (28). Furthermore just as there are proteins known to stain well, the emPAI of certain proteins could also be adjusted in the future. In any case, the emPAI approach seems to provide a reasonably accurate estimate for comprehensive absolute quantitation.
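The deviation factor used to report accuracy is the ratio of the larger to the smaller of the measured and estimated values, so it is always at least 1. A minimal Python sketch follows; the function names are hypothetical, and the use of the sample standard deviation is an assumption because the text does not say which form of S.D. was used.

```python
import statistics


def deviation_factor(measured, estimated):
    """Ratio of the larger value to the smaller value (always >= 1)."""
    if measured <= 0 or estimated <= 0:
        raise ValueError("both values must be positive")
    return max(measured, estimated) / min(measured, estimated)


def summarize(pairs):
    """Mean and sample S.D. of deviation factors for (measured, estimated) pairs.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    factors = [deviation_factor(m, e) for m, e in pairs]
    return statistics.mean(factors), statistics.stdev(factors)


# The two coordinates quoted in the text for the clathrin outlier.
outlier = deviation_factor(10.6, 2.13)
```

For the pair of values quoted for the clathrin outlier, (10.6, 2.13), the function returns about 4.98, which matches the upper end of the reported range of deviation factors.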
Dependence of emPAI on Experimental Conditions-In this experiment, we used the fast MS and MS/MS cycle time on the QSTAR to maximize the number of MS/MS events. When a slower cycle was used, the deviation from the linear relationship between emPAI and the protein concentrations increased (Fig. 2, A and B) due to the random sampling effects mentioned above. This effect was more pronounced when an LCQ ion trap instrument was used (Fig. 2C) presumably because the limited trap capacity results in a biased peak selection for more abundant proteins, and indeed a larger deviation was observed for more abundant proteins. We used an LTQ, a linear ion trap instrument that has a higher capacity and a faster scan time when compared with the LCQ (29, 30), to evaluate the influence of the cycle time. The use of LTQ data improved the accuracy of emPAI in comparison to LCQ data (Fig. 2D). However, the faster scan cycle of LTQ compared with QSTAR did not provide better correlation between emPAI and protein abundance. This could be because of the limited capacity of LTQ to trap ions even in the linear configuration. We also evaluated the influence of sample complexity by using multidimensional chromatography (23, 31). As shown in Fig. 2E, SCX fractionation improved emPAI accuracy. To confirm the effect of the reduction of sample complexity on emPAI accuracy, we used both QSTAR and LTQ in combination with SCX fractionation and obtained in total 2752 identified proteins with 11,727 non-redundant peptides from neuro2a cells. The correlation between emPAI values of the 46 test proteins and their protein abundances was significantly improved as shown in Fig. 2F. Note that PAI values of 22 of the 46 proteins were more than one in this analysis, whereas only two proteins had PAI values of more than one in the QSTAR analysis without SCX fractionation. This result shows that under the current conditions emPAI was not saturated. However, it would be possible to saturate emPAI if some proteins are highly abundant. We observed this in our analysis of the malaria proteome where hemoglobin was extremely abundant because the samples were prepared from red blood cells (15). Extremely abundant proteins may furthermore affect the efficiency of protein identification because of ionization suppression and detector saturation as well as the limited loading capacity of LC columns. The removal of extremely abundant proteins is therefore required to improve the identification efficiency and can be achieved by gel-enhanced LC-MS (one-dimensional gel followed by slicing, digesting, and LC-MS analysis) as shown in our malaria proteome study or albumin depletion treatment for plasma proteome studies. Such a treatment will also remove the influence of emPAI saturation. We also examined the influence of the injected sample amounts on the emPAI-based molar fractions. Using the whole cell lysate of neuro2a cells, three different levels (basal and 3× and 9× dilutions) were analyzed by LC-MS/MS. For 20 proteins with commonly identified peptides in all three analyses, constant values of the molar fraction were obtained (deviation factors were 1.66 ± 0.55 for 3× dilution and 1.85 ± 0.85 for 9× dilution, respectively), whereas emPAI values depended on the injected amounts as expected.
Application to Comprehensive Protein Expression Analysis-The emPAI is a convenient and easily obtained index that can be used to produce protein expression data from any LC-MS/MS runs. We applied this approach to obtain data for comparison with gene expression data in HCT116 human cancer cells. A DNA microarray provided expression data for 4971 genes, whereas a single LC-MS run provided 402 identified proteins based on 1811 peptides with unique sequences. Bridging gene symbols with protein accession numbers resulted in a total of 227 gene/protein pairs for the expression comparison study. A weak correlation was observed in Fig. 3 as expected from previous studies on yeast (19, 32). Interestingly most of the outliers were ribosomal proteins. It is well known that, unlike prokaryotes such as Escherichia coli, mammalian cells regulate the expression levels of ribosomal proteins not only by transcription but also at the steps of transport of mRNA and translation and by degradation of excess amounts of proteins not associated with rRNA (33, 34). Accordingly in a comparison study between gene and protein expression levels using emPAI for E. coli, we did not find such a deviation of ribosomal proteins.² Although neither gene nor protein expression data are sufficiently accurate to discriminate a 10% difference, for instance, it is quite helpful to obtain a broad overview as shown above. We also note that the protein quantitation error of our simple emPAI is similar to or better than the error in determining mRNA expression in DNA microarrays.
Conclusions-We have established a scale for estimating absolute protein abundance named emPAI. Because emPAI is easily calculated from the output information of database search engines such as Mascot, it is possible to apply this approach to previously measured or published datasets to add quantitative information without any additional steps. emPAI can also be used for relative quantitation especially in cases where isotope-based approaches cannot be applied because of quantitative changes that are too large for accurate measurements of ratios, because metabolic labeling is not possible, or because sensitivity constraints do not allow chemical labeling techniques. In such cases, emPAI values of proteins in one sample can be compared with those in another sample, and the outliers from the emPAI correlation between two samples can be identified as increasing or decreasing proteins.
This emPAI approach was applied to multidimensional separation-MS/MS to extend the coverage of proteins. Further improvement would be possible by optimizing MS instrument-dependent parameters such as ionization dependence on m/z region. Because the emPAI index can be calculated with a simple script and does not require further experimentation in protein identification experiments, we suggest its routine use in the reporting of proteomic results.

* Work at LSFT and CEBI was supported by NEDO (New Energy and Industrial Technology Development Organization, Japan) and the Danish National Research Foundation, respectively. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.