From the Mayo Proteomics Research Center, Divisions of Biostatistics and ¶ Epidemiology, and || Department of Endocrinology, Mayo Clinic College of Medicine, Rochester, Minnesota 55905 and ** Department of Chemistry, W. M. Keck FT-ICR Mass Spectrometry Laboratory, North Carolina State University, Raleigh, North Carolina 27695
| ABSTRACT |
|---|
Interpreting the resulting spectra, however, is not completely straightforward. Contributions from the artificially enriched 18O isotopes and naturally occurring isotopes (particularly 13C and 34S) combine to form complex, overlapping isotopic distributions especially for high molecular weight peptides where the monoisotopic peak is no longer the most abundant isotope. Schnolzer et al. (10) demonstrated that once the initial cleavage and exchange has occurred the proteolytic peptide forms a pseudosubstrate for the enzyme, causing the incorporation of a second 18O atom. However, different peptide pseudosubstrates have different reaction rates, Km, with trypsin, which causes peptide-to-peptide variability in the incorporation of 18O. The reaction rate is determined by such factors as peptide length (10) and sequence composition (11).
Recent digestion protocols (11, 12) introduce a further complication because two digestion steps are utilized: a first digestion using solution-based trypsin in H216O and a second digestion using immobilized trypsin in H218O. For peptides with high incorporation rates, the two-step procedure is effectively the same as the single step digestion; however, with low incorporation rate peptides, the possibility arises that a peptide molecule cleaved in H216O may never reassociate with trypsin in H218O. Bantscheff et al. (13) observed that, when using the two-step digestion procedure, it is important to account for the peptide molecules with no 18O atoms at their C termini that remain in the labeled sample. In addition to these complications, other factors that can affect the labeling process include back exchange and pH (14).
Several different methods are described in the literature for calculating 16O/18O ratios while accounting for variable incorporation. Yao et al. (5) used a ratio combining the experimental abundances of the 18O0, 18O1, and 18O2 peaks along with the theoretical isotopic abundances of those peaks; chemical composition information derived from known peptide sequence or from product ion spectra was used to obtain the theoretical isotopic abundances. Johnson and Muddiman (15) removed the requirement for sequence identification/product ion spectra by utilizing the average amino acid averagine (16) to calculate approximate chemical compositions, by modeling the contributions of 13C and 34S with a power function, and by including this in the ratio calculation. Both of these calculations assume a single step digestion and do not account for 18O0 abundance in the labeled sample.
At least two software implementations exist that perform ratio calculations; both have limitations. Halligan et al. (17) automated the process of combining the peptide sequence information from Sequest (18) with either the Yao et al. (5) or Johnson and Muddiman (15) ratio calculations. They required zoom scans, utilized the protein identification data to determine the monoisotopic mass, and assumed that this mass appeared experimentally. Qian et al. (19) utilized THRASH (20) to detect peptide pairs and then applied the ratio equations of Johnson and Muddiman (15). THRASH, however, uses a single theoretical isotopic distribution and must separately detect the 18O0 and 18O2 species; therefore it will have difficulty with more massive peptides or those with incomplete incorporation.
Our current work describes a completely automated method for locating and interpreting 18O-labeled isotopic clusters in parent ion chromatograms without requiring product ion or selected ion monitoring/zoom spectra. Central to our algorithm is the use of linear regression to simultaneously fit all peaks in the isotope cluster rather than just the peaks representing the 18O0, 18O1, and 18O2 species. We use the residuals from the regression to compute a fitting score between the theoretical and experimental isotopic distributions that can be used to rank potential candidate biomarkers. Mirgorodskaya et al. (21) also used linear regression with 18O-labeled internal standards but measured the relative abundances of the 18O1 and 18O2 species for a given peptide in a separate experiment instead of allowing these abundances to vary in the regression.
We also describe a method of distinguishing highly up-regulated peptides from highly down-regulated or C-terminal peptides. When using highly enriched H218O, peptides that have fully incorporated two 18O atoms and that appear only in the labeled sample appear identical to peptides that have incorporated no 18O atoms either because they were C-terminal (and thus not a substrate for trypsin) or because they were significantly down-regulated. By lowering the enrichment level of the H218O to, for example, 90%, a unique peak "signature" is generated that the algorithm can recognize to distinguish these two cases. Furthermore our algorithm accounts for the residual 18O0 abundance that remains in the labeled sample from peptides with low incorporation rates; this is particularly relevant when utilizing the two-step digestion procedure.
In a manner similar to THRASH, the algorithm performs automated "reduction" of spectra. No assumption is made regarding monoisotopic mass, which is determined using an alignment procedure similar to that described by Senko et al. (16) and utilized by THRASH. However, THRASH operates directly on the raw spectral data; this allows it to detect species at very low signal-to-noise ratios but also makes it very slow. Instead we have focused on centroided peak data supplied by the vendor software from parent ion mass spectra for reasons of speed and because such data were readily available. The algorithm is indeed quite fast, interpreting a chromatogram of 1,300 spectra in less than 1 min; THRASH might take several hours to process such data. The algorithm was developed using data from FT-ICR instrumentation but should be generally applicable to any instrument with sufficient resolving power.
We will describe in detail the operation of our algorithm, entitled regression analysis applied to mass spectrometry (RAAMS), and evaluate its performance using datasets with both known and unknown ratios.
| MATERIALS AND METHODS |
|---|
, respectively. The second dataset consisted of the same two proteins digested in 90% H218O or 100% H216O and mixed in the same ratios. Each sample was then analyzed by LC-FT-ICR. Thus, each dataset consisted of nine chromatograms at nine different 18O/16O ratios. Note that the two pure proteins were present at approximately equal amounts in both the labeled and unlabeled samples; only the ratios of labeled to unlabeled digests of these proteins were varied. It is important to understand that these known ratios were generated for verification purposes only; a typical 18O experiment would consist of a single mixed sample with a different (and unknown) 18O/16O ratio for each protein. These two datasets are available upon request (see below) in mzXML format.
Sample Preparation
Human albumin, human apotransferrin, DTT, and iodoacetamide were purchased from Sigma. Sequencing grade trypsin for solution-based trypsin digestion was purchased from Promega (Madison, WI). Trypsin immobilized on agarose beads was purchased from Pierce. Disposable size exclusion columns (Micro Bio-Spin, P6 in Tris) were purchased from Bio-Rad. 18O-Enriched water (99 atom % enrichment) was purchased from Isotec (Miamisburg, OH). Solvents for LC-MS were purchased from Burdick and Jackson. Formic acid was from Fluka.
Two sample tubes, each containing 400 µg of albumin and 400 µg of transferrin, were initially reduced with DTT and alkylated with iodoacetamide. Samples were digested with solution-based trypsin in H216O, 10% acetonitrile overnight at 37 °C. After this initial digestion, a second digestion with immobilized trypsin was used to exchange the respective isotope labels into each sample. A 45-µl volume of immobilized trypsin beads, previously washed with 50 mM Tris, was added to each sample, and samples were concentrated to dryness on a Savant vacuum centrifuge. 50 µl of either H216O or the 18O-enriched water were added to their respective samples, which were again concentrated to dryness by vacuum centrifuge. For the isotope exchange digestions, 100 µl of either H216O, 99% 18O-enriched water, or 90% 18O-enriched water were added to samples followed by 20 µl of acetonitrile that had been dried by storing it over a bed of sodium sulfate. 90% 18O-enriched water was prepared by adding a 10% volume of HPLC grade H216O to 99% 18O-enriched water. Samples were then vortexed for 7 h at room temperature on a Genie Vortex 2. After the exchange digestion, 2 µl of neat formic acid and 2 µl of 5% heptafluorobutyric acid were added to each sample, and samples were vortexed briefly and centrifuged for 5 min at 6,000 × g. 95 µl of supernatant from each sample were transferred to new microcentrifuge tubes for use in preparing prescribed ratios between the two samples. Two steps of 10-fold dilution (50 µl diluted to 500 µl) were performed for each albumin/transferrin sample using HPLC grade water containing 2% acetonitrile, 0.1% heptafluorobutyric acid, and 0.0005% Zwittergent 3-16. 18O/16O samples with ratios ranging from 100% 18O to 100% 16O were prepared from these individual standards. Sample mixtures were prepared with 18O/16O ratios of 20, 10, 3, 2, 1, 0.5, 0.33, 0.1, and 0.05.
Plasma samples were first depleted of the six most abundant proteins using an Agilent multiaffinity column. The non-retained fraction from each sample was assayed for total protein content using a Bradford assay and was denatured with urea, reduced, alkylated, digested, and labeled using a protocol from Prolytica™ (Stratagene, La Jolla, CA). The labeled digests were then mixed based upon their total protein content, desalted on a reversed phase cartridge, and fractionated on a 4.6-mm-inner diameter PolySULFOETHYL aspartamide SCX column using a potassium phosphate/potassium chloride buffer system at pH 3.0 with 20% acetonitrile. SCX aliquots were vacuum concentrated to dryness and reconstituted in water containing 2% acetonitrile, 0.1% heptafluorobutyric acid, and 0.0005% Zwittergent 3-16.
Nanoscale LC-MS Analyses
Nanoscale LC-MS analyses were performed on a Thermo LTQ-FT mass spectrometer (Thermo Electron, San Jose, CA) interfaced to an Eksigent NanoLC-1D system (Eksigent Technologies, Livermore, CA). An Eksigent AS1 autosampler and an auxiliary isocratic pump flowing at 10 µl/min were used to load 10-µl sample injections onto a 0.25-µl Opti-Pak precolumn (Optimize Technologies, Oregon City, OR) packed with Michrom Magic C8, 5 µm, 200 Å (Michrom BioResources, Inc., Auburn, CA). Peptide separations were performed on a 15-cm by 75-µm-inner diameter column packed with Magic C18, 5 µm, 200 Å (also Michrom BioResources, Inc.). A gradient from 5 to 40% B over 60 min was used with a column flow rate of 400 nl/min where mobile phase A was water/acetonitrile/n-propyl alcohol/formic acid (98:1:1:0.2 by volume), and mobile phase B was acetonitrile/n-propyl alcohol/water/formic acid (80:10:10:0.2 by volume).
FT-ICR-MS data and data-dependent linear ion trap MS/MS data were collected in parallel on an LTQ-FT mass spectrometer (Thermo Electron Corp., Bremen, Germany). FT-ICR-MS survey data were collected in full profile mode from m/z 350 to 1,900 at 100,000 resolving power (full width, half-maximum at m/z 400) using two microscans and a target population of 1 × 10^6 charges for automatic gain control (AGC) in the ICR cell. The linear ion trap performed MS/MS experiments on the three most abundant ions from the FT-ICR survey scans using two microscans, a precursor isolation width of 2 m/z units, 35% normalized collision energy, and 30 ms activation time. An automatic gain control target value of 1 × 10^4 was used for MS/MS experiments in the LTQ mass analyzer.
Software Implementation
Software was implemented in C++ on Linux and MacOS X. Performance measurements were made on a computer with a 3-GHz Xeon processor running Linux 2.6. Software and datasets are available as open source upon request.
| COMPUTATIONAL DATA INTERPRETATION |
|---|
θ_A and θ_B, respectively, can be used to find differentially expressed peptides on either an absolute (θ_A − θ_B) or relative (θ_B/θ_A) scale. The estimated variance-covariance matrix of the estimates is also returned, allowing confidence intervals to be formed. The algorithm operates in three steps. First, a list of all potential isotope clusters is generated based on peak spacing consistent with any allowed charge state. Some isotope peaks will belong to multiple potential isotope clusters. Second, the ion abundances of each of the potential clusters are fit to a theoretical isotopic model using linear regression, giving a best fit to each in terms of the relative amounts of 18O0, 18O1, and 18O2 for that peptide. Finally the algorithm decides between potential isotope clusters that share peaks based on their residuals from the regression fits. Each step is described in further detail below and summarized in Table I.
500 Da may legitimately have only two peaks (the A (18O0) and A + 1 (13C118O1) peaks with negligible 13C contribution) above the limit of detection, but these are very difficult to distinguish from noise spikes. Three special cases converge to make it difficult to determine, based only on peak spacing, the correct charge state for a given run of peaks. (a) Noise peaks can interdigitate between real peaks. (b) Isotope clusters can legitimately overlap and share one or more peaks. (c) Low abundance peaks can fall below the signal to noise threshold of the peak detection algorithm. Examples of these are shown in Supplemental Fig. S1. To address the issue of noise peaks, legitimate peak sharing, and missing peaks, the algorithm attempts to fit all charge states that are consistent with the peak spacing data. The algorithm also tolerates up to two missing peaks internal to a candidate isotope cluster. In each of the three cases, the result is one or more isotope clusters with correct charge state assignments and one or more incorrect assignments that share peaks with the correct assignments. We resolve between these correct and incorrect isotope clusters in step three of the algorithm (described below).
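The spacing-based search in step one can be sketched roughly as follows. This is a minimal illustration, not the RAAMS implementation: the function names, the 20 ppm tolerance, the minimum of three peaks, and the greedy run-extension strategy are all assumptions made for the example; only the idea of trying every allowed charge state and tolerating up to two internal missing peaks comes from the text.

```python
ISOTOPE_SPACING = 1.00235  # Da between adjacent isotope peaks (value quoted later in the text)


def find_peak(mz, target, tol_ppm):
    """Return the index of the peak closest to target within tol_ppm, or None."""
    best = None
    for i, m in enumerate(mz):
        err = abs(m - target) / target * 1e6
        if err <= tol_ppm and (best is None or err < best[0]):
            best = (err, i)
    return None if best is None else best[1]


def candidate_clusters(mz, charges=(1, 2, 3), tol_ppm=20.0,
                       max_missing=2, min_peaks=3):
    """List (charge, peak_indices) runs whose spacing matches 1.00235/z."""
    mz = sorted(mz)
    clusters = []
    for z in charges:
        step = ISOTOPE_SPACING / z
        claimed = set()  # peaks already placed in a run for this charge
        for start in range(len(mz)):
            if start in claimed:
                continue
            run, missing, gap = [start], 0, 0
            position = mz[start]
            while missing + gap <= max_missing:
                position += step
                hit = find_peak(mz, position, tol_ppm)
                if hit is None:
                    gap += 1  # provisional: only counts if a later peak is found
                else:
                    run.append(hit)
                    position = mz[hit]  # re-anchor on the observed peak
                    missing += gap
                    gap = 0
            if len(run) >= min_peaks:
                clusters.append((z, run))
                claimed.update(run)
    return clusters
```

Because every charge state is tried independently, the same peak can appear in several candidates; choosing among them is deferred to step three.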
Step Two: Examine Peak Abundances
In step two of the algorithm, the candidate isotope clusters from step one are evaluated based on their abundances. The purpose of step two is to determine the contributions of the three label states (exchange of zero, one, or two oxygen atoms at the C terminus of the peptide) to the experimentally observed abundances of the isotope cluster by least squares fitting of a linear combination of expected isotopic abundances for each of these states. For each candidate isotope cluster, its charge state (determined in step one) and the mass of its first peak are used to compute an approximate neutral mass, and then an approximate chemical composition is determined using the average amino acid averagine, which has the chemical formula C4.9384H7.7583N1.3577O1.4773S0.0417 (16). Optionally the algorithm can adjust up or down the number of sulfurs predicted by averagine and perform the steps below for these additional expected chemical compositions.
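As an illustration of the averagine step, the sketch below converts an observed m/z and charge into an approximate elemental composition. The averagine unit mass of 111.1254 Da is the value reported by Senko et al.; the function names, the simple rounding, and the `sulfur_shift` parameter (standing in for the optional sulfur adjustment) are assumptions for this example, and the hydrogen correction used in the original averagine procedure is omitted.

```python
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
AVERAGINE_MASS = 111.1254  # Da, average mass of one averagine unit
PROTON_MASS = 1.007276     # Da


def neutral_mass(mz, z):
    """Approximate neutral mass from the m/z of the first peak and the charge."""
    return (mz - PROTON_MASS) * z


def averagine_composition(mass, sulfur_shift=0):
    """Scale the averagine formula to the given mass and round to whole atoms."""
    units = mass / AVERAGINE_MASS
    composition = {el: round(n * units) for el, n in AVERAGINE.items()}
    # Optional adjustment of the predicted sulfur count (never below zero).
    composition["S"] = max(0, composition["S"] + sulfur_shift)
    return composition
```

For a 2,000-Da peptide this gives roughly C89 H140 N24 O27 S1; calling it again with `sulfur_shift=1` or `-1` produces the alternative compositions that the text says may optionally be fitted.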
As shown in Fig. 2a, for each chemical composition, three theoretical isotopic distributions are generated using Mercury (22) that correspond to the three label states. The first theoretical isotopic distribution is that which would result if all the molecules of peptide had not exchanged either of their C-terminal oxygen atoms (the "18O0 label state") and is shown as black bars in Fig. 2a. The second theoretical distribution is that which would result if all of the molecules of peptide exchanged exactly one of their C-terminal oxygen atoms for an oxygen atom from the H218O used for labeling (the 18O1 label state). If the purity of the H218O were 100%, the 18O1 distribution would be exactly the same as that of the 18O0 label state except shifted to the right by 2 Da and thus would be indistinguishable. As will be explained later, H218O of 90% purity is used instead, resulting in the 18O1 distribution shown with light gray bars in Fig. 2a. Similarly the third theoretical distribution (18O2 label state) would result if all of the peptides exchanged both of their C-terminal oxygen atoms for two oxygen atoms drawn from the 90% pure H218O. The three theoretical isotopic distributions are arranged in the columns of a matrix, X, as shown in Fig. 2b (note that in this figure, the matrices are shown transposed). The X matrix is scaled so that each column sums to 1. A column of ones is also added to represent the intercept. For a further discussion of the regression model, see Eckel-Passow et al. (23).
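The construction of X can be sketched as below. This is not Mercury: as a simplifying assumption it convolves rounded natural isotope abundances at nominal (integer) mass offsets, and it models each exchanged oxygen by convolving with (1 − p, 0, p) for water of purity p without removing the replaced natural oxygen from the composition. All function names are hypothetical.

```python
# Natural isotope abundances indexed by nominal mass offset (+0, +1, +2, ...).
NATURAL = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def convolve(a, b, limit=None):
    """Discrete convolution of two abundance lists, optionally truncated."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out if limit is None else out[:limit]


def natural_distribution(composition, n_peaks=10):
    """Isotopic distribution of the unlabeled (18O0) state."""
    dist = [1.0]
    for element, count in composition.items():
        for _ in range(count):
            dist = convolve(dist, NATURAL[element], n_peaks)
    return dist


def design_matrix(composition, purity=0.90, n_peaks=10):
    """Rows of [intercept, 18O0, 18O1, 18O2]; each label column sums to 1."""
    base = natural_distribution(composition, n_peaks)
    one_exchange = [1.0 - purity, 0.0, purity]
    shifts = [[1.0], one_exchange, convolve(one_exchange, one_exchange)]
    columns = []
    for shift in shifts:
        col = convolve(base, shift, n_peaks)
        col += [0.0] * (n_peaks - len(col))
        total = sum(col)
        columns.append([v / total for v in col])
    return [[1.0] + [col[i] for col in columns] for i in range(n_peaks)]
```

With `purity=1.0` the 18O1 and 18O2 columns become pure 2- and 4-Da shifts of the first column, which is the indistinguishable case the text describes; at 0.90 each labeled column acquires the smaller peaks that make the label states separable.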
The goal then is to determine the linear combination of the columns of X that most closely resembles all the experimental peaks. This is done using non-negative least squares regression (24, 25), which defines the best fit as that with the lowest sum-of-squares error. The result of this regression is a column vector β with elements representing the amount of experimental abundance that can be accounted for by exchange of zero (β_1), one (β_2), or two (β_3) of the C-terminal oxygen atoms or by an abundance offset (β_0). The abundance parameters β_1 through β_3 are constrained to be non-negative because they represent the amount of abundance in each of the three label states. By using a least squares fit, the abundances of all peaks in the isotope cluster are used in determining the abundances from sample A and sample B (θ_A and θ_B; described below) instead of just the abundances of the A, A + 2, and A + 4 peaks as in previous work (5, 15). Mirgorodskaya et al. (21) also used regression analysis for quantification with 18O-labeled internal standards but measured the relative abundances of the 18O1 and 18O2 species in a separate experiment rather than allowing them to vary in the regression.
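A small, dependency-free sketch of such a constrained fit is shown below. It is not the Lawson-Hanson algorithm cited in the text: because there are only three constrained columns, it simply tries every subset of them, solves ordinary least squares on each subset, and keeps the feasible solution with the lowest residual sum of squares. Treating the intercept (column 0) as unconstrained is an assumption, and all names are hypothetical. In practice a library routine such as SciPy's `scipy.optimize.nnls` would be the more usual choice when all coefficients are constrained.

```python
from itertools import combinations


def solve(a, b):
    """Solve a small square system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None  # singular system
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(X, y, cols):
    """Ordinary least squares on the chosen columns via the normal equations."""
    a = [[sum(row[i] * row[j] for row in X) for j in cols] for i in cols]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in cols]
    return solve(a, b)


def rss(X, y, beta):
    """Residual sum of squares of the fit."""
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def constrained_fit(X, y, free=(0,)):
    """Least squares with all coefficients >= 0 except the columns in `free`."""
    n = len(X[0])
    constrained = [j for j in range(n) if j not in free]
    best = None
    for k in range(len(constrained) + 1):
        for active in combinations(constrained, k):
            cols = sorted(set(free) | set(active))
            sol = least_squares(X, y, cols) if cols else []
            if sol is None:
                continue
            beta = [0.0] * n
            for j, v in zip(cols, sol):
                beta[j] = v
            if any(beta[j] < 0 for j in constrained):
                continue  # infeasible: violates non-negativity
            err = rss(X, y, beta)
            if best is None or err < best[0]:
                best = (err, beta)
    return best[1], best[0]
```

The exhaustive search is only reasonable because the model has three label-state columns; it returns the coefficient vector and the residual sum of squares.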
The regression is performed for each possible alignment of theoretical distribution to experimental peaks. The alignment with the lowest residual sum-of-squares error in the fit is chosen as correct; this is similar in concept to the alignment procedure described by Senko et al. (16). In choosing the correct alignment, it is important to account for the fact that the isotopic distribution of complete incorporation of two 18O atoms closely resembles that of the natural abundance distribution and that the two distributions are thus indistinguishable. This means that very low residuals can be obtained by a partial match to only the rightmost peaks of a cluster (incorrectly assuming, for instance, that the experimental 18O2 is the theoretical 16O2) because this fit will have fewer peaks and therefore less error. To prevent this, four zeros are added to each end of the X matrix when determining the correct alignment; these zeros (shown in gray in Fig. 2b) introduce high residuals for those problematic alignments at the tails of the distribution that would otherwise effectively ignore a large amount of experimental abundance. However, they also bias the parameter estimates. Therefore, once the correct alignment is found, a version of X lacking these extra "isotopes" is fit to the y vector for the correct alignment, and the resulting β is used for all further calculations.
The result of this final non-negative least squares fitting is an estimate of the vector β where elements 1 through 3 represent the abundances contributed by the exchange of zero, one, or two of the C-terminal oxygen atoms, respectively, and element 0 represents an intercept term. These values take into account and correct for the presence of both the naturally occurring isotopes (13C, 34S, etc.) and the impurity of the H218O (e.g. 90% purity). However, to convert these values into the original sample abundances θ_A (unlabeled; 16O) and θ_B (labeled; 18O), we must take into account the fact that trypsin has different affinities for different peptide pseudosubstrates. Two events must occur for a peptide molecule to incorporate an 18O atom. 1) That peptide must react (exchange) with trypsin, and 2) an 18O atom must be chosen from the 90% H218O. The probability of a given peptide molecule having at least one interaction with trypsin is equal to (1 − e^(−Kt)) where K is a reaction constant that varies from peptide to peptide and t is reaction time; we denote this probability p_c. If a given peptide molecule reacts at least once with trypsin, the probability of obtaining an 18O atom is equal to the purity of the H218O, denoted p. Thus, for a given peptide molecule c, the probability of incorporating a C-terminal 18O atom is as follows.

$$ s_c = p_c \, p \tag{1} $$

Conversely the probability of not incorporating an 18O atom is (1 − s_c). The probability s_c can be thought of as an incorporation rate or "effective purity."
There are four sets of molecules we must take into consideration: 1) molecules from sample A, which never contain 18O, 2) molecules with no exchange events in sample B, 3) molecules where exactly one of the C-terminal oxygen atoms has had at least one exchange event in sample B, and 4) molecules where both C-terminal oxygen atoms have had at least one exchange in sample B. In the single step digestion procedure, the probability of a molecule having no exchange events in sample B is zero. This is because, in the single step procedure, peptide cleavage and labeling both occur simultaneously in H218O. However, in the two-step digestion, there is the possibility that a peptide may be cleaved in H216O but then never (re)react with trypsin in H218O. In this procedure, the observed abundance in the A peak will be a combination of abundances from unlabeled molecules in both sample A and sample B. Thus, β_1 is equal to the abundance of molecules from sample A plus those molecules from sample B where neither of the C-terminal oxygen atoms has exchanged. β_2 is equal to the abundance of molecules from sample B where only one of the C-terminal oxygen atoms has exchanged. The probability of this occurring is equal to the probability of having one 18O atom exchange and the other not exchange; there are two ways this could be done, and thus the final probability is 2p_c(1 − p_c). β_3 is equal to the abundance of molecules from sample B where both C-terminal oxygen atoms have exchanged as follows.

$$ \beta_1 = \theta_A + \theta_B\,(1 - p_c)^2 \tag{2} $$

$$ \beta_2 = 2\,\theta_B\,p_c\,(1 - p_c) \tag{3} $$

$$ \beta_3 = \theta_B\,p_c^{2} \tag{4} $$
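Solving sequentially for p_c, θ_B, and θ_A yields the following.

$$ p_c = \frac{2\beta_3}{\beta_2 + 2\beta_3} \tag{5} $$

$$ \theta_B = \frac{(\beta_2 + 2\beta_3)^2}{4\beta_3} \tag{6} $$

$$ \theta_A = \beta_1 - \frac{\beta_2^{2}}{4\beta_3} \tag{7} $$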
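As a check on the algebra, the three solutions can be derived from the label-state model described in the text, assuming (as the text does) that the two C-terminal oxygen atoms exchange independently, each with probability p_c. The symbols θ_A and θ_B for the abundances of samples A and B are notation adopted for this sketch.

```latex
\begin{align*}
\frac{\beta_2}{\beta_3}
  &= \frac{2\,\theta_B\,p_c(1-p_c)}{\theta_B\,p_c^{2}}
   = \frac{2(1-p_c)}{p_c}
  &&\Longrightarrow\quad p_c = \frac{2\beta_3}{\beta_2 + 2\beta_3},\\[4pt]
\theta_B
  &= \frac{\beta_3}{p_c^{2}}
   = \frac{(\beta_2 + 2\beta_3)^{2}}{4\beta_3},\\[4pt]
\theta_A
  &= \beta_1 - \theta_B(1-p_c)^{2}
   = \beta_1 - \frac{\beta_2^{2}}{4\beta_3}.
\end{align*}
```

For instance, with θ_A = 100, θ_B = 200, and p_c = 0.8 the model gives β_1 = 108, β_2 = 64, and β_3 = 128; substituting these back returns p_c = 256/320 = 0.8, θ_B = 320²/512 = 200, and θ_A = 108 − 4096/512 = 100.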
Thus, by assuming that the rate of incorporation of an 18O atom at the C terminus of a peptide is not influenced by the presence there of another 18O atom, we can use a function of the ratio of one to two incorporations of 18O (visible experimentally as the ratio β_2/β_3) to estimate the incorporation rate s_c. Then the amounts of material in control (θ_A) and disease (θ_B) samples can be corrected for the incorporation rate using our estimate of s_c. When the three parameters β_1 through β_3 are non-zero, we calculate the ratio of material in sample B (labeled) to that in sample A (unlabeled) as θ_B/θ_A, making use of the correction for s_c as shown above. This requires that we detect at least the 18O0, 18O1, and 18O2 peaks to obtain estimates for β_1, β_2, and β_3 and that these estimates are not constrained to zero by the non-negative regression. In cases where at least one of the parameter estimates is constrained to zero, we directly use the ratio (β_2 + β_3)/β_1. In both cases we constrain the result to a minimum or maximum ratio to avoid dividing by zero. We could also use s_c to filter out those peptides that, due to a low incorporation rate, do not accurately report the true ratio of amounts. Fig. 2c demonstrates that the regression also provides predicted isotopic abundances, denoted ŷ and shown as circles.
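The ratio rules in this paragraph can be sketched as a single function. The closed-form expressions follow from the label-state model in the text (β_3 proportional to p_c², β_2 to 2p_c(1 − p_c)); the function name, the dictionary keys, and the clamping limits of 0.01 and 100 are hypothetical choices for this example, since the text does not state the limits used.

```python
def sample_abundances(beta1, beta2, beta3, purity=0.90,
                      min_ratio=0.01, max_ratio=100.0):
    """Convert label-state abundances into sample abundances and a B/A ratio."""
    if min(beta1, beta2, beta3) > 0:
        p_c = 2.0 * beta3 / (beta2 + 2.0 * beta3)        # exchange probability
        theta_b = (beta2 + 2.0 * beta3) ** 2 / (4.0 * beta3)
        theta_a = beta1 - beta2 ** 2 / (4.0 * beta3)
        s_c = p_c * purity                                # incorporation rate
    else:
        # At least one estimate was constrained to zero: use the direct ratio.
        p_c = s_c = None
        theta_a, theta_b = beta1, beta2 + beta3
    if theta_a <= 0:
        ratio = max_ratio                                 # avoid dividing by zero
    else:
        ratio = min(max(theta_b / theta_a, min_ratio), max_ratio)
    return {"theta_A": theta_a, "theta_B": theta_b,
            "p_c": p_c, "s_c": s_c, "ratio": ratio}
```

Returning `s_c` alongside the ratio makes it straightforward to filter out peptides with a low incorporation rate, as the text suggests.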
To evaluate the goodness of fit between the experimental and predicted isotopic abundances, we compute a fitting "score" as in Equation 8,

$$ \text{score} = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i y_i} \tag{8} $$

where the sum of the squared residuals from the fit is divided by the sum of experimental peak abundances. The denominator serves to normalize the error from the regression for comparison across isotope clusters with varying abundances and numbers of peaks. Lower scores are better in that clusters that more closely agree with their predicted abundances will have lower scores. Optimal scoring is an area of active investigation for our group.
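The fitting score described here amounts to only a few lines of code. The function name is hypothetical, and the inputs are assumed to be equal-length sequences of experimental and predicted peak abundances for one isotope cluster.

```python
def fitting_score(observed, predicted):
    """Sum of squared residuals divided by the sum of experimental abundances."""
    residual = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return residual / sum(observed)
```

For example, observed abundances of 100, 60, and 20 against predictions of 98, 63, and 19 give (4 + 9 + 1)/180, or about 0.078.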
Step Three: Decide between Charge Assignments
The third and final step of the algorithm further examines the candidates from step one that share common peaks. Recall that missing peaks, noise peaks, and overlapping isotope clusters introduce complications into the charge assignment process. Additional fits are attempted to decide which of these assignments are correct, and these additional fits result in isotope clusters that share peaks. Furthermore this peak sharing can be either legitimate (when two real isotope clusters overlap) or artifactual as in the case of noise peaks or missing peaks (see examples in Supplemental Fig. S1). A strict rule might be to never allow a peak to participate in more than one isotope cluster, but this would incorrectly eliminate some clusters that legitimately shared peaks.
Two competing factors must be accounted for in deciding between isotope clusters that share peaks. Isotope clusters with better (lower) fitting scores should be preferred over those with worse fitting scores. However, fitting score is influenced by the number of peaks in the isotope cluster: clusters with fewer peaks will have less error. This means that, for some isotope clusters, an incorrect assignment with lower charge state (and therefore fewer peaks) will have a better fitting score than a correct assignment with higher charge state. Therefore, we must also evaluate the amount of abundance that the isotope cluster accounts for out of the surrounding peaks. Obviously isotope clusters that use more of the surrounding peaks (i.e. a higher charge state) should be preferred over those that use fewer of them (i.e. a lower charge state).
We combine these two competing factors by computing a charge assignment score for each isotope cluster in the transitive closure of isotope clusters that share peaks (if an isotope cluster A shares peaks with B and B shares peaks with C, then ABC constitutes the transitive closure of isotope clusters that share peaks). We compute a charge assignment score for each isotope cluster in the transitive closure as the squared error from the fitting for this isotope cluster plus the squared abundances for all the other peaks in the transitive closure that were not used by that cluster (denoted y') divided by the sum of abundances of all peaks in the transitive closure as follows.
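$$ \text{charge assignment score} = \frac{\sum_i (y_i - \hat{y}_i)^2 + \sum_j (y'_j)^2}{\sum_k y_k} \tag{9} $$

Here the first sum runs over the peaks of the isotope cluster being scored (experimental abundances y_i and fitted abundances ŷ_i), the second over the peaks y'_j of the transitive closure that the cluster does not use, and the denominator over all peaks y_k in the transitive closure.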
We then rank each isotope cluster by charge assignment score from lowest to highest and examine each in turn in this order. Isotope clusters that share peaks and have charge states that are even multiples of each other are not allowed to co-exist. The algorithm decides among clusters that violate this sharing rule by picking the one with the best (i.e. lowest) charge assignment score. In this way, legitimate peak sharing is allowed (Supplemental Fig. S1b has overlapping clusters with z = 2 and z = 3, which are allowed), but the incorrect, duplicate clusters introduced by noise or missing peaks are removed.
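Step three can be sketched as follows. The data layout (each cluster a dict with hypothetical keys `z`, `peaks`, and `sq_error`, plus a dict of peak abundances) is an assumption, as is the reading of "multiples of each other" as one charge state being an integer multiple of the other (for example z = 1 and z = 2, but not z = 2 and z = 3). A small union-find is used to form the transitive closures.

```python
def transitive_closures(clusters):
    """Group cluster indices that share peaks directly or through a chain."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if clusters[i]["peaks"] & clusters[j]["peaks"]:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(clusters)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def charge_assignment_score(cluster, closure_peaks, abundance):
    """(squared fit error + squared abundance of unused peaks) / total abundance."""
    unused = closure_peaks - cluster["peaks"]
    numerator = cluster["sq_error"] + sum(abundance[p] ** 2 for p in unused)
    return numerator / sum(abundance[p] for p in closure_peaks)


def conflicting(a, b):
    """True if two clusters share peaks and one charge is a multiple of the other."""
    shares = bool(a["peaks"] & b["peaks"])
    hi, lo = max(a["z"], b["z"]), min(a["z"], b["z"])
    return shares and hi % lo == 0


def resolve(clusters, abundance):
    """Return indices of the clusters kept after resolving shared peaks."""
    accepted = []
    for group in transitive_closures(clusters):
        closure_peaks = set().union(*(clusters[i]["peaks"] for i in group))
        ranked = sorted(group, key=lambda i: charge_assignment_score(
            clusters[i], closure_peaks, abundance))
        kept = []
        for i in ranked:  # best (lowest) score first
            if not any(conflicting(clusters[i], clusters[k]) for k in kept):
                kept.append(i)
        accepted.extend(kept)
    return sorted(accepted)
```

Processing clusters from the lowest score upward means that a low-charge duplicate created by missing or noise peaks loses to the cluster that explains more of the surrounding abundance, while overlapping clusters with unrelated charge states both survive.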
Evaluation of Alignment Correctness
A key factor influencing the correctness of the ratios produced by the algorithm is whether it determines the correct monoisotopic mass, that is to say whether it determines the correct alignment between the theoretical and experimental isotopic distributions. When the algorithm incorrectly assigns the monoisotopic mass, it generally assigns either the 18O1 or 18O2 (A + 2 or A + 4) isotope as the monoisotopic mass, introducing a shift in mass of 2 or 4 Da. Although there are indeed some species that differ by exactly a small, integer number of daltons, the most likely explanation for such a mass error is that the algorithm has incorrectly assigned the monoisotopic mass. Because peptides with masses less than 4,000 daltons have readily detectable monoisotopic masses, determining the correct alignment is trivial using the chromatogram from the sample digested in H216O. We used these two pieces of knowledge (±4-Da shift and correct monoisotopic mass from 16O sample) to verify the algorithm when fitting variable ratios.
We ran the algorithm, using a single theoretical isotopic distribution corresponding to the natural isotopic abundances for the chemical composition predicted by averagine, against the 16O chromatogram and then combined the appearance of species at multiple charge states, in multiple spectra chromatographically, and across each of the samples by using mass and retention time thresholds of 20 ppm and 120 s, respectively, as described previously for our label-free approach (26). (We also used thresholds in retention time to remove artifact species, such as polymer introduced by the chromatography, as described by Mason et al. (27).) For each species we determined a consensus monoisotopic mass across all isotopic clusters for that species as their abundance weighted average of monoisotopic mass.
Next we applied the algorithm to all nine chromatograms each of the 90 and 99% enrichment datasets but allowed the algorithm to fit unknown 18O/16O ratios. We then used the same grouping procedure on this "unknown" data with one exception: we allowed two isotope clusters A and B to be grouped across multiple spectra and multiple charge states if mass(A) = mass(B) + n × 1.00235 Da ± 20 ppm where n varies from 0 to 4. We then rated the correctness of the algorithm in determining the monoisotopic mass for each cluster in the unknown data as whether its monoisotopic mass is the same (within 20 ppm) as the consensus monoisotopic mass from the 16O data.
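The mass-based parts of this evaluation can be sketched as below. The function names are hypothetical, retention-time matching (the 120-s threshold) is left out for brevity, and the ppm error is computed relative to the target mass, which is an assumption about a detail the text does not specify.

```python
ISOTOPE_SPACING = 1.00235  # Da, spacing used in the grouping rule


def consensus_mass(masses, abundances):
    """Abundance-weighted average monoisotopic mass for one species."""
    return sum(m * a for m, a in zip(masses, abundances)) / sum(abundances)


def same_species(mass_a, mass_b, max_shift=4, tol_ppm=20.0):
    """True if mass_a = mass_b + n * 1.00235 Da within tol_ppm for n in 0..max_shift."""
    for n in range(max_shift + 1):
        target = mass_b + n * ISOTOPE_SPACING
        if abs(mass_a - target) / target * 1e6 <= tol_ppm:
            return True
    return False


def alignment_correct(monoisotopic_mass, consensus, tol_ppm=20.0):
    """True if the fitted monoisotopic mass matches the 16O consensus mass."""
    return abs(monoisotopic_mass - consensus) / consensus * 1e6 <= tol_ppm
```

A cluster whose fitted monoisotopic mass sits about 2 or 4 Da above the consensus is still grouped with its species by `same_species` but is rated incorrect by `alignment_correct`, which is how misalignments are counted.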
| RESULTS |
|---|
0.1 and 0.5, very few isotope clusters of high abundance are observed, and their ratios vary widely from the expected. Manual inspection of 50 isotope clusters with a range of abundances and with scores between 0.1 and 0.105 revealed that 42 of 50 were correctly assigned. However, at scores between 0.2 and 0.205, only 5 of 50 were correct. Based on this we have used a threshold value of 0.1, shown by a solid vertical line. Note that at very high scores (>0.5) more abundant peaks are again observed; manual inspection revealed that many of these occur when noise peaks are separated (by random chance) from a highly abundant peak by exactly 1/z. This results in very high residuals for these incorrect fits.
Fig. 3b shows a histogram, for each of the nine expected ratios, of the number of isotope clusters (horizontal axis) that have a given ratio (vertical axis, which has the same scale as in a). Only those clusters with scores less than the threshold of 0.1 are shown. Stacked bars are colored red, green, or black to indicate alignment correctness (note: truncated bars are one-third the correct length). In general, the experimental ratios correspond well with the expected ones out to expected ratios of about 20 and 0.05; the distributions do broaden, however, at the more extreme ratios. Some bias toward 16O is observed in the ratios; reverse labeling experiments and/or normalization may be useful in correcting this.
Fig. 4 shows FT-ICR mass spectra of a single tryptic peptide of BSA, having been digested in (a) normal H216O, (b) H218O reported by the vendor as >99% pure, and (c) 90% pure H218O. In Fig. 4b the majority of the peptide molecules have incorporated two 18O atoms, producing the expected 4-Da mass shift. As can be seen from comparing Fig. 4, a and b, the isotopic abundance distributions of 16O- and that of 18O-labeled peptides have a very similar shape (imagine shifting the isotopic distribution in Fig. 4a 4 daltons higher). Thus, when a peptide exists in high abundance in one sample but is at a level below the detection limit in the other, it is not possible to tell from which sample it originated. This determination depends in large part on the alignment step in the algorithm that attempts to determine which peak is the monoisotopic peak; if the wrong peak is decided upon, the algorithm will compute incorrect ratios. Moreover the C-terminal peptide of each protein in the sample will not incorporate 18O label because it is (typically) not a substrate for trypsin. A spectrum such as in Fig. 4, a or b, could then be the result of a peptide that is highly up-regulated, is highly down-regulated, or exists only in one of the samples; it could also result from a C-terminal peptide.
of the peptide molecules have incorporated only a single 18O atom, and a small fraction, 4%, have not incorporated any. Thus, using 90% H218O introduces a signature that discriminates, for example, an unlabeled C-terminal peptide from a highly up-regulated, labeled (18O2) peptide.
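The labeling pattern described here can be reasoned about with a simple binomial model. The sketch below assumes, as a simplification not stated in the text, that each of the two C-terminal oxygen atoms independently ends up as 18O with probability `p` (an effective purity). The function name `label_state_fractions` is illustrative only.

```python
def label_state_fractions(p):
    """Return the fractions of molecules carrying 0, 1, and 2 18O atoms.

    Assumption (hypothetical model): the two C-terminal oxygens are
    exchanged independently, each being 18O with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    unlabeled = (1 - p) ** 2          # neither oxygen is 18O
    single = 2 * p * (1 - p)          # exactly one oxygen is 18O
    double = p ** 2                   # both oxygens are 18O
    return unlabeled, single, double


# Compare a nearly pure label with two lower effective purities.
for p in (0.99, 0.90, 0.80):
    f0, f1, f2 = label_state_fractions(p)
    print(f"p={p:.2f}: 18O0={f0:.4f}  18O1={f1:.4f}  18O2={f2:.4f}")
```

Under this model a 4% unlabeled fraction corresponds to an effective `p` of 0.8, and the singly labeled species becomes a sizable, easily recognized component as `p` drops below 1.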
Fig. 5 shows the advantage of using impure H218O in distinguishing extreme ratios by depicting histograms of the number of correct versus incorrect alignments for each of the nine expected 18O/16O ratios in both 99 and 90% H218O. Across a range of abundance bins (reported as the base 10 logarithm of the abundance of the most abundant peak in each isotope cluster), the number of isotope clusters where the alignment was determined to be correct is shown in green, whereas the number with incorrect alignments is shown in red. The number of isotope clusters where the correct alignment could not be determined (see above for procedure) is shown in black. Only isotope clusters with scores less than the threshold of 0.1 are included. In Fig. 5a the digest was performed in 99% pure H218O. At lower ratios (more 16O), the algorithm has difficulty determining the correct alignment, as shown by the large number of isotope clusters with incorrect alignments, because the expected distributions are indistinguishable to the algorithm. However, when the purity of the H218O is dropped to 90%, as shown in Fig. 5b, the algorithm does better, as indicated by the lower number of incorrect alignments at low 18O/16O ratios. This is because the additional impurity causes a distinct pattern of peaks in the extreme 18O case, where most of the abundance in the combined sample comes from the labeled sample (where the 18O/16O ratio is high). The algorithm uses this fact to distinguish the two extremes of labeling during the alignment process in step two. At abundance levels where the small (1%) peaks that distinguish the pure 18O case fall below the specific signal-to-noise level of the instrument, the algorithm again struggles to determine the correct alignment, as indicated by the increasing fraction of red at the lower abundances in both a and b at all ratios. Ratios closer to 1 are easier to detect at lower abundance because they are more likely to have all three of the A, A + 2, and A + 4 peaks. Work to determine the optimum level of impurity is ongoing. Note that the total number of isotope clusters is not equal between Fig. 5, a and b, due to differences in chromatography.
This approach is demonstrated in Fig. 6, which shows a histogram of the distribution of the rate of incorporation of 18O, indicated by the effective purity sc, across isotope clusters found in the nine chromatograms with known ratios. Higher sc values indicate more efficient labeling, which would suggest higher affinity of trypsin for a given peptide. Three representative peptides are shown in a, b, and c in both the 16O sample (top), used to determine the correct monoisotopic mass, and the all-18O sample (bottom), which should show complete incorporation; the predicted abundance distribution is indicated by circles. Moving from right to left, Fig. 6c shows a peptide with a high sc value, indicating substantial incorporation of 18O. Fig. 6b shows a peptide at intermediate sc where less than half the molecules have incorporated one 18O atom and almost none have incorporated two. Finally, Fig. 6a shows a peptide with a low sc value, indicating poor incorporation; indeed very few molecules have incorporated one 18O atom, and essentially none have incorporated two. Peptides with such a low incorporation rate are problematic because it is difficult to estimate the incorporation rate correctly when so little 18O2 is seen. In general, these peptides are simply not amenable to quantification.
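To make the notion of an effective purity concrete, the following sketch estimates a per-oxygen incorporation probability from the relative amounts of unlabeled, singly labeled, and doubly labeled molecules. It is an illustration only and not the authors' estimation procedure: it assumes a binomial model with two independent exchange sites and assumes that the three label-state abundances have already been separated from the overlapping natural isotope peaks. The function name is hypothetical.

```python
def estimate_effective_purity(n0, n1, n2):
    """Estimate the per-oxygen 18O incorporation probability.

    n0, n1, n2: abundances attributed to molecules carrying 0, 1, and 2
    18O atoms (assumed already corrected for natural isotope overlap).
    Under a two-site binomial model the estimate is the mean number of
    18O atoms per molecule divided by the two available sites.
    """
    total = n0 + n1 + n2
    if total <= 0:
        raise ValueError("at least one abundance must be positive")
    return (n1 + 2 * n2) / (2 * total)


# A well-labeled peptide versus a poorly labeled one (made-up abundances).
print(estimate_effective_purity(4.0, 32.0, 64.0))
print(estimate_effective_purity(90.0, 9.5, 0.5))
```

When almost all of the signal sits in the unlabeled state, the estimate rests on very small singly and doubly labeled abundances, which illustrates why low-incorporation peptides are hard to quantify.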
50 is shown in Fig. 7b, where circles indicate the predicted isotopic abundances. Fig. 7c shows a slightly down-regulated peptide (18O/16O ratio, 0.4) with a very good fit (score, 0.007). Fig. 7d shows a peptide with a neutral mass of 4,224 Da; the algorithm calculates an 18O/16O ratio of 2.8. This isotopic distribution would be challenging to interpret manually or with previous ratio calculations. It should be noted that, as peptide mass increases, isotopic distributions become dominated by 13C isotopes and broaden considerably. Above 2,500 Da, the naturally occurring 13C makes it more difficult to detect the presence of the 18O introduced by the labeling.
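The broadening with mass can be illustrated with a rough calculation. The sketch below is an approximation under stated assumptions, not part of the algorithm: it estimates the number of carbon atoms from the neutral mass using the commonly used averagine composition (about 4.9384 carbons per 111.1254 Da), takes the natural abundance of 13C as roughly 1.07%, and ignores all other elements. The function names are illustrative.

```python
from math import comb

AVERAGINE_MASS = 111.1254     # average mass of one averagine unit (Da)
AVERAGINE_CARBONS = 4.9384    # carbon atoms per averagine unit
P_13C = 0.0107                # approximate natural abundance of 13C


def carbon_isotope_pattern(neutral_mass, n_peaks=8):
    """Binomial 13C-only isotope pattern for a peptide of the given mass."""
    n_c = round(neutral_mass / AVERAGINE_MASS * AVERAGINE_CARBONS)
    return [
        comb(n_c, k) * P_13C ** k * (1 - P_13C) ** (n_c - k)
        for k in range(n_peaks)
    ]


def most_abundant_peak(neutral_mass):
    """Index of the tallest peak (0 = monoisotopic) in the 13C-only pattern."""
    pattern = carbon_isotope_pattern(neutral_mass)
    return max(range(len(pattern)), key=pattern.__getitem__)


for mass in (1000, 2500, 4224):
    pattern = carbon_isotope_pattern(mass, n_peaks=6)
    print(mass, most_abundant_peak(mass), [round(x, 3) for x in pattern])
```

In this approximation the natural A + 2 and A + 4 peaks of a heavy peptide already carry substantial abundance, which is where the 18O-labeled species would otherwise stand out.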
| DISCUSSION |
|---|
Our algorithm makes use of centroided peak lists. This is in sharp contrast to THRASH, which does not perform a priori peak detection and instead directly examines the full-profile spectral data. Whereas we only attempt alignments that correspond to the apexes of peaks (typically 10–12 alignments per cluster), THRASH tries a series of several thousand alignments in very small m/z increments, assuming that each of the several thousand samples in the spectral region under examination may contain the monoisotopic mass. This makes THRASH much more robust in its ability to detect low abundance clusters with peaks below the signal-to-noise threshold of a peak picking algorithm but also makes it orders of magnitude slower. Also, although the present work uses linear regression to find the best fit in a least squares manner in both m/z and abundance, THRASH iterates an overall abundance parameter to scale the theoretical isotopic distribution being aligned.
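The difference between a closed-form least squares fit and an iterated abundance parameter can be sketched as follows. This is a minimal illustration of fitting only the abundance scale for one candidate alignment; it is not the authors' implementation, it omits the m/z regression, and the normalized residual returned as a score is a hypothetical stand-in rather than the score defined in the paper.

```python
def fit_scale(theoretical, observed):
    """Least squares scale of a theoretical pattern onto observed peaks.

    theoretical, observed: equal-length sequences of abundances for the
    peaks matched under one candidate alignment.
    Returns (scale, score), where score is the residual sum of squares
    divided by the observed sum of squares (0 means a perfect fit).
    """
    if len(theoretical) != len(observed) or not theoretical:
        raise ValueError("inputs must be non-empty and of equal length")
    tt = sum(t * t for t in theoretical)
    oo = sum(o * o for o in observed)
    if tt == 0 or oo == 0:
        raise ValueError("abundances must not all be zero")
    # Closed-form solution of min_s sum((o - s * t)^2); no iteration needed.
    scale = sum(t * o for t, o in zip(theoretical, observed)) / tt
    rss = sum((o - scale * t) ** 2 for t, o in zip(theoretical, observed))
    return scale, rss / oo


template = [1.0, 0.5, 0.25]
print(fit_scale(template, [2.0, 1.0, 0.5]))   # well-aligned cluster
print(fit_scale(template, [0.5, 1.0, 2.0]))   # misaligned cluster
```

Because each candidate alignment costs only a few sums, evaluating every peak apex and every allowed charge state stays inexpensive, and the alignment with the lowest score can simply be kept.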
Our algorithm attempts all allowed charge states that are consistent with the peak data; THRASH, in contrast, uses a Fourier-Patterson charge state determination algorithm (30) to determine the prevailing charge state in a spectral region. THRASH only falls back to fitting all allowed charge states if the fit resulting from the charge state determined by the Fourier-Patterson algorithm is poor. Our method relies on the fact that performing the regression is very fast and that far fewer possible alignments are attempted. THRASH also depends on a subtraction step in which the predicted isotopic abundances are subtracted from the original spectrum. Experience has shown this subtraction to be problematic in that it often leaves behind residuals that may have spectral characteristics, resulting in the detection of artifactual isotope clusters with acceptable fitting scores. We avoid the subtraction by simply fitting the two clusters separately.
| CONCLUSIONS |
|---|
The algorithm was applied here to FT-ICR data but should be generally applicable to any instrument with sufficient resolving power. Also, although it was designed for 18O data, it handles unlabeled isotopic distributions as a special case and could be used to interpret label-free experiments as well. It could also be easily adapted for other labeling moieties. In the future we hope to improve our scoring scheme, to incorporate knowledge of the expected distribution of the incorporation rate (sc), and to incorporate protein identification data when available.
| ACKNOWLEDGMENTS |
|---|