A data processing pipeline for mammalian proteome dynamics studies using stable isotope metabolic labeling

In a recent study, in vivo metabolic labeling using (15)N traced the rate of label incorporation among more than 1700 proteins simultaneously and enabled the determination of individual protein turnover rate constants over a dynamic range of three orders of magnitude (Price, J. C., Guan, S., Burlingame, A., Prusiner, S. B., and Ghaemmaghami, S. (2010) Analysis of proteome dynamics in the mouse brain. Proc. Natl. Acad. Sci. U. S. A. 107, 14508-14513). These studies of protein dynamics provide a deeper understanding of healthy development and well-being of complex organisms, as well as the possible causes and progression of disease. In addition to a fully labeled food source and appropriate mass spectrometry platform, an essential and enabling component of such large scale investigations is a robust data processing and analysis pipeline, which is capable of the reduction of large sets of liquid chromatography tandem MS raw data files into the desired protein turnover rate constants. The data processing pipeline described in this contribution is comprised of a suite of software modules required for the workflow that fulfills such requirements. This software platform includes established software tools such as a mass spectrometry database search engine together with several additional, novel data processing modules specifically developed for (15)N metabolic labeling. These fulfill the following functions: (1) cross-extraction of (15)N-containing ion intensities from raw data files at varying biosynthetic incorporation times, (2) computation of peptide (15)N isotopic incorporation distributions, and (3) aggregation of relative isotope abundance curves for multiple peptides into single protein curves. In addition, processing parameter optimization and noise reduction procedures were found to be necessary in the processing modules in order to reduce propagation of errors in the long chain of the processing steps of the entire workflow.


Description of Algorithms
The whole data processing pipeline is illustrated in Figure S1.PAVA.The peaklist extraction module, written in Visual Basic, accesses the raw data files generated by LTQFT through functions provided by the Xcalibur software (ThermoFisher Scientific, Bremen, Germany).PAVA separates scans according to (1) MS levels (MS, MSMS, or MS3), (2) final detection devices (for LTQFT, either LTQ or FT can be used), and (3) dissociation methods for MSMS or MS3 scans.All current dissociation methods including collision-induced dissociation(CID) , electron transfer dissociation(ETD), electron capture dissociation( ECD), infrared multiphoton dissociation(IRMPD), pulsed-q dissociation(PQD), and high collision energy dissociation(HCD) are support in PAVA.PAVA also separates survey scans into full or zoom scans.The output peaklists for MSn (n>=2) scans are formatted as Matrix Science's (Boston, MA) mgf format, which is compatible with most available database search engines.For spectra acquired in LTQ, data are first centroided if acquired in profile mode.The centroid peaklists are subjected to two stages of noise filtering similar to that employed in a previous study (1).In the first stage, the most intense peak is chosen within a +-1.1 m/z unit window.This step removes all isotope peaks except the most abundant from an ion's isotopic distribution.In the second stage, the top six peaks were chosen within a +-27 unit window.For spectra acquired in FT devices, PAVA centroids the spectra using a five-point routine with no noise filtering.PAVA allows batch processing and is very fast.It requires less than 10 seconds to process a 100MB raw data file containing 8000 MSMS scans acquired in centoid mode on the LTQFT instrument.The PAVA program is executable in a Windows environment, and requires installation of the Xcalibur software.
Protein Prospector (2,3).This in-house database search engine allows for simultaneous searches of multiple peaklist mgf files against a combined forward/reverse or random database.The peptide identification data were formatted as a tab delimited text file.fitXIC.This module, developed based on a concept of "identified ion chromatogram"(4), extracts the ion chromatogram for a peptide ion which has been identified by a database search.First, survey scans are centroided by PAVA's centroid routine described early and the identified ion chromatogram is extracted from the centroid data using theoretical monoisotopic m/z of the peptide and a large retention time window, centered at the MSMS time.The ion chromatogram is then linearly interpolated onto an even-spaced grid and moving window averaging is used to reduce the noise.The approximate peak location is found at the maximum of the smoothed peak and approximate peak width is obtained by measuring the smooth peak width at half peak maximum.The approximate peak height is obtained by centroiding the original data centered at the approximate peak location.These peak parameters are used as initial fitting values and the original data is fitted with a modified Gaussian function with a parabolic variance (PVMG) (5).Peak location, peak height at peak location, peak area, peak width, and coefficient of determination (R2) are obtained from the fit of the raw data.
CrossExtract.This module extracts all 15 N incorporated ions in raw data files of different 15 N incorporation time points, based on the retention times determined from the fitXIC processing and on m/z values from the elemental compositions of the identified ions.For an identified peptide, the number of possible nitrogen atoms (NNA) being metabolically labeled can be easily calculated from its sequence or elemental composition.Externally introduced nitrogen atoms, such as those in the chemical thiol alkylation groups, are not counted.The maximal possible 15 N incorporation is determined by the enrichment of the algae in the diet (99.56%).
To determine how many isotope peaks need to be extracted, the maximally incorporated isotope distribution is simulated by the mercury2 algorithm (6).The heaviest isotope peak of the distribution to keep must have an intensity of at least 1% relative to the intensity of the most abundant peak in the distribution.The number of ions needs to be extracted (denoted as NPX) is the Dalton units from the monoisotopic peak of the no 15 N incorporation to the heaviest isotope peak of the most incorporated distribution plus one, including the monoisotopic peak.The m/z of each of the ions to be extract is calculated by use of the monoisotopic m/z of the peptide ion plus the product of the m/z difference between 15 N and 14 N atoms and the number of 15 N atoms.Theoretically, the mass difference between the adjacent isotopes can be classified into two types.One type is due to the substitution of an 15 N atom for an 14 N (0.99703489Da) and the other is approximately the difference between an 13 C and an 12 C (1.00335484Da), in a natural isotope distribution.If spectra were acquired under an ultrahigh mass resolving power conditions, those peaks with the same nominal mass could be separated.Under our operating conditions, the fine structures (7) were not resolved.
The CrossExtract module averages survey scans in the raw data files within the retention time window with a width on the order of the extracted ion chromatographic peak width, centered at the peptide retention time.A built-in function of the Xcalibur software is called to average scans acquired in profile mode.The averaged spectral segment is then centroided and matched with the theoretical m/z values of possible 15 N labeled ions.A mass tolerance was obtained from the precursor m/z error distribution of identified peptide ions.The direct cross extraction is feasible only if the variation of retention time is small compared to the chromatographic peak width.We avoid the need to consider LC alignment (8) procedures by achieving retention time stability experimentally.As will be discussed in more detail, variation in retention times is minimized (a) by use of constant column temperature, (b) by normalization of injection volume according to total protein concentration, and (c) by proper aging of the column.

NNLS.
The module first synthesizes NNA +1 theoretical isotope distributions using the mercury2 algorithm (6) (an example is given in Figure 2).The mercury2 algorithm is adapted so that the total abundance of the isotope distribution was normalized to one.
The simulated distributions correspond to the peptide's elemental composition with 0 to NNA numbers of 15 N, at 99.56% enrichment.Recall that the total number of isotope peaks to be extracted, NPX, is the sum of the number of possible 15 N incorporations (NNA) and the number of isotope peaks with higher masses over the first isotopic peak in the most incorporated distribution.The theoretical distributions for all the numbers of 15 N incorporation formed a matrix A with NNA +1 rows and NPX columns.We can relate the 15 N incorporation distribution and the cross-extracted ion intensities with the following linear equation in which f i is the 15 N incorporation distribution of the peptide at the incorporation time point of i (=1, 2, …, NIT) and is a ( NNA +1) 1 vector whose elements indicate the contribution of a particular 15 N incorporated elemental composition to the observed ion intensity; p i is an NPX1vector of the cross-extracted (experimental) ion intensities at the incorporation time point of i.We solve the equation for f i by use of an non-negative linear least square (NNLS) algorithm (9).genCurve.This module implements an effective three-stage noise elimination procedure for each peptide ion.The genCurve module also converts the 15 N distribution matrix into a time vs 15 N relative fraction curve.Although the survey scans were acquired in high resolution, the cross-extracted 15 N incorporated ions might still be contaminated with a significant number of peaks that do not belong.The range of these ions could span tens of Daltons.The average number of possible 15 N incorporations for the peptides within our study was 18.6.The most common source of interference was due to oxidation of the same peptide (mainly on methionine residues).The oxidized form is 15.9949146221 Da heavier than molecular weight of the peptide.The same un-oxidized peptide with 16 15 N incorporation is 15.9525582912Da from the monoisotopic mass of the no incorporated peptide.The mass difference between the 16 15 N ion and the oxidized ion is merely 42milliDa.The contamination of cross-extracted ions becomes unwanted "noise" in the 15 N incorporation distribution and must be minimized.In order to reduce the interference, three levels of noise removal procedures were carried out.First, distribution values (except zero incorporation number) at later incorporation time points which do not increase in mass since t=0 were recognized as contamination and removed.Second, we construct a boundary between the signal and noise regions in the 15 N distribution matrix.
The boundary pointer starts at the last incorporation time and the highest incorporation peak location, moving towards the opposite corner of the 15 N distribution matrix.
Recognized that the 15 N distribution can only move to the smaller incorporation numbers with decreases in incorporation times, the boundary detection pointer starts at the previous 15 N incorporation number and checks the peak intensities at the current incorporation time point.It will move to a lower 15 N incorporation number if the intensity is less than 0.1% of the maximum.To accommodate the signal fluctuation, the pointer is allowed to move two incorporation numbers higher if the intensity is greater than the 0.1% of the maximum.The boundary is the trajectory of the pointer moving from the most incorporated time points and the highest incorporation number toward the zero incorporation time point and the zero incorporation number (an example is given in factor normalizes the maximal possible value of RIA to one.A peptide relative isotope abundance (RIA) curve is the function of the relative intensity of the newly synthesized peptide, RIA against the incorporation time.
In the genCurv module, we also calculate the relative number of 15 N atoms in the peptide Pep2Prot.Since the relative abundance of the newly synthesized peptide defined in Eq.2 does not depend on the number of exchangeable nitrogen atoms in a peptide, we can aggregate or combine the peptide relative abundance curves into a protein curve.It is common that the components in a protein complex share some peptides of the same sequences.The shared peptides were removed before the aggregation process.delay.In a recent dynamic SILAC experiment (12), the natural amino acids were used to "chase" the labeled amino acids.A single exponential function was used to fit the experimental decay curves.The critical difference between our method and that of (12) is that there is a delay in delivery of 15 N-labeled amino acids to cells in mammalian tissue.
We consider that it is appropriate to include a time delay parameter into the fitting function.The enrichment (relative abundance) of our newly synthesized proteins or peptides was model with The simple functional form of Equation 4 is able to capture the essential features of our peptide and protein turnover curves.However, some protein curves, obtained from the aggregation of many peptide curves, have detailed features that are not represented by this simple function.More sophisticated compartment modeling (10, 13) requires fitting of the curves with poly-exponential functions.The number of exponentials used corresponds to the number of compartments considered in the model.This will be the subject of our future investigations.

Figure 4 (
Figure 4(b)).The two dimensional space of incorporation time and incorporation number