A Catalogue of Human Saliva Proteins Identified by Free Flow Electrophoresis-based Peptide Separation and Tandem Mass Spectrometry*S

Human saliva has great potential for clinical disease diagnostics. Constructing a comprehensive catalogue of saliva proteins using proteomic approaches is a necessary first step to identifying potential protein biomarkers of disease. However, because of the challenge presented in cataloguing saliva proteins with widely varying abundance, new proteomic approaches are needed. To this end, we used a newly developed approach coupling peptide separation using free flow electrophoresis with linear ion trap tandem mass spectrometry to identify proteins in whole human saliva. We identified 437 proteins with high confidence (false positive rate below 1%), producing the largest catalogue of proteins from a single saliva sample to date and providing new information on the composition and potential diagnostic utility of this fluid. The statistically validated, transparently presented, and annotated dataset provides a model for presenting large scale proteomic data of this type, which should facilitate better dissemination and easier comparisons of proteomic datasets from future studies in saliva.

Therefore, to obtain a more comprehensive catalogue of saliva proteins, innovative proteomic approaches are needed.
Recently we described a new approach (10) to proteomic analysis that uses preparative IEF by free flow electrophoresis (FFE) (11,12) for a first dimension fractionation of complex peptide mixtures. The use of FFE not only provides a high resolution peptide separation but also adds a constraint of peptide pI information to the determination of peptide sequence matches in the sequence database search of the MS/MS data, significantly improving the confidence of the peptide sequence matches and effectively increasing the number of high confidence protein identifications (10,13-15).
The goal of this study was to use peptide separation by FFE coupled with a linear ion trap mass spectrometer to comprehensively identify proteins in whole human saliva. We identified 437 proteins with high confidence, providing the largest catalogue of proteins from a single saliva sample to date. The protein catalogue provides new information on the composition of this bodily fluid and its potential utility in disease diagnostics. The statistically validated and transparently presented dataset (shown in the supplemental table) provides a model for presenting large, mass spectrometry-based proteomic data that should provide improved dissemination and comparison of datasets in this clinically important biological fluid.

EXPERIMENTAL PROCEDURES
Clinical Saliva Collection and Protein Preparation-Whole unstimulated saliva was collected from a healthy female subject in the University of Minnesota Oral Medicine Clinic using a protocol described previously (16). 1 ml of whole saliva was removed and centrifuged at 25,000 × g and 4°C for 30 min. The supernatant was collected and quantified by using the BCA protein assay (Pierce), giving 1.05 mg of total soluble proteins. The saliva was brought to 100 mM with HEPES, pH 8.0 and 5 mM with Tris(2-carboxyethyl)phosphine and incubated overnight with 20 μg of trypsin (Promega, Madison, WI) at 37°C. The resulting peptides were concentrated and desalted using a reverse-phase Sep-Pak cartridge (Waters, Milford, MA) and dried by vacuum centrifugation.

FFE Fractionation of Peptides and Sample Processing-Preparative IEF of the peptide mixture was performed using a commercially available Pro Team free flow electrophoresis system (BD Biosciences) (11,12). The saliva peptides were dissolved in 250 μl of FFE separation buffer and fractionated by FFE into a 96-well microtiter plate as described previously (10). Immediately after FFE separation, the pH of each FFE fraction was measured using a microelectrode (Accumet combination microelectrode, Fisher). A 50-μl aliquot (of ~500 μl total) was taken from each of the microtiter plate wells and processed as described previously (10) prior to mass spectrometric analysis.
LC-ESI MS/MS Analysis-All LC separations were done on an automated Paradigm MS4 system (Michrom Bioresources, Inc., Auburn, CA). Each processed FFE fraction was automatically loaded across a Paradigm Platinum Peptide Nanotrap (Michrom Bioresources, Inc.) precolumn (0.15 × 50 mm, 400-l volume) for sample concentrating and desalting at a flow rate of 50 μl/min in HPLC buffer A. The in-line analytical capillary column (75 μm × 12 cm) was home-packed using C18 resin (5-μm, 200-Å Magic C18AG, Michrom Bioresources, Inc.) and Picofrit capillary tubing (New Objective, Cambridge, MA). Peptides were eluted using a linear gradient of 10-35% buffer B over 60 min followed by isocratic elution at 80% buffer B for 5 min with a flow rate of 0.25 μl/min across the column.
Peptides were analyzed by MS/MS using a linear ion trap mass spectrometer system (LTQ, Thermo Electron Corp., San Jose, CA). The electrospray voltage was set to 2.0 kV, the collision energy setting was 29%, and a data-dependent procedure alternated between one MS scan (over the m/z range of 400-1800) and four MS/MS scans of the four most abundant precursor ions in the MS survey scan. Both the MS and MS/MS spectra were acquired using a single microscan with a maximum fill time of 50 ms in the ion trap. m/z values selected for MS/MS were dynamically excluded for 30 s.
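The data-dependent procedure described above can be illustrated with a minimal sketch. It is not the instrument's acquisition software: the function name, the input layout, and the m/z matching tolerance used for dynamic exclusion (`mz_tol`) are assumptions made for illustration only. The scan range (m/z 400-1800), the four precursors per survey scan, and the 30-s exclusion time are taken from the text.

```python
def select_precursors(survey_scans, top_n=4, exclusion_s=30.0, mz_tol=0.5):
    """Pick up to top_n precursors per survey scan with dynamic exclusion.

    survey_scans: list of (time_s, [(mz, intensity), ...]) in time order.
    mz_tol is a hypothetical exclusion window; the source does not state one.
    """
    excluded = []    # (mz, expiry_time_s) pairs currently under exclusion
    selections = []
    for time_s, peaks in survey_scans:
        # Forget exclusion entries whose 30-s window has elapsed.
        excluded = [(mz, exp) for mz, exp in excluded if exp > time_s]
        chosen = []
        # Walk the peaks from most to least intense.
        for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
            if len(chosen) == top_n:
                break
            if not (400.0 <= mz <= 1800.0):
                continue  # outside the survey scan range
            if any(abs(mz - ex_mz) <= mz_tol for ex_mz, _ in excluded):
                continue  # recently fragmented, still excluded
            chosen.append(mz)
        # Everything selected now is excluded for the next exclusion_s seconds.
        excluded.extend((mz, time_s + exclusion_s) for mz in chosen)
        selections.append((time_s, chosen))
    return selections
```

Dynamic exclusion is what lets lower abundance ions be selected in later cycles: once the most intense precursors have been fragmented, they are skipped until their exclusion window expires.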
Sequence Database Searching and Peptide Sequence Match Filtering-The MS/MS spectra were sequence database-searched using TurboSEQUEST (17) (Thermo Finnigan, San Jose, CA). The MS/MS spectra were searched against the non-redundant human International Protein Index database (18) containing ~50,000 protein sequences with a reverse version of the same database attached at the end of the forward version. The search parameters used included a precursor ion mass accuracy tolerance of 2.0 with methionine oxidation specified as a differential modification. Tryptic cleavage sites were specified as described below. The peptide sequence match results were organized and viewed using the software tool Interact (19). False positive rates were calculated as described previously (10,20). The predicted pI of peptide sequences was calculated according to Shimura et al. (21) using an automated script, and peptide pI values were automatically inputted into the Interact results file. For FFE fractions in the pH range of 6.5-8.0 (fraction numbers 46-58), the average peptide pI value was used rather than the measured fraction pH for filtering peptide sequence matches in steps one and two (see "Results"). The MS/MS spectra were first searched against the database with the enzyme trypsin specified, allowing up to two missed cleavage sites in the peptide sequence match. To identify non-tryptic peptides derived from proline-rich proteins, as have been found in other proteomic studies of saliva (7,8), the MS/MS data were also searched with no enzyme specified, and the peptide matches were filtered by peptide pI and FFE fraction pH. This resulted in the identification of eight additional proteins, which were added to the protein results from the first filtering step described under "Results."
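As an illustration of how a reversed database can be used to estimate a false positive rate, the sketch below applies one common estimator: twice the number of reverse-database matches divided by the total number of matches passing the filter. This formula is an assumption here, because the text only refers to earlier work (10,20) for the calculation. The `REV_` accession prefix and the function name are likewise hypothetical.

```python
def estimate_false_positive_rate(accessions, decoy_prefix="REV_"):
    """Estimate the false positive rate of a filtered match list.

    accessions: one protein accession per peptide sequence match that
    passed the filter. Matches to the reversed database are assumed to
    carry decoy_prefix (a hypothetical naming convention).
    """
    n_total = len(accessions)
    if n_total == 0:
        return 0.0
    n_decoy = sum(1 for acc in accessions if acc.startswith(decoy_prefix))
    # Assumed estimator: random matches hit the forward and reverse halves
    # equally often, so each reverse hit implies one forward false positive.
    return 2.0 * n_decoy / n_total
```

With this estimator, 2 reverse matches among 400 passing matches would correspond to an estimated false positive rate of 1%.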

RESULTS
Our approach yielded a wealth of peptide sequence matches requiring filtering and statistical validation. To filter the sequence matches based upon peptide pI, it is first necessary to confirm the correspondence of peptide pI and measured FFE fraction pH for the dataset being analyzed (10). To this end, the sequence matches were first filtered using Peptide Prophet (22), which assigns to each peptide sequence match a probability (p) score between 0 and 1. The peptide sequence matches were initially filtered using a stringent p score threshold of 0.9. Next the theoretical pI for each matched peptide sequence was calculated (21), and the average peptide pI for each FFE fraction was determined. Fig. 1A shows the results of these calculations. The top two lines in the plot show the correspondence of the average peptide pI versus the measured pH value for each FFE fraction. Overall the close correspondence justifies the use of FFE fraction pH, in addition to p score, as a filtering criterion of peptide sequence matches for this catalogue as we describe below. There is some discrepancy between the pI and pH values in the pH range ~6.5-8.0. The reason for this discrepancy is unknown and needs further investigation, although it may reflect an inaccuracy in the pI prediction algorithm as it has been observed regardless of the method used for IEF of peptide mixtures (10,13,14). The bottom line in the plot shows the distribution of matched peptide sequences across the FFE fractions. The majority of the peptides cluster in the pH range 3.5-5.0 with very few peptides detected in fractions with neutral pH values (pH ~7-8), similar to the distribution of tryptic peptides in other studies using preparative IEF (10,13,14).
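The per-fraction averaging described above can be sketched as follows. The pI calculation here is a generic Henderson-Hasselbalch charge model solved by bisection, not the method of Shimura et al. (21), and the pK values are illustrative assumptions rather than the values used in this study. All function and variable names are hypothetical.

```python
from collections import defaultdict

# Illustrative pK values (assumptions, not those of Shimura et al.).
POSITIVE_PK = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PK, C_TERM_PK = 8.6, 3.6

def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH under the simple model."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PK))    # protonated N terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PK - ph))   # deprotonated C terminus
    for residue in sequence:
        if residue in POSITIVE_PK:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PK[residue]))
        elif residue in NEGATIVE_PK:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PK[residue] - ph))
    return charge

def predicted_pi(sequence, tol=1e-4):
    """Find the pH of zero net charge by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        # Net charge falls monotonically as pH rises.
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

def average_pi_by_fraction(matches):
    """matches: iterable of (fraction_number, peptide_sequence) pairs."""
    by_fraction = defaultdict(list)
    for fraction, sequence in matches:
        by_fraction[fraction].append(predicted_pi(sequence))
    return {f: sum(values) / len(values) for f, values in by_fraction.items()}
```

Plotting the values returned by `average_pi_by_fraction` against the measured pH of each fraction would give a comparison of the kind shown in Fig. 1A.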
Our approach to generating a high confidence catalogue of proteins and their supporting peptide matches consists of two steps with each filtering matches based upon the difference (ΔpH) between the calculated peptide pI value for the matched sequence and the measured pH value of the FFE fraction from which the peptide was identified. True peptide sequence matches should have pI values very close to the measured fraction pH value, whereas false matches are expected to have random pI values and be eliminated when using the ΔpH filter (10,15). The first step initially filtered the peptide sequence matches using a ΔpH tolerance of ±0.5, which we have shown to be the optimal ΔpH tolerance based upon the IEF resolution using FFE (10). This filtering step allows for the p score threshold to be reduced while still maintaining a false positive rate below 1% (10). The optimal p score threshold using ΔpH filtering will be different for each dataset being analyzed. As Fig. 1B shows, for this particular dataset the p score could be reduced to 0.76 when applying the ΔpH filter, decreased from the p score threshold of 0.96 needed to achieve the same confidence without considering peptide pI. The second step filtered the peptide sequence matches using a low stringency p score threshold of 0.2 and peptide pI, again using a ΔpH value of ±0.5, with the added proviso that a protein would be added to the catalogue only if it was matched by two or more unique peptide sequences. This step is based upon the assumption that when combined with the peptide pI constraint, multiple peptide sequence matches provide added confidence to protein identification even when the matches have a low p score. Indeed using these criteria the calculated false positive rate for this filtering step was also below 1%.
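The two filtering steps can be summarized in a short sketch. The thresholds (p of 0.76 and 0.2, ΔpH of ±0.5, two or more unique peptides) come from the text; the record layout, the function name, and the use of an inclusive comparison for the 0.2 threshold are assumptions made for illustration. For fractions 46-58 the caller would supply the average peptide pI in place of the measured pH, as described under "Experimental Procedures."

```python
from collections import defaultdict

def two_step_filter(matches, fraction_ph, step1_p=0.76, step2_p=0.2,
                    delta_ph=0.5):
    """Return the set of proteins passing either filtering step.

    matches: list of dicts with hypothetical keys "protein", "peptide",
             "p" (p score), "pi" (calculated pI), and "fraction".
    fraction_ph: dict mapping fraction number to its reference pH.
    """
    # Both steps keep only matches whose pI lies within the ΔpH tolerance.
    within = [m for m in matches
              if abs(m["pi"] - fraction_ph[m["fraction"]]) <= delta_ph]

    # Step 1: a single match at or above the higher p score is sufficient.
    catalogue = {m["protein"] for m in within if m["p"] >= step1_p}

    # Step 2: low stringency p score, but at least two unique peptides.
    peptides = defaultdict(set)
    for m in within:
        if m["p"] >= step2_p:
            peptides[m["protein"]].add(m["peptide"])
    catalogue |= {prot for prot, peps in peptides.items() if len(peps) >= 2}
    return catalogue
```

A protein supported only by repeated matches to the same peptide sequence does not pass the second step, because the peptides are counted as a set of unique sequences.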
Each filtering step added to the catalogue. The first step identified 433 proteins from peptide matches with a p score at or above the 0.76 threshold; each was added to the catalogue. 181 of these proteins had at least two peptide sequence matches, and the remainder had one peptide match. The second step identified and added to the catalogue another four proteins. At least one additional peptide sequence match was also added to 101 proteins (as indicated in the supplemental table) already in the catalogue, increasing the proteins identified by two or more peptide sequence matches to 221 of 437 total proteins. The supplemental table provides detailed information on this dataset, including all peptide sequence matches and the known biochemical functions and localizations of the identified proteins.

DISCUSSION
The use of peptide pI maximized the number of high confidence proteins identified in this study. Using p score filtering alone, without the use of peptide pI information, the minimum p score threshold is 0.96 to obtain a false positive rate below 1% (see Fig. 1B). Such a threshold would have resulted in the identification of only 385 proteins. The use of peptide pI and FFE fraction pH in our two filtering steps allowed for a decrease in the p score threshold, thereby producing a significantly larger catalogue of high confidence proteins. With p score filtering alone, these additional peptide sequence matches would be false negatives, that is, sequence matches that are actually correct but do not pass the set scoring threshold (10,15). The combined filtering steps using peptide pI and FFE fraction pH also increased the sequence coverage of identified proteins with about half of the catalogued proteins having two or more peptide sequence matches.
FIG. 1. A, plot of calculated peptide pI and measured pH versus FFE fraction number and distribution of identified peptides. B, effect of p score and ΔpH filtering on the false positive peptide sequence match rate. The inset plot indicates the p score values used to achieve a false positive rate of 1% or below using no ΔpH filtering (p = 0.96) and using peptide pI information with a ΔpH tolerance of ±0.5, which decreased the p score to 0.76.
Our approach identified 437 proteins with high confidence (false positive rate below 1%). We compared our catalogue to those from other proteomic studies of saliva attempting to comprehensively identify proteins in saliva using non-gel electrophoresis-based strategies. One recent study using multidimensional liquid chromatography and tandem mass spectrometry identified 102 proteins in whole human saliva (8). These protein matches were statistically validated using reversed database searching, providing an estimated false positive rate below 1%. Most of their catalogue's proteins are contained in ours but not vice versa. Another recent report used both liquid chromatography-based separations and also two-dimensional gel separations to identify a combined 309 proteins from saliva (7). The overlap between their catalogue and ours was relatively small with most of the common proteins between the studies being those that have also been found in other proteomic studies, most likely indicative of their high abundance and housekeeping functions in saliva. By comparison with these other studies, our catalogue of proteins is the largest obtained from a single saliva sample to date, thereby providing new information on its composition.
Comparison of other catalogues with ours highlights an ongoing problem in the proteomics community: a lack of standards in publishing mass spectrometry-derived proteomic datasets (23,24). For example, in the case of the study described in Ref. 7, the dataset was non-transparently presented with little information on the criteria for determining correct peptide sequence matches provided and no estimate of false positive rates or detailed information on the scoring of peptide sequence matches. Furthermore the protein sequence database used outputted protein accession numbers for identified proteins from a variety of proteomic and genomic databases as opposed to non-redundant sequence databases such as the International Protein Index database (18) used in our present study that provide consistent accession number formats (e.g. Uniprot) for identified proteins. Collectively these factors make comparison of these large proteomic datasets difficult. As such, we hope that the dataset of saliva proteins we present here will serve as a model for publishing large scale proteomic data to the growing number of research groups investigating this clinically important bodily fluid, helping the dissemination and comparison of proteomic datasets obtained from future studies.
Acknowledgments-We gratefully acknowledge the Mass Spectrometry and Proteomics Center at the University of Minnesota for access to the mass spectrometer used in this work. We thank Patton Fast at the Minnesota Supercomputing Institute for help in setting up and maintaining the computer cluster used for sequence database searching.
* This work was supported in part by funding from the Minnesota Medical Foundation. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.