Label-free Semiquantitative Peptide Feature Profiling of Human Breast Cancer and Breast Disease Sera via Two-dimensional Liquid Chromatography-Mass Spectrometry

A label-free semiquantitative peptide feature profiling method was developed in response to challenges associated with analysis of two-dimensional liquid chromatography-tandem mass spectrometry data. One hundred twenty human sera (49 from invasive breast carcinoma patients, 26 from non-invasive breast carcinoma patients, 35 from benign breast disease patients, and 10 from normal controls) were repeatedly analyzed using a standardized two-dimensional liquid chromatography-mass spectrometry method. Data were extracted using the novel semiquantitative peptide feature profiling method, which is based on comparisons of normalized relative ion intensities. Hierarchical cluster analyses and principal component analyses were used to evaluate the predictive capability of the extracted data, and results were promising. Extracted data were also randomly assigned to either a training group (65%) or to a test group (35%) for artificial neural network modeling. Models best identified invasive breast carcinomas (212 predictions, 94% accurate) and benign non-neoplastic breast disease (96 predictions, 81.3% accurate). These results suggest that, after further development, the novel method may be useful for large scale clinical proteomic profiling.

Serum and plasma proteomics may uncover diagnostically useful biomarkers (1). Identification of diagnostic signatures from human fluids via high resolution two-dimensional gel electrophoresis (2-D gel)¹ was first proposed more than two decades ago, but the idea was initially given little attention (2)(3)(4). Fortunately, recent advances have rekindled the interest of researchers (5)(6)(7). Although useful (8,9), the capabilities of 2-D gel are limited. Most proteins detected using this method are high abundance maintenance enzymes. Low abundance proteins, membrane proteins, and proteins with extreme isoelectric points or molecular weights are less frequently identified (10,11). Relatively low throughput capacity and poor reproducibility also limit the utility of 2-D gel. Several novel approaches have been developed recently in response to these limitations, including SELDI-TOF-MS (12,13), LC-MS/MS (14-17), and multidimensional liquid chromatography tandem mass spectrometry (18-23) (also referred to as multidimensional protein identification technology (MudPIT)). Proteomic tools have been developed so extensively that high throughput analysis of sample sets is no longer the primary concern when proteomically profiling human serum. The wide dynamic ranges of protein concentrations and the masking of low abundance proteins by high abundance proteins (1) have also been addressed through chemical depletion and fractionation techniques (15,18,24,25).

Received, November 28, 2005, and in revised form, March 9, 2006. Published, MCP Papers in Press, March 17, 2006, DOI 10.1074/mcp.M500387-MCP200.

¹ The abbreviations used are: 2-D gel, two-dimensional gel electrophoresis; SPFP, semiquantitative peptide feature profiling (based on comparisons of normalized relative ion peak intensities detected in total ion chromatograms of 2-D LC-MS analyses); 2-D, two-dimensional; PCA, principal component analysis (Assesses data set variance in terms of principal components. These components are variables that define a projection and encapsulate the maximum possible amount of variation. They are orthogonal (and therefore uncorrelated) to previous principal components of the same data set.); HCA, hierarchical clustering analysis (an agglomerative statistical method that identifies observation clusters and that groups items based on similarities); ANN, artificial neural network (A processor network composed of "units" or "neurons," each potentially having local memory. Units are connected by unidirectional communication channels ("connections") that carry numeric (as opposed to symbolic) data. The units operate based only on local data and on input received from connections. Neural networks are typically algorithms or hardware and are modeled on components of animal brains. Most neural networks have "training" rules through which connection weights are adjusted based on presented patterns. Neural networks "learn" from and generalize based on examples.); SELDI-TOF-MS, surface-enhanced laser desorption ionization time-of-flight mass spectrometry (Proteins in solution are separated on protein chip arrays coated with chromatographic or biologically reactive surfaces. Subsequent ionization and desorption via TOF-MS facilitates protein differentiation based on molecular size differences.); ESI, electrospray ionization (A process through which ionized species in the gas phase are generated from an analyte-containing solution via highly charged fine droplets. The solution is sprayed from a narrow bore needle tip at atmospheric pressure in the presence of a strong electric field.); Q-TOF, quadrupole time-of-flight (one kind of mass spectrometry that detects the ions in a time-of-flight mass selector); MudPIT, multidimensional protein identification technology; DCIS, ductal carcinoma in situ.
With substantial methodological developments achieved in other areas, computational analysis and interpretation of enormous data sets from LC-MS-based technologies (e.g. 2-D LC-MS/MS or MudPIT) have become prominent concerns (26-31). Protein identification is typically conducted immediately following 2-D LC-MS/MS analyses via protein database searches that match tandem mass spectra to peptide sequences (32)(33)(34)(35)(36). Commonly used protein databases include the National Center for Biotechnology Information non-redundant (NCBInr) database (37), Swiss-Prot (38), and International Protein Index (IPI) (39), and popular algorithms include those of SEQUEST (40) or Mascot software (41). Although these strategies simplify interpretation of large scale data and allow users to focus on identified proteins (42), they do have drawbacks. As noted by Nesvizhskii and Aebersold (43), protein inference problems inherent in current strategies require the attention of researchers (32,44,45). These problems include limitations of identification based on a single peptide, difficulties assigning exact peptide sequences to MS/MS spectra, difficulties comparing identification results acquired through different algorithms and/or databases, the absence of post-translational modification information in current protein databases, resultant difficulties in ascertaining differences between protein isoforms, difficulties integrating protein identification results and transcription data (i.e. DNA microarray data or RNA sequencing), and difficulties conducting quantitative proteomic profiling without isotope labeling.
Several approaches to overcoming these obstacles have been investigated recently. Clustering the multiple tandem mass spectra of peptides into one spectrum has been expected to improve sequence matching accuracy (46,47), and Mascot has developed tools that can effectively evaluate search results (48). Improving search protocols (49) and search engines (50), integrating suites of algorithms, creating new software (51), and utilizing label or label-free quantitative methods have all been attempted in an effort to overcome the aforementioned methodological limitations (52,53).
Recently, on the other hand, some researchers have noted the potential of peptide profiling (53)(54)(55)(56)(57). Several have attempted semiquantitative or quantitative peptide profiling of LC-MS data. Normalization of elution times gathered from the same peptide under different chromatographic conditions has been attempted (58), but this technique has yet to be perfected. One intensive peptide profiling study utilized combinations of liquid chromatography elution times and mass values as profiling signatures (58). Li et al. (59) recently developed an LC-MS-based method and adjunct data analysis software. They focused on semiquantitative profiling of low abundance peptides that are ignored in tandem scans due to intensity discrimination, and their method has been successfully used to profile 10 mouse sera.
Despite this progress, there is still a need for a method that can be applied to larger data sets for clinical profiling and that is capable of revealing clinically relevant features. We have developed a label-free semiquantitative peptide feature profiling method that may offer these capabilities. The method was evaluated through analysis of 2-D LC-MS data collected from 120 human sera of breast disease patients, breast carcinoma patients, and healthy controls.

Sample Collection
Human blood specimens were collected from volunteer patients who were enrolled in the Clinical Breast Care Project (founded by the United States Department of Defense). All patients were treated at Walter Reed Army Medical Center. Fresh blood was placed in vials marked with patient numbers and barcodes and placed on ice immediately thereafter. Plasma and serum fractions were then prepared according to the Clinical Breast Care Project standard operation protocol, which was approved by institutional review boards of Walter Reed Army Medical Center and the United States Department of Defense. Fresh blood, serum, and plasma samples were shipped in dry ice and delivered overnight to Windber Research Institute (Windber, PA). Upon arrival, specimens were immediately divided into aliquots of various volumes (between 1 ml and 15 µl). All operations were performed on ice, and all vials were labeled in advance with sample numbers and barcodes. 49 invasive breast carcinoma sera, 13 ductal carcinoma in situ (DCIS) sera, 13 atypical hyperplasia sera, 35 benign breast disease sera, and 10 healthy control sera were collected and processed randomly (shown in Table I and Supplemental Table 1S). No thawed sera were utilized.

Sample Digestion
All specimens were digested in accordance with the aforementioned standard operation protocol (60). 10-µl aliquots of each specimen (totaling ~0. [...] Number 165680050, ACROS Organics, Geel, Belgium) were then added, and the mixture was kept at room temperature for 1 h. Then 12.5 µl of 200 mM iodoacetamide (98%, Catalog Number 122270250, ACROS Organics) were added. The reaction was run at room temperature in the dark for an additional hour. 2.5 µl of 200 mM DL-1,4-dithiothreitol were again added to react with the remaining iodoacetamide. The mixture was allowed to react for an additional hour at room temperature. 300 µl of distilled water and 100 µl of 100 mM ammonium bicarbonate were then added to achieve a pH of 7.5-8.0. Finally, 10 µg/25 µl modified trypsin (Catalog Number V511A, Promega, Madison, WI) was added. Digestion was conducted at 58°C for 1 h or at 37°C overnight. After digestion, 2 µl of formic acid were added to halt the reaction. The digested sample was diluted 20-fold in preparation for MudPIT analyses. The salt step applied to the first dimensional strong cation exchange column was 10% mobile phase D (0.1% formic acid in 1 M ammonium chloride, 5% acetonitrile), with 0.1% formic acid, 5% acetonitrile as mobile phase C. This was followed by reverse phase separation on the second dimensional C18 column. Two gradients were utilized: 5-65% mobile phase B (0.1% formic acid in acetonitrile) was run for 30 min, and 65-80% mobile phase B was run for 10 min. Mobile phase A was 0.1% formic acid in water. The sample injection volume was 10 µl, the flow rate of four pumps was 200 µl/min, and the flow rate on the spray needle after the splitting T was 1 µl/min.

2-D LC-MS Analysis
The LCQ DECA XP PLUS ion trap mass spectrometer was tuned weekly with 5 pmol/µl angiotensin I to maintain an intensity level of high e+8 or low e+9. The voltage of the electrospray ion source was 3.80 kV, the capillary voltage was 37 V, and the capillary temperature was 150°C. The full automatic gain control target was ~2e+7, and the automatic gain control off ion time was 5 ms. Multiplier voltage of the detector was 850 V. All data collection was performed using the updated tune file (61).

Data Analyses
Data Export-All .raw data initially generated from 2-D LC-MS were first transformed into .txt files via Rawfile Version 1 (in-house software). This software can index raw data individually or in groups, and data files (averaging ~150 megabytes) can be transformed within 1 min. The resultant text files are peak lists containing three columns: mass scan number, m/z value, and ion peak intensity (shown in Supplemental Table 2S).
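The three-column peak list described above can be read with a few lines of code. The following is a minimal sketch only: the whitespace delimiter, the optional header row, and the names `Peak` and `parse_peak_list` are assumptions made for illustration, since the exact layout written by the in-house Rawfile software is not specified here.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    scan: int         # mass scan number
    mz: float         # m/z value
    intensity: float  # ion peak intensity


def parse_peak_list(text: str) -> List[Peak]:
    """Parse a whitespace-delimited, three-column peak list (assumed layout)."""
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            peaks.append(Peak(int(fields[0]), float(fields[1]), float(fields[2])))
        except ValueError:
            continue  # skip a non-numeric header row
    return peaks


# Hypothetical example content, not taken from Supplemental Table 2S.
example = """scan mz intensity
1 300.2 1500
1 300.4 4200
2 451.7 900
"""
peaks = parse_peak_list(example)
```

Reading the file into simple tuples keeps the later summarization and normalization steps independent of the export format.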
Hierarchical Clustering Analysis (HCA)-As a form of cluster analysis, HCA groups similar items. This method is used for smaller sample sets (typically fewer than 250) (62) and was therefore well suited for analysis of 120 samples (Supplemental Table 1S). Spotfire 8.0 (Spotfire, Somerville, MA) was used to conduct HCA of 300 raw data sets from the 120 sera (Fig. 1). The exported data were first summarized using a mass interval of 0.5 amu, and only the highest peak within each mass unit was selected for further normalization. Then, taking the highest peak in the entire run as 100%, all other peaks in the same run were normalized correspondingly. Finally, HCA was conducted on the summarized and normalized data set. Results of HCA were presented in a dendrogram, which represents the similarity of two samples by the distance between two columns (the smaller the distance, the more similar the samples) (62).
Principal Component Analysis (PCA)-PCA, another approach to revealing clusters, describes variance in terms of principal components. Each component is orthogonal to (and therefore uncorrelated with) the previous principal components of the same data set. The n-dimensional data set was transformed into a three-dimensional data set in preparation for PCA (63). These transformations reduce the original data to its most important dimensions, filter out noise, and facilitate more distinct cluster formation.
PCA was conducted using Clementine 8.0 (SPSS, Inc., Chicago, IL). The exported data were first summarized using a mass interval of 1.0 amu, and only the highest peak within each mass unit was selected for further normalization. Then, for normalization, the average peak intensity of each run was set to 1000, and all peak intensities in that run were adjusted correspondingly. Finally, PCA was conducted on the summarized and normalized data set. Because input nodes were limited to 650, peptides detected in 57 min of 2-D LC-MS analysis were separated into two m/z value-based groups. The m/z range of the first group was 300-950, and that of the second group was 950-2000. A three-dimensional diagram of the results was created in which single samples are represented as individual spots (Fig. 2).
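The summarization and base-peak normalization used before HCA can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: bins are taken as half-open intervals of 0.5 amu starting at multiples of the bin width, the function names `summarize` and `normalize_to_base_peak` are hypothetical, and the example intensities are invented.

```python
import math
from typing import Dict, Iterable, Tuple


def summarize(peaks: Iterable[Tuple[float, float]],
              bin_width: float = 0.5) -> Dict[float, float]:
    """Keep only the highest intensity in each m/z bin (assumed binning rule)."""
    binned: Dict[float, float] = {}
    for mz, intensity in peaks:
        key = math.floor(mz / bin_width) * bin_width  # lower edge of the bin
        if intensity > binned.get(key, 0.0):
            binned[key] = intensity
    return binned


def normalize_to_base_peak(binned: Dict[float, float]) -> Dict[float, float]:
    """Scale a non-empty run so that its highest peak equals 100%."""
    top = max(binned.values())
    return {mz: 100.0 * value / top for mz, value in binned.items()}


# Hypothetical (m/z, intensity) pairs pooled over one run.
run = [(300.2, 1500.0), (300.4, 4200.0), (300.6, 2100.0), (451.7, 900.0)]
profile = normalize_to_base_peak(summarize(run))
```

Each run then becomes a vector of relative intensities indexed by m/z bin, which is the form a clustering tool needs in order to compare columns.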
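The mean-based normalization and the m/z grouping described above can be illustrated with a short sketch. The helper names, the treatment of the shared boundary at m/z 950 (assigned here to the second group), and the example values are assumptions for illustration only.

```python
from typing import Dict, Tuple


def normalize_to_mean(binned: Dict[float, float],
                      target_mean: float = 1000.0) -> Dict[float, float]:
    """Rescale a non-empty run so that its average peak intensity is target_mean."""
    mean = sum(binned.values()) / len(binned)
    scale = target_mean / mean
    return {mz: value * scale for mz, value in binned.items()}


def split_by_mz(profile: Dict[float, float], low: float = 300.0,
                boundary: float = 950.0, high: float = 2000.0
                ) -> Tuple[Dict[float, float], Dict[float, float]]:
    """Separate features into two m/z groups (boundary handling is assumed)."""
    first = {mz: v for mz, v in profile.items() if low <= mz < boundary}
    second = {mz: v for mz, v in profile.items() if boundary <= mz <= high}
    return first, second


# Hypothetical summarized run: m/z bin -> highest intensity in that bin.
summarized = {300.0: 500.0, 451.0: 1500.0, 1200.0: 4000.0}
normalized = normalize_to_mean(summarized)
group_one, group_two = split_by_mz(normalized)
```

Unlike base-peak scaling, this normalization is not dominated by a single intense ion, which may be one reason a different scheme was chosen for PCA and ANN input.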
Artificial Neural Network (ANN) Modeling-ANNs are mathematical models with connection geometry analogous to neurons (64). These tools identify arbitrary nonlinear multiparametric functions from experimental data (64). Through trial and error and by using different connection weight combinations, ANNs are "trained" to recognize complex relationships between input and output. These tools are used for diagnosis and prognosis (65,66), for pattern recognition (67), for compound detection (68), and for biological functioning assessment (69). ANNs are currently the premier bioinformatic modeling tool because of their applicability to complex relationships and mechanisms (70).
ANN modeling was conducted using Clementine 8.0 (SPSS, Inc.). The same summarized and normalized data set used for PCA was also assessed using ANNs. ANN modeling generates predictions in a manner similar to the human brain. The procedure involves entering sample data (input layers), artificial reasoning (hidden layers), and relating samples to pathological categories (output layers). Based on signal-to-noise ratios, the top 20 peptide features and corresponding normalized relative intensities were selected from each pathological category and designated as input nodes, and six disease stages (invasive breast carcinoma, ductal carcinoma in situ, atypical hyperplasia, benign neoplastic disease, benign non-neoplastic disease, and normal) were designated as output nodes. Hidden layers are not related to biology or to data; they are simply paths that the computer uses to "think." One hundred twenty samples were randomly divided into two groups: a training group (~65%) and a test group (~35%). Training was randomly performed 12 times (Fig. 3), yielding 12 models. Corresponding input layers, hidden layers, and output layers are listed in Table II.
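The repeated random division into training and test groups can be sketched as below. This is a hedged illustration rather than the Clementine procedure: the function name `split_samples`, the sample identifiers, the seeds, and the fixed 65% cut are assumptions, and the source states only approximate fractions, so actual group sizes may have varied between runs.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_samples(sample_ids: Sequence[str], train_fraction: float = 0.65,
                  seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Randomly partition samples into a training group and a test group."""
    rng = random.Random(seed)        # a seeded generator makes a run repeatable
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for the 120 sera; one split per model, 12 models.
samples = [f"S{i:03d}" for i in range(1, 121)]
splits = [split_samples(samples, seed=run) for run in range(12)]
```

Holding out a test group in every run is what allows each of the 12 models to be scored on samples it never saw during training.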

RESULTS
To test the predictive value of the data extracted via the semiquantitative peptide feature profiling (SPFP) method and the reproducibility and stability of the 2-D LC-MS platform, HCA was first conducted on 300 raw data sets of 120 sera (Fig. 1). Columns represent samples, and each pathological stage is designated by a different color. HCA results clearly reveal delineations between different disease stages. Most samples of the same pathological stage were clustered, and the distribution of stage groups in the dendrogram was reasonable.
The left end of the dendrogram was characterized by invasive samples. Middle and right dendrogram areas were characterized by in situ or atypical samples and benign or normal samples, respectively. In addition, data from multiple runs of the same sample were always clustered first, and this supported the validity of data extracted by the SPFP method as well as the reproducibility and stability of our 2-D LC-MS platform.
PCA was conducted to independently confirm HCA results. Fig. 2 displays PCA results. Each sample is represented by a single spot, and samples from different pathological stages are differently colored. The invasive samples are clustered and are delineated both from non-invasive samples and from benign and normal samples. PCA clearly confirmed HCA results, further supporting the predictive capability of SPFP-extracted data.
The overall predictive accuracy of 12 random ANN models for 501 tests was 77.8% (Table III), and the accuracy of individual models ranged from 70.6 to 84.8% (Table IV).

DISCUSSION
Development of SPFP Method-HPLC has been used for many years to quantitatively analyze proteins, peptides, and many types of metabolic molecules (71)(72)(73)(74). Combined with modern mass spectrometry, 2-D LC-MS produces quality data that contain rich quantitative information. Some researchers, however, have noted that it is difficult to conduct quantitative analyses without isotopic labeling. This perception may be a response to the incompleteness of all peptide tandem mass spectrometry elution peaks. Tandem mass spectrometry involves a standardized sequence of events: a full scan followed by a tandem scan. Because tandem scans focus on the collision of a single target peptide, elution behavior of all other peptides cannot be recorded simultaneously. It is, of course, difficult to conduct quantitative proteomic analyses without complete peak detection.
This argument is reasonable although not entirely correct. Two considerations are noteworthy: time and quantitative analysis criteria. Tandem mass spectrometry scan times are generally 200 ms or less for full and tandem scans. Average peptide elution times, on the other hand, are ~0.5 min for 2-D LC separation. The ratio of scan times to elution times is ~1:150. Although some elution points cannot be recorded when tandem scan events occur, the elution peaks of most peptides can be ascertained based on information from multiple full scans.
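The quoted ratio follows directly from the two durations once both are expressed in milliseconds. The symbols below are introduced here only for illustration, and the 200 ms figure is the upper bound given in the text.

```latex
\[
\frac{t_{\mathrm{scan}}}{t_{\mathrm{elution}}}
  \approx \frac{200\ \mathrm{ms}}{0.5\ \mathrm{min} \times 60\,000\ \mathrm{ms/min}}
  = \frac{200}{30\,000}
  = \frac{1}{150}
\]
```

In other words, a single scan event occupies well under 1% of a typical elution peak, so an occasional tandem scan removes only a small share of the points that define the peak.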
Using criteria other than peak area for quantitative analyses may, moreover, facilitate proteomic profiling without isotopic labeling. Peak asymmetry (a ratio of two half-peak widths) and peak height (the maximum height detected in an entire elution peak) may also be analyzed fruitfully. Peak asymmetry is usually used for purity testing. Peak height, like peak area, is commonly used in quantitative analyses. Peak area is the primary focus of most quantitative analyses because it provides more information than peak height. However, incomplete peptide elution peaks and the potential overlay of elution peaks with similar retention times and mass values both may render peak area less practical than peak height. Therefore, an SPFP method was developed to explore this and other methodological possibilities. By comparing the normalized peak intensities of the summarized peptide features extracted from the 2-D LC-MS results of various samples, we hope to mine significant and pathologically relevant peptide features from the enormous data set. The proposed signal processing approach appears to work well, suggesting it may be helpful for routine comparison of complex mixtures for the purpose of differential expression analysis and biomarker detection.
Performance Inconsistency of ANN Modeling-Poor pathological characterization of the two non-invasive breast carcinomas is attributable to two factors. The first factor is disease stage distribution. Utilizing 120 samples and 300 extracted raw data sets, the present investigation is larger than any previous 2-D LC-MS breast cancer serum profiling study. However, the sample set was not well balanced. As noted, the set included 49 invasive breast carcinoma sera, 13 DCIS sera, 13 atypical hyperplasia sera, 24 benign non-neoplastic sera, 11 benign neoplastic sera, and 10 normal control sera. Although clinical sample collection provided reliably diagnosed sera, this procedure also yielded a suboptimal pathological distribution. "Cut-one-off" modeling may have facilitated accurate pathological characterization despite this imbalance, but a more stringent ANN modeling approach was utilized to prevent "overfitting." With less data available for ANN training, it is unsurprising that modeling of underrepresented disease stages yielded less accurate models (Table IV).
The second reason for poor pathological characterization of non-invasive breast carcinomas is related to their distinctiveness. Differences between samples at opposite pathological extremes are, of course, better defined than differences between samples that are pathologically similar. Like ANN modeling, PCA and HCA clearly delineated invasive and normal/benign non-neoplastic groups. However, they failed to delineate other stages with comparable precision. This also suggests that the extracted data from DCIS and atypical hyperplasia samples may not be representative of these pathologies.
Similar to the recently published peptide array method (59), SPFP is designed to computationally identify biomarker signals via analysis of raw 2-D LC-MS data. Proposed identification methods all involve raw data exportation, summarization, normalization, extraction of features (e.g. peak apex, m/z value, or retention time), and alignment of corresponding features from different samples. Compared with the peptide array method (59), SPFP has two limitations. First, SPFP bypasses the monoisotopic test. This step was forgone because, unlike the ESI-Q-TOF mass spectrometers used in previous studies, the LCQ ion trap mass spectrometer had a mass resolution (approximately 0.5 amu) that was insufficient for detection of doubly and triply charged ions in a full mass scan. Second, previous approaches have used software to fix retention time shifts, and the adjusted retention times were then used as the third criterion for data alignment of peptide arrays. SPFP bypasses the retention time in initial data alignment, focusing primarily on m/z values and peak intensities. The disadvantage of this bypass is that different ions with similar m/z values may occasionally be identified as a single profiling target.
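The trade-off of aligning on m/z alone can be made concrete with a small sketch. It is an assumption-laden illustration, not the SPFP code: the function name `align_by_mz`, the 1.0 amu bin width, and the peak values are hypothetical, and retention time is carried in the input only to show that it is ignored.

```python
import math
from typing import Dict, List, Tuple

# (m/z, retention time in min, intensity); retention time is deliberately unused.
PeakRecord = Tuple[float, float, float]


def align_by_mz(samples: Dict[str, List[PeakRecord]],
                bin_width: float = 1.0) -> Dict[float, Dict[str, float]]:
    """Build an m/z-bin by sample table, keeping the highest intensity per cell."""
    table: Dict[float, Dict[str, float]] = {}
    for sample_id, peaks in samples.items():
        for mz, _retention, intensity in peaks:
            key = math.floor(mz / bin_width) * bin_width
            row = table.setdefault(key, {})
            row[sample_id] = max(row.get(sample_id, 0.0), intensity)
    return table


# Sample A has two ions near m/z 500 that elute about 20 min apart.
samples = {
    "A": [(500.2, 12.0, 800.0), (500.7, 31.5, 300.0)],
    "B": [(500.3, 12.1, 600.0)],
}
table = align_by_mz(samples)
```

Because both ions in sample A fall into the same bin, they collapse into one profiling target and only the more intense one is retained, which is the disadvantage described above; the benefit is that no retention time correction is required.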
Although data analysis methods utilized for SPFP may be less thorough, their simplicity may also be advantageous. Every data processing step introduces opportunities for error. Potential errors related to retention time are a good example. Our observations and a previous investigation (58) suggest that retention times are difficult to manually modify because the retention times of the peptide ions were inconsistent even in the same run of the same sample. In some cases, the retention time shifts had opposite signs (one positive, the other negative). Retention time is certainly an important criterion for accurate alignment, but it is also troublesome. This exemplifies how errors can arise when adjusting retention time shift. Because of its simpler data processing, the SPFP method may avoid some risk of such errors.
The approach described herein and the other previously developed approaches all have advantages and limitations. It is not yet clear which of these approaches will be useful, but it is clear that all are worthy of further exploration. The proposed SPFP signal processing approach appears to work well on a large clinical data set, suggesting it may be helpful for routine comparison of complex mixtures for the purpose of differential expression analysis and biomarker detection. Future research will focus on improving the data extraction algorithms (possibly by incorporating retention time) and on utilizing a larger, well balanced sample set to improve predictive accuracy.
Acknowledgments-We sincerely appreciate the kind support from Dr. Yonghong Zhang on data analyses. We also thank Dr. Richard Mural for his kind review and valuable feedback. We thank all colleagues at Windber Research Institute and Walter Reed Army Medical Center. We are grateful for the support of participating patients and their families.