Candidate Serological Biomarkers for Cancer Identified from the Secretomes of 23 Cancer Cell Lines and the Human Protein Atlas

Although cancer cell secretome profiling is a promising strategy used to identify potential body fluid-accessible cancer biomarkers, questions remain regarding the depth to which the cancer cell secretome can be mined and the efficiency with which researchers can select useful candidates from the growing list of identified proteins. Therefore, we analyzed the secretomes of 23 human cancer cell lines derived from 11 cancer types using one-dimensional SDS-PAGE and nano-LC-MS/MS performed on an LTQ-Orbitrap mass spectrometer to generate a more comprehensive cancer cell secretome. A total of 31,180 proteins was detected, accounting for 4,584 non-redundant proteins, with an average of 1,300 proteins identified per cell line. Using protein secretion-predictive algorithms, 55.8% of the proteins appeared to be released or shed from cells. The identified proteins were selected as potential marker candidates according to three strategies: (i) proteins apparently secreted by one cancer type but not by others (cancer type-specific marker candidates), (ii) proteins released by most cancer cell lines (pan-cancer marker candidates), and (iii) proteins putatively linked to cancer-relevant pathways. We then examined protein expression profiles in the Human Protein Atlas to identify biomarker candidates that were simultaneously detected in the secretomes and highly expressed in cancer tissues. This analysis yielded 6–137 marker candidates selective for each tumor type and 94 potential pan-cancer markers. Among these, we selectively validated monocyte differentiation antigen CD14 (for liver cancer), stromal cell-derived factor 1 (for lung cancer), and cathepsin L1 and interferon-induced 17-kDa protein (for nasopharyngeal carcinoma) as potential serological cancer markers. In summary, the proteins identified from the secretomes of 23 cancer cell lines and the Human Protein Atlas represent a focused reservoir of potential cancer biomarkers.

Cancer is a major cause of mortality worldwide, accounting for 10 million new cases and more than 6 million deaths per year. In developing countries, cancer is the second most common cause of death, accounting for 23-25% of the overall mortality rate (1). Notwithstanding improvements in diagnostic imaging technologies and medical treatments, the long term survival of most cancer patients is poor. Cancer therapy is often challenging because the majority of cancers are initially diagnosed in their advanced stages. For example, the 5-year survival rate for patients with head and neck cancer (HNC) is less than 50%, and more than 50% of all HNC patients have advanced disease at the time of diagnosis (2,3). Enormous effort has been devoted to screening and characterizing cancer markers for the early detection of cancer. Thus far, these markers include carcinoembryonic antigen, prostate-specific antigen, α-fetoprotein, CA 125, CA 15-3, and CA . Unfortunately, most biomarkers have limited specificity, sensitivity, or both (4). Thus, there is a growing consensus that marker panels, which are more sensitive and specific than individual markers, would increase the efficacy and accuracy of early stage cancer detection (4-8). The development of novel and useful biomarker panels is therefore an urgent need in the field of cancer management.
Proteomics technology platforms are promising tools for the discovery of new cancer biomarkers (9). Over the past decade, serum and plasma have been the major targets of proteomics studies aimed at identifying potential cancer biomarkers (10-13). However, the progress of these studies has been hampered by the complex nature of serum/plasma samples and the large dynamic range between the concentrations of different proteins (14). As cancer biomarkers are likely to be present in low amounts in blood samples, the direct isolation of these markers from plasma and serum samples requires a labor-intensive process involving the depletion of abundant proteins and extensive protein fractionation prior to mass spectrometric analysis (15-18). Alternatively, the secretome, or group of proteins secreted by cancer cells (19), can be analyzed to identify circulating molecules present at elevated levels in serum or plasma samples from cancer patients. These proteins have the potential to act as cancer-derived marker candidates, which are distinct from host-responsive marker candidates. We, along with other groups, have demonstrated the efficacy of secretome-based strategies in a variety of cancer types, including nasopharyngeal carcinoma (NPC) (20), breast cancer (21,22), lung cancer (23,24), colorectal cancer (CRC) (25,26), oral cancer (27), prostate cancer (28,29), ovarian cancer (30), and Hodgkin lymphoma (31). In these studies, proteins secreted from cancer cells into serum-free media were resolved by one- or two-dimensional gels followed by in-gel tryptic digestion and analysis via MALDI-TOF MS or LC-MS/MS. Alternatively, the proteins were trypsin-digested in solution and analyzed by LC-MS/MS. In general, more proteins were detected in the secretome using the LC-MS/MS method than the MALDI-TOF MS method. Advanced protein separation and identification technologies have made it possible to detect more proteins in the secretomes of cancer cells, thereby facilitating the discovery of cancer biomarkers.
Although the cancer cell secretomes of various tumor types have been individually analyzed by different groups using distinct protocols, few studies have used the same protocol to compare cancer cell secretomes derived from different tumor types. We previously assessed the secretomes of 21 cancer cell lines derived from 12 cancer types (i.e. consisting of 795 protein identities and 325 non-redundant proteins) by one-dimensional gel and MALDI-TOF MS (25). Our preliminary findings revealed that different cell lines have distinct secreted protein profiles and that several putative biomarkers, such as Mac-2BP (20,26,27,29) and cathepsin D (21,23,32), present in the secretome of a given cancer cell type are commonly shared among different cancers. These observations suggest that an in-depth comparison of secretomes derived from different tumor types may identify marker candidates common to most cancers as well as markers for specific cancer types. As an increasing number of proteins are identified in the secretomes of various cancer cell lines, scientists are faced with the challenge of quickly and efficiently narrowing down the list to candidates with higher chances of success during validation testing with precious clinical specimens.
In the present study, we applied one-dimensional SDS-PAGE in conjunction with nano-LC-MS/MS (GeLC-MS/MS) (33,34) to analyze the conditioned media of 23 cancer cell lines derived from 11 cancer types, including NPC, breast cancer, bladder cancer, cervical cancer, CRC, epidermoid carcinoma, liver cancer, lung cancer, T cell lymphoma, oral cancer, and pancreatic cancer. Within this data set, 4,584 non-redundant proteins were identified from a total of 23 cell lines, yielding an average of ~1,300 proteins per cell line. Potential marker candidates were identified via the comparative analysis of different cell line secretomes and by putative linkages to cancer-relevant pathways. The selected proteins were further compared with the Human Protein Atlas (HPA) (35) to generate a focused data set of proteins that are secreted or released, cancer type-specific, and highly expressed in human cancer tissues. Finally, we selectively validated four proteins as potential serological cancer markers using blood samples from cancer patients.
Preparation of Conditioned Media and Cell Extracts-Conditioned media from the various cancer cell lines were collected and processed as described previously (25). Briefly, cancer cells were grown to confluence in 15-cm tissue culture dishes. The cells were washed with serum-free media and incubated in serum-free media for 24 h. The supernatants were then harvested and centrifuged to remove the intact cells followed by centrifugation in Amicon Ultra-15 tubes (molecular mass cutoff, 5,000 Da; Millipore, Billerica, MA) to concentrate and desalt the supernatants. Cells remaining on the dishes were washed twice with PBS and lysed in hypotonic buffer (10 mM Tris, pH 7.4, 1 mM EDTA, 1 mM EGTA, 50 mM NaCl, 50 mM NaF, 20 mM Na₄P₂O₇, 1 mM Na₃VO₄, 1 mM PMSF, 1 mM benzamidine, 0.5 µg/ml leupeptin, and 1% Triton X-100) on ice for 15 min. The cell lysates were collected, sonicated on ice, and centrifuged at 10,000 × g for 25 min at 4°C. The resulting supernatants were used as cell extracts (20). The protein concentrations of the various samples were determined using the BCA protein assay reagent from Pierce.
One-dimensional SDS-PAGE and In-gel Protein Digestion-In preparation for secretome analyses, proteins (50 µg) were applied to 8-14% gradient gels for SDS-PAGE. After staining with 0.5% Coomassie Brilliant Blue G-250 (AppliChem GmbH, Darmstadt, Germany), the gel lane was cut into 70 pieces and subjected to in-gel tryptic digestion as described by Wu et al. (25). Briefly, the gel pieces were destained in 10% methanol (Mallinckrodt Baker), dehydrated in acetonitrile (Mallinckrodt Baker), and dried using a SpeedVac. The proteins were reduced with 25 mM NH₄HCO₃ containing 10 mM dithiothreitol (Biosynth AG) at 60°C for 30 min and alkylated with 55 mM iodoacetamide (Amersham Biosciences) at room temperature for 30 min. After reduction and alkylation, proteins were digested via overnight incubation with sequencing grade modified porcine trypsin (20 µg/ml) (Promega, Madison, WI) at 37°C. Peptides were extracted using acetonitrile and dried in a SpeedVac.
Reverse-phase Liquid Chromatography/Tandem Mass Spectrometry-To analyze the cancer cell secretomes, each peptide mixture was reconstituted in HPLC buffer A (0.1% formic acid; Sigma), loaded across a trap column (Zorbax 300SB-C18, 0.3 × 5 mm; Agilent Technologies, Wilmington, DE) at a flow rate of 0.2 µl/min in HPLC buffer A, and separated on a resolving 10-cm analytical C18 column (inner diameter, 75 µm) using a 15-µm tip (New Objective, Woburn, MA). The peptides were eluted using a linear gradient of 0-10% HPLC buffer B (i.e. 99.9% ACN containing 0.1% formic acid) for 3 min, 10-30% buffer B for 35 min, 30-35% buffer B for 4 min, 35-50% buffer B for 1 min, 50-95% buffer B for 1 min, and 95% buffer B for 8 min with a flow rate of 0.25 µl/min across the analytical column.
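As a quick consistency check on the elution program described above, the segment durations can be tabulated and summed. The following Python sketch is illustrative only: the names `GRADIENT`, `total_gradient_minutes`, and `percent_b_at` are hypothetical, and it assumes that the six segments run back to back with no additional equilibration time.

```python
# Gradient segments as (start %B, end %B, duration in minutes), taken from the text.
GRADIENT = [
    (0, 10, 3),
    (10, 30, 35),
    (30, 35, 4),
    (35, 50, 1),
    (50, 95, 1),
    (95, 95, 8),
]

def total_gradient_minutes(segments):
    """Sum segment durations, checking that each segment starts where the last ended."""
    for (_, prev_end, _), (start, _, _) in zip(segments, segments[1:]):
        if start != prev_end:
            raise ValueError(f"gradient is discontinuous at {prev_end}% -> {start}% B")
    return sum(duration for _, _, duration in segments)

def percent_b_at(segments, t):
    """Linearly interpolate %B at time t (minutes from the start of the gradient)."""
    elapsed = 0.0
    for start, end, duration in segments:
        if t <= elapsed + duration:
            return start + (end - start) * (t - elapsed) / duration
        elapsed += duration
    raise ValueError("time is beyond the end of the gradient")

total_minutes = total_gradient_minutes(GRADIENT)
```

Under these assumptions the program sums to 52 min per gel slice, which with 70 slices per lane corresponds to roughly 61 h of gradient time per cell line.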
The LC apparatus was coupled with a two-dimensional linear ion trap mass spectrometer (LTQ-Orbitrap, Thermo Fisher, San Jose, CA), which was operated using Xcalibur 2.0 software (Thermo Fisher). Intact peptides were detected in the Orbitrap at a resolution of 30,000. Internal calibration was performed using the ion signal of (Si(CH₃)₂O)₆H⁺ at m/z 445.120025 as a lock mass (36). We used a data-dependent procedure that alternated between one MS scan and six MS/MS scans for the six most abundant precursor ions in the MS survey scan. The m/z values selected for MS/MS were dynamically excluded for 180 s. The electrospray voltage applied was 1.8 kV. Both MS and MS/MS spectra were acquired using one microscan with a maximum fill time of 1,000 and 100 ms for MS and MS/MS analyses, respectively. Automatic gain control was used to prevent overfilling of the ion trap, and 5 × 10⁴ ions were accumulated in the ion trap for the generation of MS/MS spectra. The m/z scan range for MS scans was 350-2,000 Da.
Bioinformatics-The resulting MS/MS spectra were used to search the non-redundant International Protein Index (IPI) human sequence database Version 3.26 (released February 2007; 67,665 sequences; 28,353,548 residues) of the European Bioinformatics Institute using the SEQUEST algorithm (Thermo Fisher). Up to two missed cleavages were allowed, and searches were performed with variable oxidation of methionine residues (16 Da) and fixed modification for carbamidomethylcysteines (57 Da). A fragment ion mass tolerance of 0.5 Da and a parent ion mass tolerance of 10 ppm were used for the search engine with trypsin as the digestion enzyme. A random sequence database was used to estimate false-positive rates for peptide matches, and the false-positive rate for the peptide sequence matches using these criteria was estimated to be <1% via random database searching.
Protein identities were validated using the open source Trans-Proteomic Pipeline (TPP) software (Version 3.3). The SEQUEST search resulted in a DTA file. The raw data and DTA files containing information about identified peptides were then processed and analyzed in the TPP. The TPP software includes a peptide probability score program, PeptideProphet, that aids in the assignment of peptide MS spectra (37), as well as a ProteinProphet program that assigns and groups peptides to a unique protein or a protein family if the peptide is shared among several isoforms (38). ProteinProphet allows for the filtering of large scale data sets with assessment of predictable sensitivity and false-positive identification error rates. We used PeptideProphet and ProteinProphet probability scores ≥0.95 to ensure an overall false-positive rate below 0.5%. Furthermore, proteins with single peptide identities were excluded from this study. Information about the PeptideProphet and ProteinProphet programs can be obtained from the Seattle Proteome Center at the Institute for Systems Biology.
We used the SignalP program with hidden Markov models to predict the presence of secretory signal peptide sequences (39,40). In addition, we used the SecretomeP program to predict non-signal peptide-triggered protein secretion (41) and the TMHMM to predict transmembrane helices in proteins (42). The identified proteins were further analyzed using ProteinCenter (Proxeon Bioinformatics, Odense, Denmark), a proteomics data mining and management software, to compare cell line secretomes with each other, functionally categorize the identified proteins, and calculate the emPAI (43,44).
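For reference, the emPAI (exponentially modified protein abundance index) cited above is conventionally derived from the protein abundance index (PAI). The definition below follows the usual formulation in the emPAI literature (43); the symbols $N_{\mathrm{observed}}$ (number of experimentally observed peptides per protein) and $N_{\mathrm{observable}}$ (number of theoretically observable peptides per protein) are not spelled out in this text and are given here as the standard convention.

```latex
\mathrm{PAI} = \frac{N_{\mathrm{observed}}}{N_{\mathrm{observable}}},
\qquad
\mathrm{emPAI} = 10^{\mathrm{PAI}} - 1
```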
Hierarchical Clustering-The emPAI values of identified proteins were imported into Microsoft Excel. If a protein was identified in one cell line but not the other, half the minimum emPAI value from the data set was assigned to that protein to facilitate visualization and comparison. All values were then transformed to Z scores, a commonly used normalization method for microarray data (45). The Z scores were calculated as $Z = (X - \mu_x)/\sigma_x$, where $X$ is the individual emPAI value, $\mu_x$ is the mean of emPAI values for an identified protein across cell lines, and $\sigma_x$ is the standard deviation associated with $\mu_x$. A spreadsheet containing the Z scores was uploaded to the Partek® Genome Suite (Partek Inc., St. Louis, MO) and analyzed using a two-way hierarchical clustering algorithm according to Pearson distance and Ward's aggregation method. Cell lines and proteins were organized into mock phylogenetic trees (dendrograms) with the cell lines shown along the x axis and the proteins along the y axis.
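The imputation and Z score transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure as run in Excel: the function names and the toy values are hypothetical, missing identifications are represented as `None`, and the sample standard deviation (n - 1 denominator) is assumed because the text does not say which form was used. The clustering step itself (Pearson distance with Ward's method) is not reproduced here.

```python
from statistics import mean, stdev

def impute_half_min(matrix):
    """Replace missing emPAI values (None) with half the smallest observed value."""
    observed = [v for row in matrix for v in row if v is not None]
    fill = min(observed) / 2.0
    return [[fill if v is None else v for v in row] for row in matrix]

def z_scores_by_protein(matrix):
    """Row-wise Z = (X - mean) / SD: one row per protein, one column per cell line."""
    scored = []
    for row in matrix:
        mu, sd = mean(row), stdev(row)  # sample SD (n - 1); an assumption
        scored.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return scored

# Toy emPAI table (hypothetical values): 2 proteins x 3 cell lines.
toy = [[0.4, None, 1.2],
       [2.0, 4.0, 6.0]]
filled = impute_half_min(toy)
z = z_scores_by_protein(filled)
```

Because each protein is standardized across cell lines, the Z scores compare relative abundance patterns between cell lines rather than absolute abundance between proteins.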
Network Analysis-Proteins selected from the clustering analysis were converted into gene symbols and uploaded into MetaCore (GeneGo, St. Joseph, MI) for biological network building. MetaCore consists of curated protein interaction networks based on manually annotated and regularly updated databases. The databases describe millions of relationships between proteins according to publications on proteins and small molecules. The relationships include direct protein interactions, transcriptional regulation, binding, enzyme-substrate interactions, and other structural or functional relationships. The "shortest paths" and "analyze network" algorithms were used to map the hypothetical networks of uploaded proteins. The relevant pathway maps were then prioritized based on their statistical significance with respect to the uploaded data sets.
Patient Population and Clinical Specimens-Plasma samples were collected from 45 healthy controls (i.e. 32 men and 13 women ranging in age from 43 to 77 years; mean, 62.2 years), 44 patients with liver cancer (i.e. 32 men and 13 women ranging in age from 44 to 77 years; mean, 63.4 years), and 44 patients with lung cancer (i.e. 28 men and 16 women ranging in age from 32 to 88 years; mean, 64.7 years). Serum samples were collected from 45 healthy controls (i.e. 26 men and 19 women ranging in age from 21 to 72 years; mean, 48.2 years) and 45 NPC patients (i.e. 32 men and 13 women ranging in age from 16 to 79 years; mean, 47.2 years). All the blood samples were collected at Chang Gung Memorial Hospital. The study protocol was approved by the Medical Ethics and Human Clinical Trial Committee at Chang Gung Memorial Hospital. All patients entered in the study signed an informed consent.
Blood samples were collected from the patients preoperatively following a standardized protocol. Plasma and serum samples were prepared by collecting blood in EDTA and empty tubes, respectively, and left at room temperature (for a maximum of 30 min) until centrifugation. Plasma samples were centrifuged at 2,000 × g for 10 min at room temperature to pellet the cells. Serum samples were centrifuged at 1,500 × g for 10 min at 4°C. After centrifugation, samples were divided into 1.0-ml aliquots in sterile cryotubes and immediately frozen at −80°C for storage until ELISAs. The samples had undergone only one freeze/thaw cycle before the measurements were conducted.
Western Blot Analysis-The prepared samples (20 µg of protein) were separated by SDS-PAGE, transferred to PVDF membranes (Millipore), and probed with various antibodies (i.e. anti-fascin, anti-BIGH3, anti-PAI-1 (Santa Cruz Biotechnology), and anti-β-tubulin (MDbio, Taipei, Taiwan)) as described previously (20,27). Polyclonal antibodies specific to prosaposin were produced in rabbits using recombinant proteins, and these antibodies were affinity-purified as described in the supplemental Materials and Methods. Proteins of interest were detected with alkaline phosphatase-conjugated goat anti-rabbit IgG antibodies (Santa Cruz Biotechnology) and visualized using the CDP-Star™ chemiluminescent substrate (Roche Applied Science) according to the manufacturer's protocol.
ELISA-The concentrations of four candidate proteins were measured by ELISA in the blood samples of healthy controls and cancer patients. The concentrations of CD14 (ELISA kit from R&D Systems, Minneapolis, MN), stromal cell-derived factor 1 (SDF-1) (ELISA kit from R&D Systems), and cathepsin L1 (ELISA kit from Bender MedSystems) were measured according to the respective manufacturers' instructions. An ELISA developed in house was used to measure ISG15 as described in the supplemental Materials and Methods.
Statistical Analysis-For the analysis of ELISA results, continuous measures were summarized using means, standard deviations, medians, and interquartile ranges. Differences between controls and cancer patients in blood concentrations of CD14, SDF-1, cathepsin L1, and ISG15 were assessed using the nonparametric Mann-Whitney U test. Statistical analyses were conducted using SPSS software (Version 13.0, SPSS Inc., Chicago, IL). Two-tailed p values of 0.05 or less were considered significant.
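To make the test statistic concrete, the Mann-Whitney U statistic can be computed from pooled ranks as sketched below. This is a minimal illustration rather than the SPSS procedure used in the study: the function name `mann_whitney_u` is hypothetical, tied values receive average ranks, and no p value is computed (in practice a statistics package supplies the exact or normal-approximation p value).

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples, using average ranks for ties."""
    pooled = sorted(
        (value, group) for group, sample in enumerate((x, y)) for value in sample
    )
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2.0
    return u_x, n_x * n_y - u_x
```

A U value of 0 for one group means that every value in that group lies below every value in the other group, which is the most extreme separation the test can register.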

Secretomes of 23 Cancer Cell Lines-We previously examined the secretomes of 21 cancer cell lines using one-dimensional SDS gels and MALDI-TOF MS and identified an average of 38 secreted proteins in each cell line (25). More recently, we identified 1,096 and 1,830 proteins in the secretomes of two lung cancer cell lines (CL1-0 and CL1-5, respectively) using the GeLC-MS/MS strategy (46). To facilitate the secretome-based discovery of cancer biomarkers, we have now used the GeLC-MS/MS strategy to perform an in-depth analysis of conditioned media from 21 cancer cell lines derived from 10 cancer types, including NPC (NPC-TW02, NPC-TW04, and NPC-BM1), colon cancer (Colo205, SW480, and SW620), hepatocellular carcinoma (HCC) (SK-Hep-1, Hep G2, and Hep 3B), oral cancer (OEC-M1 and SCC-4), bladder cancer (U1 and U4), breast cancer (MCF7 and MDA-MB-435S), cervical cancer (C-33A and HeLa), pancreatic cancer (PANC-1 and MIA-PaCa-2), epidermoid carcinoma (A431), and T cell lymphoma (Jurkat). A schematic diagram of the procedure is shown in Fig. 1. Proteins (50 µg) in the conditioned media collected from cells cultured in serum-free media for 24 h were resolved by SDS-PAGE, visualized by Coomassie Blue staining, consecutively sliced into 70 pieces, digested individually with trypsin, and analyzed by LC-ESI-MS/MS on an LTQ-Orbitrap MS. The protein patterns observed in conditioned media from 21 cancer cell lines and two lung cancer cell lines are shown in Fig. 2A. As a quality control, we performed Western blotting to examine the distribution of β-tubulin, an abundant cytosolic protein, between the total cell lysates and conditioned media. β-Tubulin was clearly detected in the total cell extracts but not in the conditioned media (Fig. 2B). We also found that serum starvation for 24 h had little effect on cell viability compared with cells cultured in the presence of 10% serum (supplemental Fig. 1). These observations collectively indicate that recovery of proteins in the conditioned media was not due to cell death.
The resulting MS/MS spectra were used in a search of the non-redundant IPI human sequence database (Version 3.26) using the SEQUEST algorithm. The search results were analyzed using the open source TPP software (Version 3.3) with stringent criteria regarding peptide probability (≥0.95) and protein probability (≥0.95) (38). Proteins present in the two lung cancer cell lines (46) as well as proteins identified by this analysis are summarized in Table I. Emphasis was placed on proteins identified by multiple (i.e. at least two) peptides because the chances of false-positive results decrease exponentially with each additional peptide identified (47). After setting the cutoffs, an average of 1,356 proteins per cell line was identified, and a total of 31,180 protein identities, accounting for 4,584 non-redundant proteins, were detected (Table I and supplemental Table 1). In addition to identifying proteins in each cancer cell secretome, we applied the emPAI (43) to estimate the abundance of each protein in the secretome of each cell line (supplemental Table 1). The false discovery rate (FDR) of peptide detection was empirically determined by searching the data set against a random IPI Human database (Version 3.26) using the same search parameters and TPP cutoffs. The FDRs determined for each cell line are shown in Table I; all were <1%.
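The filtering and empirical FDR estimate described above can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the TPP implementation: the record fields (`probability`, `n_peptides`, `is_decoy`) and function names are hypothetical, and the FDR is taken as the simple ratio of random-database hits to real-database hits passing the same cutoffs, because the text does not give the exact formula.

```python
def filter_identifications(idents, min_prob=0.95, min_peptides=2):
    """Keep identifications with probability >= 0.95 and at least two peptides."""
    return [
        p for p in idents
        if p["probability"] >= min_prob and p["n_peptides"] >= min_peptides
    ]

def decoy_fdr(idents):
    """Estimate FDR as (random-database hits) / (real-database hits)."""
    decoys = sum(1 for p in idents if p["is_decoy"])
    targets = len(idents) - decoys
    if targets == 0:
        raise ValueError("no target identifications passed the cutoffs")
    return decoys / targets

# Hypothetical identifications for illustration only.
example = [
    {"probability": 0.99, "n_peptides": 3, "is_decoy": False},
    {"probability": 0.97, "n_peptides": 2, "is_decoy": False},
    {"probability": 0.96, "n_peptides": 1, "is_decoy": False},  # single peptide: dropped
    {"probability": 0.80, "n_peptides": 4, "is_decoy": True},   # low probability: dropped
]
kept = filter_identifications(example)
fdr = decoy_fdr(kept)
```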
Distribution and Ontology Analysis of Identified Proteins-The identified proteins were further analyzed using bioinformatics programs designed to predict protein secretion pathways (Table II and supplemental Table 2). Among the 4,584 non-redundant proteins identified, the SignalP program predicted that 998 proteins were secreted in the classical secretory pathway (i.e. the endoplasmic reticulum/Golgi-dependent pathway; SignalP probability ≥0.90) based on the presence of a signal peptide (39,40). The SecretomeP program predicted that 1,438 proteins were released via the nonclassical secretory pathway (SignalP probability <0.90 and SecretomeP score ≥0.50) (41). In addition, the TMHMM determined that 121 integral membrane proteins were not secreted via the classical or nonclassical secretion pathways (42). The predicted secretion pathways of the proteins in each cell line are summarized in Table II and supplemental Table 1. Collectively, these analyses predicted that 55.8% (2,557 of 4,584) of the identified proteins were released into the conditioned media of cultured cancer cells via different mechanisms. It should be noted that many chemokines, cytokines, and growth factors, which are known as very low abundance secreted proteins, could be detected in the secretomes of various cancer cell lines (supplemental Table 3), thereby demonstrating the sensitivity of the GeLC-MS/MS strategy. To evaluate the effectiveness of this protocol with regard to secretome analysis, we analyzed proteins extracted from lysates of NPC-TW04 and A431 cells that remained on culture dishes after the removal of conditioned media. The results showed that only 34.0% (415 of 1,219) and 33.8% (395 of 1,169) of the proteins in NPC-TW04 and A431 cells, respectively, were predicted to be secreted (data not shown). Of the 4,584 proteins identified in this report, 1,241 (27.1%) were found in the Human Plasma Proteome Project database (48) (supplemental Table 2).
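The three-tier prediction scheme above amounts to a simple decision rule, sketched here in Python. The thresholds for SignalP (≥0.90) and SecretomeP (≥0.50) come from the text; the function name, the category labels, and the rule that one or more predicted transmembrane helices marks an integral membrane protein are assumptions made for illustration. The final lines re-derive the reported 55.8% from the three counts.

```python
def classify_secretion(signalp_prob, secretomep_score, tm_helices):
    """Assign a protein to one predicted release category."""
    if signalp_prob >= 0.90:
        return "classical"      # signal peptide, ER/Golgi-dependent pathway
    if secretomep_score >= 0.50:
        return "nonclassical"   # SignalP < 0.90 but SecretomeP >= 0.50
    if tm_helices >= 1:         # assumed cutoff; not stated in the text
        return "membrane"
    return "not predicted"

# Counts reported in the text for the 4,584 non-redundant proteins.
counts = {"classical": 998, "nonclassical": 1438, "membrane": 121}
released = sum(counts.values())
released_percent = round(100.0 * released / 4584, 1)
```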
ProteinCenter software was used to predict the functions of the 4,584 identified proteins based on universal GO annotation terms. These proteins were linked to at least one annotation term within the GO molecular function and biological process categories, respectively. As shown in Fig. 3A, the top three most common molecular functions were protein binding (63.4%), catalytic activity (61.3%), and metal ion binding (30.6%). The major biological process categories included metabolic processes (73.8%) and regulation of biological processes (34.5%) followed by cell organization (33.7%) and cell communication (26.8%) (Fig. 3B). The results of our GO analysis of identified proteins in the molecular function and biological process categories are shown in supplemental Tables 4 and 5, respectively.
Overlap of Identified Proteins between All Cell Lines Examined-The proteins identified among the 23 cell lines were analyzed for overlapping members (Table III and supplemental Table 2). One hundred and seventy-two proteins (3.8% of the 4,584 proteins) were detected in all cancer cell secretomes. About 23.0% of the 4,584 proteins were detected in more than half (Ն12) of the cell lines, and 35.1% were found in 3-11 cell lines. Nearly one-third (i.e. 29.3%) of the 4,584 proteins were uniquely detected in the secretome of one of the 23 cell lines, and 12.6% (576 proteins) were identified in two of the 23 cell lines.
To reduce the number of potential tumor marker candidates, we combined proteins identified in the secretomes of cell lines from each cancer type to form a list of non-redundant proteins for each cancer type. These 11 lists were used to assess the overlap in identified proteins (Table IV and supplemental Table 6). A significant portion (36.3%) of the proteins were found in more than half (at least six) of the cancer types; 33.6% (1,539 proteins) were detected in two to five cancer types, and 30.1% (1,381 proteins) were detected in a single cancer type. Taken together, these data reveal that cell lines from different tumor origins secrete/release hundreds of common proteins and that cancer cell lines can also secrete/release proteins unique to a specific cancer type.
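The merging and overlap counting described above can be sketched as follows. This is a minimal illustration with hypothetical cell line names and protein identifiers; it shows only the set logic (union per cancer type, then detection frequency across types), not the ProteinCenter workflow.

```python
from collections import Counter

def merge_by_cancer_type(secretomes, cancer_type_of):
    """Union the protein sets of all cell lines belonging to the same cancer type."""
    merged = {}
    for cell_line, proteins in secretomes.items():
        merged.setdefault(cancer_type_of[cell_line], set()).update(proteins)
    return merged

def detection_frequency(groups):
    """Count, for each protein, how many groups (cancer types) contain it."""
    return Counter(protein for proteins in groups.values() for protein in proteins)

# Hypothetical toy data: three cell lines from two cancer types.
secretomes = {
    "LineA1": {"P1", "P2"},
    "LineA2": {"P2", "P3"},
    "LineB1": {"P2", "P4"},
}
cancer_type_of = {"LineA1": "TypeA", "LineA2": "TypeA", "LineB1": "TypeB"}

merged = merge_by_cancer_type(secretomes, cancer_type_of)
freq = detection_frequency(merged)
type_specific = sorted(p for p, n in freq.items() if n == 1)
```

Proteins with a frequency of 1 correspond to the cancer type-specific candidates, and proteins present in every group correspond to the pan-cancer candidates.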
Evaluation of Potential Cancer-specific Biomarkers-Detection of proteins that are uniquely released by each cancer type might facilitate the discovery of biomarkers for individual cancers. Thus, we focused our attention on the 1,381 proteins that were uniquely detected in the secretomes of a specific cancer type. To efficiently narrow down our candidate list of potential cancer-specific biomarkers, we consulted the HPA. This database contains the immunohistochemical (IHC) staining profiles of numerous proteins in a variety of cancerous and non-cancerous tissues based on more than 8,800 antibodies (35). We searched all 1,381 proteins in the HPA database and selected those whose expression has been examined in corresponding cancer tissues from a small number of patients. The IHC staining profiles of corresponding non-cancer tissues in the HPA were also analyzed, although only three or fewer biopsies were available (supplemental Table 7). We found that 603 of 1,381 proteins have been examined in their corresponding tumor tissues (Table V). Among these, 77.8% (469) of the proteins were detected in more than 50% of the tumor tissue sections (Table V). The IHC staining results for the 603 proteins and their corresponding cancer types from the HPA database are summarized in Table V and supplemental Table 7.
The following examples illustrate the ability of our analyses to identify multiple marker candidates that warrant further validation (Table VI). Among the 40 proteins detected in most CRC tissues (Table V), cell surface A33 antigen was found to be mainly negative in other cancer types, whereas neutral amino acid transporter A, isoform CSBP1 of mitogen-activated protein kinase 14, and bone morphogenetic protein 4 were overexpressed in CRC relative to other cancers. Among the proteins detected in most HCC tissues (Table V), bile salt sulfotransferase, ornithine carbamoyltransferase, monocyte differentiation antigen CD14, and isoform 1 of asialoglycoprotein receptor 2 were less immunoreactive in tissues of other cancer types, whereas multidrug resistance protein 1 and vitamin K-dependent protein C were overexpressed in HCC versus other cancers. Compared with other cancers, bladder cancer tissues stained more strongly for proteins such as cadherin-6, squalene synthetase, ribophorin II, and 15-hydroxyprostaglandin dehydrogenase. The levels of neurogenic locus notch homolog protein 3 and trefoil factor 1 were higher in breast cancer tissues versus tissues of other cancers. Stromal cell-derived factor 1 (CXCL12) stained more strongly in lung cancer tissues than in non-tumor lung tissues.
Evaluation of Potential Pan-cancer Biomarkers-The identification of proteins that have a high frequency of detection in the conditioned media of different cell lines clearly revealed proteins commonly released by various cancer cells. A total of 172 proteins (3.8%) were found in the conditioned media of all cell lines examined (Table III and supplemental Table 8). To evaluate the potential of these proteins to serve as pan-cancer marker candidates, we evaluated their expression in the tumor tissues of nine cancer types in the HPA database, including breast, cervix, colon, head and neck, liver, lung, pancreas, skin, and bladder cancers (35). In the HPA database (Version 5.0), 114 of the 172 proteins had been analyzed by IHC staining (supplemental Table 8). Among the proteins detected in more than half of the tumor tissue sections, 70.2% (80 of 114) of the proteins were observed in all nine tumor types, and 12.3% (14 of 114) of the proteins were detected in eight of nine cancer types (supplemental Table 8). Moreover, 45 proteins were detected in human plasma samples as documented in the Human Plasma Proteome Project (supplemental Table 8), and eight of 45 proteins showed negative or weak staining in over half of the nine corresponding normal tissue types (Table VII). These observations suggest that secreted proteins common to multiple cancer cell lines are potential pan-cancer markers.
Table V footnotes: (a) Proteins that were solely detected in the secretome of a specific cancer type were immunohistochemically examined in the corresponding cancer tissue sections of the HPA. (b) Proteins that were solely detected in the secretome of a specific cancer type were immunohistochemically detected in more than half of the tumor tissue sections examined in the HPA.
Hierarchical Clustering Analysis for Pathway-based Biomarker Searches-A new approach toward biomarker discovery was recently proposed wherein pathways are monitored and targeted rather than individual proteins (49,50). Many secreted proteins appear to play important roles in the cancer microenvironment (51); thus, we attempted to cluster proteins according to their abundance in the conditioned media from each cancer cell line in an effort to identify potential pathways involved in the regulation of cancer microenvironments. Toward this end, we calculated the emPAI values of all proteins identified in the conditioned media of 23 cell lines, transformed these values to Z scores, and analyzed these data via unsupervised hierarchical classification as described under "Experimental Procedures." To examine the ability of emPAI-based Z scores to reflect the relative abundance of proteins in the conditioned media, we compared the Z score values of four selected targets (i.e. BIGH3, fascin, PAI-1, and prosaposin) with their corresponding signal intensities as determined by Western blot analyses of conditioned media (supplemental Fig. 2). There was a significant correlation between emPAI-based Z scores and Western blot-based Z scores, suggesting that emPAI-based Z scores can be used to estimate the relative abundance of proteins in conditioned media.
When proteins detected in the conditioned media were clustered according to Z scores, the three NPC cell lines and the two lung cancer cell lines clustered together. However, the other cell lines could not be categorized by tissue type (Fig. 4A and supplemental Fig. 3). We further selected the 79 proteins whose abundance profiles best distinguished the NPC cell lines from the other cell lines (Fig. 4B and supplemental Table 9).
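Unsupervised hierarchical clustering of Z score profiles can be sketched with a small agglomerative routine. The distance metric and linkage rule used in the study are described under "Experimental Procedures" and are not stated in this section, so Euclidean distance with average linkage is an assumption here, and the labels and Z score vectors are hypothetical toy data.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length Z score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping a sample label to its vector of Z scores.
    Returns the merges in the order they occur, each as
    (members_of_cluster_1, members_of_cluster_2, distance).
    """
    clusters = [(label,) for label in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Average pairwise distance between members of the two clusters.
            d = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
            d /= len(c1) * len(c2)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 + c2)
        merges.append((c1, c2, d))
    return merges


# Hypothetical Z score profiles (three proteins) for four toy samples.
toy_profiles = {
    "line_A": [1.2, 0.9, -0.3],
    "line_B": [1.1, 1.0, -0.2],
    "line_C": [-0.8, -1.1, 0.9],
    "line_D": [-0.9, -0.7, 1.1],
}
merge_order = average_linkage(toy_profiles)
```

Samples that merge early at small distances have similar secretome profiles; cutting the merge sequence at a chosen distance yields the clusters.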
We then used MetaCore software to build biological networks and analyze the possible biological linkages between these 79 proteins. When the shortest paths algorithm was used to map the shortest interaction path and obtain a global view, 36 proteins were brought together in the networks (supplemental Fig. 4). The networks shown in supplemental Fig. 4 demonstrate the enormous number of complex interactions between the 36 identified proteins and various intracellular signaling proteins. The 79 proteins were analyzed using the MetaCore analyze network algorithm to further explore their involvement in various biological processes. This analysis revealed a significant number of networks involved in cell adhesion and migration (Fig. 5A; p = 2.5 × 10⁻²⁶) and immune system regulation (Fig. 5B; p = 5.7 × 10⁻²²). The 24 proteins involved in both networks are listed in Table VIII. Among them, fibronectin is a potential NPC serum biomarker (20), laminin subunit γ-1 is overexpressed in NPC via the down-regulation of mir-29c microRNA (52), and cathepsin L is highly expressed in NPC, and its overexpression correlates with lymph node metastasis and distant metastasis (53). These observations support the feasibility of a pathway-based search strategy for biomarker discovery and furthermore suggest that the 22 additional proteins in the networks described above are potential NPC biomarkers that warrant further investigation.
Validation of Monocyte Differentiation Antigen CD14, Stromal Cell-derived Factor 1, Cathepsin L1, and Interferon-induced 17-kDa Protein as Potential Serological Cancer Biomarkers-To determine the clinical relevance of the results described above, we used ELISA to measure the levels of a potential liver cancer marker, monocyte differentiation antigen CD14 (Table VI), a potential lung cancer marker, SDF-1 (or CXCL12) (Table VI), and two potential NPC markers (i.e. cathepsin L1 and interferon-induced 17-kDa protein (ISG15)) (Table VIII) in serum or plasma samples from cancer patients and healthy controls. The CD14 and SDF-1 markers were selected based on a combined analysis of secretomes from 23 cell lines and the HPA, whereas cathepsin L1 and ISG15 were selected via the pathway-based strategy. In our data set, CD14 was selectively detected in the secretome of the HepG2 cell line on the basis of three tryptic peptides (i.e. AFPALTSLDLSDNPGLGER, LTVGAAQVPAQLLVGALR, and TGTMPPLPLEATGLALSSLR). In the HPA database, expression of CD14 (detected using the HPA002127 antibody) was observed in different cell types in a variety of normal human tissues but not in bile duct cells or hepatocytes in normal liver tissue. Interestingly, positive CD14 staining was observed at a much higher rate (i.e. nine of 11 times) in HCC specimens than in the other 19 cancer types according to data obtained from the HPA database (supplemental Fig. 5). The SDF-1 marker was selectively detected in the CL 1-5 secretome based on the presence of two tryptic peptides (FFESHVAR and ILNTPNCALQIVAR). In the HPA database, SDF-1 expression (detected using the CAB017564 antibody) was found in different cell types in a variety of normal human tissues, including macrophages in the lung, but was not found in lung alveolar cells. Positive staining of SDF-1 was observed in many cancer types, including seven of 10 lung cancer specimens, according to data obtained from the HPA database (supplemental Fig. 6). As shown in Fig. 6A, the plasma levels of CD14 were significantly higher in patients with liver cancer (n = 44) than in healthy controls (n = 45) (1.83 ± 0.36 and 1.43 ± 0.29 μg/ml, respectively; p = 0.0117). The SDF-1 plasma levels were significantly higher in patients with lung cancer (n = 44) than in healthy controls (n = 44) (4.01 ± 1.57 and 3.11 ± 0.56 ng/ml, respectively; p = 0.0007) (Fig. 6B). Finally, the serum levels of cathepsin L1 (Fig. 6C) and ISG15 (Fig. 6D) were elevated in patients with NPC relative to healthy controls (6.87 ± 3.30 and 5.46 ± 1.36 ng/ml, respectively, for cathepsin L1; p = 0.0106; and 3.53 ± 3.91 and 1.90 ± 2.16 ng/ml, respectively, for ISG15; p = 0.0184).
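A patient-versus-control comparison of this kind can be sketched as follows. This section does not state which statistical test produced the reported p values, so the two-sided Mann-Whitney U test below is an assumption chosen for illustration; it uses the normal approximation without a tie correction, which is only rough for small samples. The function name and the toy concentrations are hypothetical and are not data from the study.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U statistic for sample x, approximate two-sided p value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values and give them the average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p


# Hypothetical marker concentrations (arbitrary units) for two toy groups.
toy_patients = [2.1, 1.9, 2.4, 1.7, 2.0, 2.6]
toy_controls = [1.4, 1.6, 1.3, 1.8, 1.5, 1.2]
u_stat, p_value = mann_whitney_u(toy_patients, toy_controls)
```

With real cohorts, an established statistics package would normally be used instead so that tie corrections and exact small-sample p values are handled properly.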

DISCUSSION
Although much time and effort have been devoted to the study of molecular alterations in cancer, early detection remains one of the most promising approaches to reducing the growing cancer burden. Thus, biomarkers capable of early detection will play a key role in the management and control of most, if not all, cancers in the future (54,55). Cancer biomarkers should be measurable in bodily fluids, especially in blood samples, to allow for the screening of large populations (11,56). As numerous proteins have been found to exhibit altered levels in various cancer tissues, recent efforts have focused on compiling lists of general and cancer-specific cancer biomarker candidates, focusing on those with a higher chance of detection in bodily fluids (4,57). Thus, the present study sought to conduct an in-depth analysis of the secretomes of 23 cancer cell lines and the HPA in an effort to construct a focused data set of serological cancer biomarker candidates.

Analysis of Cancer Cell Secretomes for Biomarker Discovery
pivotal roles in tumor progression, invasion, metastasis, and/or angiogenesis by regulating cell-to-cell and cell-toextracellular matrix interactions. More importantly, these secreted proteins are likely to enter bodily fluids and could therefore be measured using non-invasive diagnostic tests. Thus, the analysis of cancer cell secretomes derived in vitro may help identify potential serum tumor biomarkers (58,59).
Although the secretomes of cell lines derived from breast, ovarian, prostate, pancreatic, and lung cancers have been profiled individually (21-24, 28-30, 32, 51), no studies have compared the secretomes of cell lines from various cancer types or developed in-depth profiles of these proteins. We previously examined cancer cell secretomes using SDS-PAGE and MALDI-TOF MS (25). The current study examines the secretomes of 23 cancer cell lines using the GeLC-MS/MS approach. We have generated, for the first time, a data set consisting of 31,180 protein identities, corresponding to 4,584 non-redundant proteins, with an approximate average of 1,300 proteins from each cell line and an error rate of less than 1% (Table I and supplemental Table 1). We used this data set to compare the secretomes of cell lines from various cancer types, which were generated and analyzed under the same experimental conditions. Our analysis revealed numerous proteins that are commonly or selectively secreted by cell lines derived from different cancer types. Therefore, this valuable reference data set may facilitate the discovery of pan-cancer biomarkers and cancer type-specific biomarkers using a secretome-based strategy.
We and other groups have performed extensive studies to identify cancer-derived, body fluid-accessible marker candidates in conditioned media from various cancer cell lines. However, it has become apparent that not all proteins found in cancer cell secretomes are useful tumor-associated markers. Thus, researchers need a way to determine which candidates should be further validated as useful body fluid-accessible markers. Theoretically, the most likely candidates are proteins that are released by cancer cells and are overexpressed at a high frequency in tumor tissues. This notion is supported by recent immunohistochemical studies showing that numerous proteins are secreted by cancer cell lines and are overexpressed in cancerous tissues and that the serum and plasma levels of these proteins are indeed elevated in cancer patients (20,23,25-27,31). These proteins include PAI-1 and Mac-2BP in patients with NPC (20), cathepsin D in patients with lung cancer (23), Mac-2BP and CRMP-2 (collapsin response mediator protein-2) in patients with CRC (25,26), Mac-2BP in patients with oral cancer (27), and ALCAM, CD44, IL1R2, MIF, and TARC in patients with Hodgkin lymphoma (31). Studies that explore the tissue expression of proteins secreted by cell lines of different cancer types may therefore allow for the efficient prioritization of targets for subsequent validation.
The HPA is the first publicly available database containing information about protein expression in standardized normal human tissues, cancers, and in vitro cultured cell lines (60,61). Many of these data have been analyzed via tissue microarrays (62), and the individual images have been annotated by certified pathologists. We searched the HPA database for cancer cell-secreted/released proteins of interest using an arbitrary selection criterion (i.e. proteins expressed in more than 50% of the tumor tissue sections examined) and thereby performed an in silico validation of the expression of hundreds of target proteins in various human cancers. Ultimately, we generated a list of ~470 serological cancer biomarker candidates from 11 cancer types for further validation (Table V and supplemental Table 7). In the present study, we confirmed the significantly elevated plasma levels of two such candidates (i.e. CD14 and SDF-1/CXCL12) in HCC and lung cancer patients, respectively (Fig. 6). The CD14 protein, a glycosylphosphatidylinositol-anchored glycoprotein of 50-55 kDa, is constitutively expressed on the surface of myeloid cells and acts as a pattern recognition receptor that plays an important role in innate immunity. A second, soluble form of CD14 (sCD14) lacking the glycosylphosphatidylinositol tail is abundant in serum (63,64). In addition to myeloid cells, recent studies have shown that CD14 is also expressed by many non-myeloid cells (65), including the hepatoma cell line Hep G2 (66,67). Although many studies have examined the role of CD14 in infection- or immunity-related diseases, little is known about the possible significance of this protein in the development of cancer. To our knowledge, only one previous study reported higher preoperative serum levels of sCD14 in patients with epithelial ovarian cancer than in patients with benign ovarian disease (68).
The SDF-1/CXCL12 protein is a member of the chemokine family that acts as a ligand for the CXCR4 receptor and plays multiple roles in tumor pathogenesis (69,70). Although SDF-1/CXCL12 is overexpressed in numerous cancers, few studies have investigated SDF-1/CXCL12 levels in blood samples obtained from cancer patients (71-73), and the clinical significance of SDF-1/CXCL12 blood levels in most cancers is largely unknown. Our integrative analysis of cancer cell secretomes and the HPA reveals, for the first time, that CD14 and SDF-1/CXCL12 are potential plasma markers for HCC and lung cancer, respectively.
In addition to the 469 candidates selected from 11 cancer types, the present study also highlighted ~80 proteins as pan-cancer serological marker candidates. The expression of these marker candidates in specimens from nine cancer types has been validated in the HPA (supplemental Table 8). These candidates are involved in many biological processes, including metabolic processes (mainly the glycolytic enzymes), cell motility (cytoskeleton-related regulatory proteins), protein folding (molecular chaperones), proteolytic systems (proteasome and proteases), and protein synthesis. It has long been recognized that these biological processes are dysregulated in many cancers, as is the expression of proteins and enzymes involved in these processes (74). However, few studies have systematically determined whether these proteins can be secreted by different types of cancer cells and whether these proteins are potentially useful serological markers for cancer. We addressed the former question and found that more than 170 proteins were released by all 23 cancer cell lines in serum-free conditions. It is possible that these proteins were detected in all cell lines because they are simply more abundant in the conditioned media. This notion is supported by our findings that most, if not all, of the 170 proteins were identified on the basis of multiple peptides and have significantly higher emPAI values than do proteins that were only detected in some cell lines (Table III and supplemental Tables 1 and 8). Another important question is whether proteins that are released by multiple cell lines and are highly expressed in most cancer types are, in fact, suitable serological pan-cancer marker candidates that should be further validated. Some of the target proteins on our list have been previously described as potential serological markers for various cancers. For example, elevated serum or plasma levels of galectin-3-binding protein (i.e. Mac-2BP) (75) have been reported in six cancer types, including breast cancer, HCC, lymphoma, NPC, CRC, and oral cancer (20, 26, 27, 76-78). Similarly, higher serum or plasma levels of cathepsin D have been detected in nine cancer types, including breast cancer, HCC, HNC, prostate cancer, glioma, CRC, stomach cancer, pancreatic cancer, and lung cancer (23, 79-85).
One of the major challenges in the fields of tumor marker discovery and cancer biology is the formation of biological hypotheses about numerous marker candidates and the design of effective follow-up experiments (49). We evaluated our short list of secreted regulatory proteins for NPC using emPAI-based, label-free quantification and hierarchical clustering analyses in an effort to put our secretome data into a biological context (Fig. 4 and supplemental Table 9). It should be noted that the three NPC cell lines examined in this study originated from distinct NPC types. Specifically, NPC-TW02 and NPC-TW04 were derived from a keratinizing carcinoma and an undifferentiated carcinoma, respectively (86), whereas NPC-BM1 was derived from a bone marrow biopsy of a patient with nonkeratinizing NPC (87). Nevertheless, the three NPC cell lines could be clustered together with the proteins listed in supplemental Table 8, suggesting that these proteins and pathways may play an important role in NPC initiation or progression. We performed an additional pathway analysis by integrating our secretome data into a cellular signaling context. This analysis suggested that cell adhesion/migration and immune system regulation are among the most differentially regulated biological processes in NPC as compared with other cancers (Fig. 5 and supplemental Fig. 4). Metastasis has occurred in 7% of NPC patients by the time of initial diagnosis, and over 20% of patients with NPC develop metastasis after treatment (88,89). In addition, immune suppression and evasion via the inactivation of tumor-infiltrating lymphocytes and imbalances in regulatory and effector T cells have been proposed in NPC patients (90,91). We selected two targets for validation (i.e. cathepsin L1 and ISG15) that are known to be involved in the regulation of the immune system (Fig. 5B).
Cathepsin L1 (also known as cathepsin L, which is distinct from cathepsin L2 and cathepsin V) is a lysosomal cysteine protease that can degrade the components of extracellular matrices and basement membranes (92). This protease is overexpressed in a variety of cancer tissues (93). Cathepsin L1 is often overexpressed in metastatic cervical lymph node samples from patients with NPC, and its overexpression correlates with lymph node metastasis and distant metastasis (53). The second target for validation, ISG15 (i.e. an interferon-stimulated gene also known as ubiquitin cross-reactive protein), plays a critical role in the interferon-mediated immune response against viral infection (94,95). ISG15 was recently identified as a novel tumor marker candidate in bladder, breast, and oral cancers (96-98). However, we are the first to examine the serum levels of cathepsin L1 and ISG15 in patients with NPC, and our results show that the levels of these two proteins are higher in patients with NPC than in healthy controls. Taken together, these observations suggest that emPAI-based, label-free quantification and hierarchical clustering analyses of multiple cancer cell secretomes, in conjunction with biological pathway analyses, constitute an effective strategy for discovering dysregulated pathway-based serological marker candidates in cancers such as NPC.
To date, ~8,800 antibodies have been applied against the human proteome to examine the expression of corresponding proteins in the HPA. There are still many human proteins whose expression in cancer tissues has not been systematically examined via immunohistochemical analyses and therefore could not be analyzed using the methods described here. Expression of these proteins in cancerous versus noncancerous tissues may be alternatively evaluated via the analysis of numerous cDNA microarray data sets available in the public domain, and the results could then be integrated with the cancer cell secretome data set to identify potential serological marker candidates. The feasibility of this notion is supported by our recent identification of Mac-2BP as a CRC plasma biomarker based on its apparent secretion by CRC cell lines and elevated transcriptional levels in a public array-based analysis of CRC tissues (26). In addition, we performed an in-house cDNA microarray analysis to demonstrate that the mRNA levels of cystatin A, manganese-superoxide dismutase, and MMP2 were higher in NPC tissues than in adjacent non-tumor tissues. We furthermore showed that these proteins were released by NPC cell lines and proposed that they serve as a potential serum biomarker panel for NPC (99). As immunohistochemical staining signals can vary with the use of different antibodies, our findings suggest that the potential serological markers identified using the strategy described here might complement the list of potential markers derived from the combined analysis of cancer cell secretomes and cancer tissue transcriptomes.
In conclusion, we have performed an in-depth comparative analysis of the secretomes of 23 cancer cell lines and have integrated our data with the HPA to generate a list of potential body fluid-accessible cancer biomarker candidates. The validation of potential cancer biomarkers requires a large cohort of well-defined, high-quality clinical specimens, a process that slows the discovery of clinically useful biomarkers (54,55). Therefore, efficient strategies for narrowing down the list of proteins that are dysregulated in cancers would greatly reduce the costs of manpower, reagents, and precious clinical specimens. The potential cancer biomarkers described here could complement or serve as a useful alternative to previously constructed data sets (4,57) for the more rapid validation of clinically useful biomarkers.