Highly Efficient Classification and Identification of Human Pathogenic Bacteria by MALDI-TOF MS*

Accurate and rapid identification of pathogenic microorganisms is of critical importance in disease treatment and public health. Conventional work flows are time-consuming, and procedures are multifaceted. MS can be an alternative but is limited by low efficiency for amino acid sequencing as well as low reproducibility for spectrum fingerprinting. We systematically analyzed the feasibility of applying MS for rapid and accurate bacterial identification. Directly applying bacterial colonies without further protein extraction to MALDI-TOF MS analysis revealed rich peak contents and high reproducibility. The MS spectra derived from 57 isolates comprising six human pathogenic bacterial species were analyzed using both unsupervised hierarchical clustering and supervised model construction via the Genetic Algorithm. Hierarchical clustering analysis categorized the spectra into six groups precisely corresponding to the six bacterial species. Precise classification was also maintained in an independently prepared set of bacteria even when the number of m/z values was reduced to six. In parallel, classification models were constructed via Genetic Algorithm analysis. A model containing 18 m/z values accurately classified independently prepared bacteria and identified those species originally not used for model construction. Moreover, samples containing fewer than 10⁴ bacterial cells, as well as different species in bacterial mixtures, were identified using the classification model approach. In conclusion, the application of MALDI-TOF MS in combination with a suitable model construction provides a highly accurate method for bacterial classification and identification. The approach can identify bacteria with low abundance even in mixed flora, suggesting that rapid and accurate bacterial identification using MS techniques even before culture can be attained in the near future.

Currently the most popular methods for bacterial identification are based on microbiologic procedures, antibody recognition, and PCR amplification. Traditional microbiologic methods are culture-based assays that examine the presence of bacterial species. These methods provide high sensitivity and specificity, but their efficiency is limited by the complexity of the procedures, including culture, selection, isolation, and morphologic and biochemical characterization, which usually take 48 h or longer. Serologic methods are presumptive and are limited by the availability of antibodies and to bacteria that have been included in the assays in advance. Molecular biology techniques, particularly PCR, have been regarded as non-culture-based methods with high efficiency and specificity (1). However, they are completely dependent on the known genetic sequences of the target bacteria.
MS with its capability of de novo protein/peptide sequencing (such as electrospray ionization or MALDI-TOF MS for tandem MS/MS) or its high efficiency for proteome profiling (particularly MALDI-TOF MS) has been suggested as an alternative for microbial identification (2-7). In the past decade, extraction of bacterial proteins for sequencing using tandem MS/MS (8-11) or for proteome profiling followed by matching MS spectrum results to databases (fingerprinting) has been used for bacterial identification (6, 8, 12-15). Despite much improvement (10, 11, 16-20), neither de novo amino acid sequencing nor protein fingerprinting has been applied to clinical or epidemiologic uses because they are relatively time-consuming and technique-demanding or have low reproducibility (16, 19). Further evaluation of the feasibility of applying MALDI-TOF MS for rapid and accurate microorganism identification is warranted.
In this study, we systematically evaluated procedures for rapid profiling and data analysis for bacterial classification and identification. We aimed to demonstrate that directly subjecting intact bacterial colonies to protein profiling using MALDI-TOF MS can be a simple and reliable approach (21-26) and that bacteria of independently prepared groups can be accurately classified and identified by two analytic approaches: the unsupervised (hierarchical clustering analysis) and supervised (Genetic Algorithm) (27) approaches. We further intended to demonstrate the capability to identify bacteria present in minimal numbers of cells and in mixed flora.

EXPERIMENTAL PROCEDURES
Collection and Isolation of Pure Bacterial Colonies-All bacterial isolates used in this study were collected and characterized by the Clinical Pathology Laboratory of Chang Gung Memorial Hospital, Taoyuan, Taiwan. Bacterial species were characterized using the standard microbiological protocols (28). Colonies were cultured on sheep's blood agar plates at 37°C for 48 h followed by 24-h incubation at room temperature. To verify the presence of vegetative bacterial cells, the plates were examined under a phase-contrast microscope. The cells were harvested from blood agar plates by scraping, and the intact single colony was transferred onto a polished ground steel target plate for further MALDI-TOF MS analysis (Bruker Daltonics, Bremen, Germany).
Three independently prepared sets of bacterial isolates were used. Initially 57 well characterized bacterial isolates (training set) composed of the six most common species of human pathogenic bacteria, namely Staphylococcus aureus (10 isolates), Streptococcus serogroup B (eight isolates), Escherichia coli (eight isolates), Klebsiella pneumoniae (10 isolates), Salmonella enterica serogroup B (11 isolates), and Pseudomonas aeruginosa (10 isolates), were used for both unsupervised hierarchical clustering analysis (HCA)¹ and supervised analysis (see below) for construction of the classification models.
The second set contained 37 well characterized bacterial isolates composed of the same bacterial species as the first set and was used for external validation of the constructed classification models (independent set 2). The third set of 38 bacterial isolates, including 14 isolates other than the six species used for the model construction, was used for further evaluation of the classification models (independent set 3).
Mass Spectrometry-Each colony that had been spotted on target plates was overlaid with 1 µl of matrix solution containing 1.5 mg of α-cyano-4-hydroxycinnamic acid in 50% acetonitrile with 2.5% TFA. When the matrix for MALDI-TOF MS was changed to sinapic acid or 2,5-dihydroxybenzoic acid or the preparation of solvents for the matrix was changed, the resulting MS spectra were affected. As such, we used the protocol described here for matrix preparation consistently throughout this study.
Positive ion mass spectra in the linear mode were obtained on an Ultraflex TOF/TOF mass spectrometer (Bruker Daltonics) after the sample was dried at room temperature. A 337 nm nitrogen laser irradiated and ionized the samples at a shot rate of 25 Hz. Each spectrum was the sum of 500 laser shots over a mass range of 1,000-25,000 Da controlled by FlexControl acquisition software (Version 2.4, Bruker Daltonics). Mass calibration was achieved using a mixture of standard peptides and proteins (angiotensin II, ACTH, insulin, and myoglobin) to maintain mass accuracy better than 500 ppm. System reproducibility was confirmed by applying two randomly selected bacterial isolates to 10 different wells on the same plate followed by MALDI-TOF MS analysis using the same protocols as described above.
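The 500 ppm calibration criterion can be made concrete with a short calculation. The sketch below is illustrative only: the function names and the example masses are assumptions and are not values measured in this study.

```python
def ppm_error(measured_mz: float, reference_mz: float) -> float:
    """Signed mass error in parts per million relative to the reference mass."""
    return (measured_mz - reference_mz) / reference_mz * 1e6


def within_tolerance(measured_mz: float, reference_mz: float,
                     tol_ppm: float = 500.0) -> bool:
    """True when the absolute mass error does not exceed tol_ppm."""
    return abs(ppm_error(measured_mz, reference_mz)) <= tol_ppm


# Hypothetical calibrant at m/z 5000 read back at 5002 and at 5003.
print(round(ppm_error(5002.0, 5000.0), 1))
print(within_tolerance(5002.0, 5000.0))
print(within_tolerance(5003.0, 5000.0))
```

At m/z 5000 a 500 ppm window corresponds to 2.5 Da, so the first hypothetical reading passes and the second does not.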
Data Analysis-To analyze the mass spectra patterns of bacteria, ClinProTools software (Version 2.0.365, Bruker Daltonics) was first used for peak definition (signal to noise ratio >3), integration (end point level), mass recalibration (maximal peak shift of 500 ppm), area normalization (against total ion count), and statistical analysis (Wilcoxon/Kruskal-Wallis test). Each corresponding peak throughout the spectra within each studied set was carefully inspected. The peak list combined with normalized areas was exported to Gene Cluster (Version 3.0, Human Genome Center, University of Tokyo) for unsupervised HCA. The input data were first normalized to have a mean of 0 and standard deviation of 1, and then HCA was performed using the correlation similarity metric and average linkage method.
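The normalization and clustering steps can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the Gene Cluster implementation: it assumes that each peak (column) is standardized across spectra, that the distance between two spectra is 1 minus their Pearson correlation, and that merging stops at a requested number of clusters. The toy intensities are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each peak (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0


def average_linkage(rows, n_clusters):
    """Agglomerative clustering with correlation distance and average linkage."""
    dist = {(i, j): 1.0 - pearson(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)}
    clusters = [[i] for i in range(len(rows))]

    def between(c1, c2):
        # Average linkage: mean pairwise distance between members.
        return mean(dist[min(i, j), max(i, j)] for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)


# Hypothetical normalized peak areas: rows are spectra, columns are peaks.
spectra = [[10, 1, 1, 8], [9, 2, 1, 9], [1, 10, 9, 1], [2, 9, 10, 2]]
print(average_linkage(zscore_columns(spectra), 2))
```

In this toy input the first two spectra share one peak pattern and the last two share another, so the two recovered clusters follow those patterns.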
To reduce classifier complexity, we selected m/z values with p value less than 10⁻¹⁰ (one-way ANOVA) across all species and subjected them to HCA. To further simplify the classifier, we used the K-nearest neighbor text categorization (KNN) method to select classifiers and validate their recognition rates by the leave-one-out method for cross-validation as well as an independent group of bacterial isolates (independent set 2).
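The ANOVA-based filtering of m/z values can be illustrated as follows. This sketch computes only the one-way ANOVA F statistic and ranks peaks by it; converting F to a p value requires the F distribution (for example via `scipy.stats.f_oneway`, which is not used here). The function names, peak labels, and intensities are hypothetical.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of per-species intensity lists."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf")  # perfectly separated groups
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_peaks(intensities_by_peak):
    """Order peak labels from most to least discriminating by F statistic."""
    scores = {mz: anova_f(groups) for mz, groups in intensities_by_peak.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical peaks, each with intensities grouped by species.
toy_peaks = {
    "mz_A": [[1, 2, 3], [4, 5, 6]],  # well separated between species
    "mz_B": [[1, 2, 3], [2, 3, 4]],  # heavily overlapping
}
print(rank_peaks(toy_peaks))
```

A larger F statistic corresponds to a smaller p value for fixed group sizes, so thresholding on p and ranking by F select the same peaks within one data set.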
For rapid classification, we implemented the Quick Classifier (QC), Support Vector Machine (SVM), and Genetic Algorithm (GA) embedded in the ClinProTools software for model generation. In the initial model construction stage, 57 clinical isolates of bacteria, the training set, were used to establish models as well as perform cross-validation procedures. Identification was defined as 100% matched to the species-specific m/z values in the model.
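The "100% matched" criterion can be pictured as a marker-matching rule. The sketch below is an assumption-laden illustration and not the ClinProTools algorithm: the grouping of markers into a toy model, the reuse of a 500 ppm matching window, and the observed peak lists are all hypothetical. The three m/z values are ones this article later reports as belonging to model 7.

```python
def ppm_diff(observed_mz, expected_mz):
    """Absolute mass difference in parts per million."""
    return abs(observed_mz - expected_mz) / expected_mz * 1e6


def identify(peaks, model, tol_ppm=500.0):
    """Return species whose marker m/z values are all found among the peaks."""
    identified = []
    for species, markers in model.items():
        if all(any(ppm_diff(p, m) <= tol_ppm for p in peaks) for m in markers):
            identified.append(species)
    return identified


# Toy model: marker grouping is hypothetical, for illustration only.
toy_model = {
    "P. aeruginosa": [4436.0, 5212.0],
    "S. enterica": [4451.0],
}

single_species_peaks = [4436.8, 5211.5, 7000.0]   # hypothetical spectrum
mixed_peaks = [4436.8, 5211.5, 4450.9]            # hypothetical mixture

print(identify(single_species_peaks, toy_model))
print(identify(mixed_peaks, toy_model))
```

Because each species is tested independently against its own markers, the same rule can report more than one species for a mixed spectrum.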
The performance of the classification models was evaluated by recognition capability (RC) and cross-validation achievement (CVA): RC = TP/n, where TP is the number of true positives (correctly classified) in a data set and n is the number of samples in a data set, and CVA = TP′/n′, where TP′ is the number of true positives (correctly classified) in a 20% left-out data set and n′ is the number of samples in a 20% left-out data set. RC is essential to internally evaluate the fitness of the classification models, whereas CVA is crucial to measure robustness of the resulting classification models.
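Both metrics are fractions of correctly classified samples, computed on different subsets. A minimal sketch follows, assuming labels are compared as plain strings; the function names, labels, and the choice of left-out indices are hypothetical.

```python
def fraction_correct(true_labels, predicted_labels):
    """Share of samples whose predicted label equals the true label."""
    if not true_labels:
        raise ValueError("at least one sample is required")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def recognition_capability(true_labels, predicted_labels):
    """RC = TP/n over the full data set used to build the model."""
    return fraction_correct(true_labels, predicted_labels)


def cross_validation_achievement(true_labels, left_out_predictions, left_out_idx):
    """CVA = TP'/n' over the left-out subset only."""
    truth = [true_labels[i] for i in left_out_idx]
    return fraction_correct(truth, left_out_predictions)


# Hypothetical data: 10 isolates from two classes, one misclassified.
labels = ["A"] * 5 + ["B"] * 5
predicted = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
print(recognition_capability(labels, predicted))

# 20% of 10 samples left out (indices chosen arbitrarily for illustration).
print(cross_validation_achievement(labels, ["A", "B"], [0, 5]))
```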
For external validation, another set containing 37 characterized isolates was analyzed in a blind study (independent set 2) to examine the robustness of the models. Lastly 38 uncharacterized isolates (independent set 3) from the clinical specimens were used to evaluate the performance of the rapid classification method. The results were then compared with those determined by the conventional microbiologic procedures. The positive predictive value (PPV) and negative predictive value (NPV) were used to evaluate the performance and reliability of the models: PPV = TP/(TP + FP), where TP is the number of true positives (correctly classified) in a data set and FP is the number of false positives (misclassified), and NPV = TN/(TN + FN), where TN is the number of true negatives (correctly excluded from those species used for the model construction) in a data set and FN is the number of false negatives (misclassified).
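The two predictive values follow directly from the four counts. The sketch below uses hypothetical counts chosen only to show the arithmetic; they are not results from this study.

```python
def positive_predictive_value(tp: int, fp: int) -> float:
    """PPV = TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no positive calls were made")
    return tp / (tp + fp)


def negative_predictive_value(tn: int, fn: int) -> float:
    """NPV = TN / (TN + FN)."""
    if tn + fn == 0:
        raise ValueError("no negative calls were made")
    return tn / (tn + fn)


# Hypothetical counts for illustration.
print(positive_predictive_value(tp=45, fp=5))
print(negative_predictive_value(tn=18, fn=2))
```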
Determination of Detection Limits-Bacteria (E. coli and S. aureus) were first cultured in regular LB medium for 24 h before counting the cell numbers. Aliquots were serially diluted 10 times with 5% glucose solution to determine the limit of detection. The number of bacteria was determined by colony formation on cultured dishes after overnight growth. One microliter of each aliquot sample was subsequently spotted onto MALDI target plates for analysis.
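The number of cells deposited in each spot follows from the stock concentration and the dilution step. The sketch below assumes 10-fold serial dilution steps and a 1 µl spot; the stock concentration and the function name are hypothetical.

```python
def cells_per_spot(stock_cells_per_ml: float, dilution_steps: int,
                   fold: int = 10, spot_volume_ul: float = 1.0) -> float:
    """Estimated cells in one spot after a number of serial dilution steps."""
    cells_per_ul = stock_cells_per_ml / 1000.0  # 1 ml = 1000 µl
    return cells_per_ul * spot_volume_ul / fold ** dilution_steps


# Hypothetical stock of 5e9 cells/ml, spotted after 0 to 4 dilution steps.
for step in range(5):
    print(step, cells_per_spot(5e9, step))
```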

RESULTS
Sample Preparation and System Reproducibility-Several methods to prepare bacterial proteins for protein profiling by MALDI-TOF MS, including extraction of total cellular proteins, fractionation by differential affinity, and isolation of subpopulations using microbead-based chromatographic methods (C8 and IMAC-copper, respectively) (29), were used but ended with low peak contents and low reproducibility of MS spectra (data not shown). In contrast, directly subjecting the bacterial colonies without further extraction to MALDI-TOF MS analysis resulted in rich peak contents of the spectra and the highest reproducibility (data not shown). Therefore, the latter approach was used to generate MS spectra for analysis in subsequent studies. System reproducibility was confirmed by randomly selecting two bacterial isolates for MALDI-TOF MS analysis (data not shown).
¹ The abbreviations used are: HCA, hierarchical clustering analysis; KNN, K-nearest neighbor text categorization method; QC, Quick Classifier; SVM, Support Vector Machine; GA, Genetic Algorithm; RC, recognition capability; CVA, cross-validation achievement; PPV, positive predictive value; NPV, negative predictive value; ANOVA, analysis of variance; ACTH, adrenocorticotropic hormone.
Work Flow of the Study-Fig. 1 illustrates the work flow of this study. Initially 57 well characterized bacterial isolates composed of the six most common human pathogenic bacterial species (training set) were used to examine the feasibility of bacterial classification and identification based on the m/z spectra generated by MALDI-TOF MS. Two approaches were applied: unsupervised hierarchical clustering with subsequent classifier selection and direct model construction using supervised methods such as the Genetic Algorithm. The performance and reliability of the constructed models were evaluated by cross-validation and external validation using two additional sets of independently prepared bacterial isolates (independent set 2 and independent set 3).
Hierarchical Clustering Analysis-We used an unsupervised, two-dimensional HCA to cluster bacterial isolates on the basis of similarity in their MS spectral patterns among the 57 isolates. Remarkably HCA clearly identified six major classes that corresponded exactly to the six bacterial species (Fig. 2A).
We then reduced the number of m/z values for classification. We selected 35 m/z values that best defined each individual class of bacteria (p < 1 × 10⁻¹⁰ via one-way ANOVA) that were then used for HCA. As illustrated in Fig. 2B, the 57 isolates were classified into six major groups corresponding exactly to the six bacterial species. Because the expression patterns of the 35 m/z values across all isolates clustered into six groups (Fig. 2B), the number of m/z values as classifiers was further reduced, and the effects were investigated. Indeed in subsequent analysis, the number of MS peaks was reduced to six via the KNN method with minimal adverse effects on the accuracy of classification (Fig. 3, A and B), and this result was further validated using an independent set of 37 bacterial isolates (independent set 2) (Fig. 3C).
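Evaluating a reduced panel of peaks with KNN and leave-one-out cross-validation can be sketched as below. This is a generic illustration under stated assumptions (Euclidean distance, majority vote, k = 3), not the procedure used to produce Fig. 3; the intensities, labels, and function names are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training samples nearest to the query."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def select_peaks(samples, columns):
    """Keep only the chosen peak columns of every sample."""
    return [([features[c] for c in columns], label) for features, label in samples]


def leave_one_out_rate(samples, k=3):
    """Fraction of samples classified correctly when each is left out in turn."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(samples)


# Hypothetical intensities of two peaks for two species.
isolates = [
    ([9.0, 1.0], "E. coli"), ([8.0, 2.0], "E. coli"),
    ([9.0, 2.0], "E. coli"), ([8.0, 1.0], "E. coli"),
    ([1.0, 9.0], "S. aureus"), ([2.0, 8.0], "S. aureus"),
    ([1.0, 8.0], "S. aureus"), ([2.0, 9.0], "S. aureus"),
]
print(leave_one_out_rate(isolates))
print(leave_one_out_rate(select_peaks(isolates, [0])))
```

Comparing the leave-one-out rate before and after dropping columns gives a simple way to check whether a smaller panel still separates the classes.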
Construction of Classification Models Based on the Known Bacterial Species-The MS spectra of the 57 bacterial isolates composed of the six most common species pathogenic to humans were used to construct the classification models (Fig. 4 and supplemental Fig. S1). Several statistical approaches including the QC, SVM, and GA were used. A series of models were constructed, and representative ones including their RC and CVA are listed in Table I. Of these, model 7 had 100% RC and 99% CVA.
Validation Using Separately Prepared Bacterial Isolates-To evaluate the accuracy of bacterial classification and identification with the established models, one separately prepared group (independent set 2) was subjected to analysis (Table I). Of these, model 7 had a 100% PPV (Table I).
Model 7, containing 18 m/z values (Table II), was then selected for evaluation of its performance and reliability in bacterial identification using the third group of 38 isolates (independent set 3), including 14 isolates of bacteria other than the species used for the model setup. As shown in Table III, except for one isolate of Candida albicans, which was misclassified as K. pneumoniae, the remaining 13 outsider isolates were excluded from the six bacterial species, and the remaining 24 isolates were correctly classified, presenting a 97.4% PPV (37 of 38) and a 92.9% NPV.
Determining the Limits of the Assay-We did a series of bacterial dilutions to examine the minimum number of bacterial cells needed for identification using the rapid protein profiling protocol and model 7 as the classifier. Two bacterial isolates, S. aureus and E. coli, were used. The minimum number for correct identification was determined to be 5.8 × 10³ for E. coli and 5.5 × 10³ for S. aureus. Fig. 5 shows the representative results for E. coli.
We also examined the capability to identify bacteria in bacterial mixtures. We mixed equal amounts of colonies of different species as follows: E. coli and P. aeruginosa; E. coli, P. aeruginosa, and K. pneumoniae; E. coli, P. aeruginosa, K. pneumoniae, and S. aureus; E. coli, P. aeruginosa, K. pneumoniae, S. aureus, and S. enterica; and E. coli, P. aeruginosa, K. pneumoniae, S. aureus, S. enterica, and Streptococcus serogroup B. All of the species present were identified in each mixture, including the one containing all six species (Fig. 6). Detection was maintained when the relative amounts of the different species differed by up to 4-fold (data not shown).
[Table I legend fragment: ... with the exception of the cross-validation achievement, in which 20% of the data were used for evaluation.]
DISCUSSION
... specificity, simple preparation procedures, and little dependence on knowledge of the pathogen to meet the requirements of speed and accuracy. These demands have increased abruptly, particularly after the bioterrorist attacks involving anthrax-laced letters in the United States in 2001 as well as the emergence of new human pathogens, such as severe acute respiratory syndrome in 2003 (30, 31). In this study, we systematically evaluated the feasibility of applying mass spectrometry techniques to meet these demands.
First we searched for the best protocols of protein preparation for rapid and accurate identification of bacteria using MS techniques. We compared different protocols and found that directly subjecting bacterial colonies without further protein extraction to MALDI-TOF MS analysis provided the richest m/z contents with high reproducibility. Consistent findings have been reported before (16, 17, 21, 32-38).
Second we provided several lines of evidence to support the hypothesis that MALDI-TOF MS techniques can be an accurate method for bacterial classification and identification. We used two approaches: an unsupervised method using hierarchical clustering analysis in conjunction with ANOVA and KNN for classifier selection and a supervised method of direct model construction based on known bacterial species. Both of the approaches demonstrated the accuracy of bacterial classification and identification in two independently prepared sets of bacterial isolates (containing 37 and 38 isolates, respectively). Indeed application of MALDI-TOF MS for rapid bacterial identification has been reported before (7,23,24,39,40). A database of more than 3500 MALDI-TOF MS spectra with multiple bacterial strain entries from most bacterial species has been established and used for a rapid screening and characterization of bacteria implicated in human diseases using spectrum fingerprinting analysis. However, the success of identification against this database ranged between 33 and 100% (24). The low percentage results were attributed to poor representation of some species within the database (24).
Herein we demonstrated the precise classification of bacteria among six different species simultaneously. Moreover we demonstrated that independently prepared bacteria can be correctly classified and identified by the classification models composed of only 18 or fewer m/z values. Notably bacteria other than those used for model construction were also discriminated by the classification models. These findings are significant because the less complex the classifiers, the less influence from interlaboratory variations and the more feasible the approaches are for clinical and epidemiologic uses (16,36).
Third we examined the sensitivity of our approaches for bacterial identification. We regularly applied a small portion of a single colony to each assay for MS profiling, which contained about 5 × 10⁶ cells. Through serial dilution, we found that 5 × 10³ was the minimum number of cells for bacterial identification in a pure strain. In mixtures (containing three strains from different species), 3 × 10⁴ was the minimum number of cells for identification. Detection sensitivity can presumably be further increased by improving MS instrumentation, database coverage, and analysis methods. Our findings are clinically important because a bacterial count of 1 × 10⁵ cells/ml in body fluids (other than plasma) is generally regarded as the minimum bacterial count with clinical significance of infection; this is higher than the detection limits of our assays. Our findings therefore suggest that microorganisms obtained from clinical specimens other than blood samples can be subjected to MS analysis using only separation and concentration of microorganisms from tissue components without need for further amplification by culture.
Fourth we tested the capability of identifying bacteria in mixtures containing multiple species. All previous studies applying MS techniques to bacterial identification used pure colonies. Thus, basic knowledge of the target microorganisms and conventional procedures of bacterial isolation, selection, and culture were still required. Wahl et al. (41) reported the use of fingerprinting to identify a single bacterial species in mixtures. However, fingerprinting analysis was dependent on the generation of new fingerprints of the bacterial mixtures, which could not be obtained simply by combining individual fingerprints from databanks because of the phenomena of ion suppression/interference during MS analysis.
In this study, we mixed different bacterial species for MALDI-TOF MS, used the classification models, and found that all individual species could be identified in mixed flora containing up to six species. In contrast to fingerprinting, the classification models containing selected m/z values not only preserved the performance of bacterial classification and reduced the effects of run-to-run or potential interlaboratory variations but also provided the capability to identify different bacterial species in mixed flora even after using bacteria other than those used for the model construction. Our findings suggest the potential for microorganism identification directly from clinical samples without culture by MS techniques in the future (19). Interestingly some of the proteins of model 7 were identified as ribosomal proteins, such as m/z 4436 (50 S ribosomal protein L36 of P. aeruginosa), 4451 (50 S ribosomal protein L36 of S. enterica), and 5212 (50 S ribosomal protein L34 of P. aeruginosa). Indeed there have been proposals to utilize ribosomal proteins as markers for bacterial identification due to their high relative abundance and similarity in copy number per cell (42, 43). Further work is warranted to examine whether ribosomal proteins can be utilized as second step markers to confirm the identity of bacteria that have initially been identified by the methods provided in this study.
In summary, we systematically dissected the feasibility of applying MS techniques for rapid and accurate bacterial classification/identification. We found that directly applying a bacterial colony to MALDI-TOF MS is a simple and reliable method for rapid protein profiling. The selection of a panel of m/z values instead of the whole spectra not only reduces the dependence on spectrum consistency but also achieves a higher bacterial identification rate. Different bacterial species in mixed flora can be identified with a detection limit lower than that regarded as clinically significant for infection. Our findings support the hypothesis that mass spectrometry techniques can be an alternative for highly efficient and accurate bacterial identification. It is hoped that, by further improving instrument sensitivity, database coverage, and analysis methods, rapid and accurate identification of human pathogenic microorganisms without culture will be possible in the near future.
* This work was supported by Research Grant CMRPG340611-12 from Chang Gung Memorial Hospital. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.