Identification of Microorganisms by Liquid Chromatography-Mass Spectrometry (LC-MS1) and in Silico Peptide Mass Libraries.

Over the past decade, modern mass spectrometry (MS) methods have emerged that allow reliable, fast and cost-effective identification of pathogenic microorganisms. Although MALDI-TOF MS has already revolutionized the way microorganisms are identified, recent years have also witnessed substantial progress in the development of liquid chromatography (LC)-MS based proteomics for microbiological applications. For example, LC-tandem MS (LC-MS2) has been proposed for microbial characterization by means of multiple discriminative peptides that enable identification at the species level, or sometimes at the strain level. However, such investigations can be laborious and time-consuming, especially if the experimental LC-MS2 data are tested against sequence databases covering a broad panel of different microbiological taxa. In this proof-of-concept study, we present an alternative bottom-up proteomics method for microbial identification. The proposed approach involves efficient extraction of proteins from cultivated microbial cells, digestion by trypsin and LC-MS measurements. Peptide masses are then extracted from MS1 data and systematically tested against an in silico library of all possible peptide mass data compiled in-house. The library has been computed from the UniProt Knowledgebase covering Swiss-Prot and TrEMBL databases and comprises more than 12,000 strain-specific in silico profiles, each containing tens of thousands of peptide mass entries. Identification analysis involves computation of score values derived from correlation coefficients between experimental and strain-specific in silico peptide mass profiles and compilation of score ranking lists. The taxonomic positions of the microbial samples are then determined by using the best-matching database entries. The suggested method is computationally efficient (less than 2 min per sample) and has been successfully tested on a set of 39 LC-MS1 peak lists obtained from 19 different microbial pathogens. The proposed method is rapid, simple and automatable, and we foresee wide application potential for future microbiological applications.


INTRODUCTION
Rapid and reliable identification of pathogenic bacteria is of vital importance in many areas of public health and is relevant also in the food industry and for biodefense. In the context of clinical microbiology, a large variety of very different techniques, among them biochemical, serological, chemotaxonomic, and more recently spectroscopic, spectrometric and genomic tools are routinely utilized. For example, mass spectrometry-based techniques, such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) have emerged as invaluable tools for accurate and cost-effective identification of microorganisms in today's routine clinical microbiology (1)(2)(3)(4). The MALDI-TOF MS approach allows obtaining the genus and species identity of unknown samples by matching microbial mass spectra against spectral libraries collected from microorganisms with a known taxonomic identity.
While identification is most reliably achieved at the species level, the question of whether MALDI-TOF MS is suitable for identification and discrimination below the species level is still debated in the scientific community (4)(5)(6)(7).
While a large number of studies convincingly demonstrate successful discrimination and identification of pathogenic bacteria by MALDI-TOF MS at the species level, there is also ample evidence for limitations of the taxonomic resolution, particularly at the infraspecies level and when differentiating genetically closely related species (8)(9)(10)(11). For example, differentiation between Escherichia coli and [...] Further limiting factors of the MALDI-TOF MS method are the relatively low resolution, which decreases selectivity, and a reduced dynamic sensitivity, i.e. a lowered detectability of protein signals over a wide concentration range (11).
In contrast to MALDI-TOF MS, liquid chromatography-tandem MS (LC-MS2) generally detects large numbers of signals at very high resolution with very high mass accuracy in a single run (11). Shotgun proteomic methods observe proteolytic cleavage products, often tryptic peptides, instead of intact proteins. This enables MS data collection with high analytical sensitivity. Moreover, coupling of mass spectrometry with chromatographic separation (LC) has been shown to increase the dynamic sensitivity and allows sensitive detection of low-abundance peptides as well. Finally, LC-MS2 is much less restricted to classes of proteins with specific physicochemical properties. Even though proteomic techniques are still complex, rather cost-intensive and limited to well-equipped laboratories, the many advantages of LC-MS have led to an increasing number of activities aiming at evaluating potential applications of LC-MS in microbiology (16)(17)(18)(19)(20).
Various groups have used shotgun proteomics for the classification and identification of pathogenic microorganisms. For example, a proteomics-based workflow for bacterial identification was suggested by Dworzanski; it involved construction of a bacterial proteome database from bacterial genomes, LC-MS2 data acquisition from digested bacterial cell extracts, identification of tryptic peptides and sequence-to-bacterium assignments (21). The approach was later used to determine the relatedness among strains of B. cereus sensu stricto, B. thuringiensis and B. anthracis by estimating fractions of shared peptides derived from a prototype database (22). LC-MS2 has also been used by Tracz and coworkers to identify Biosafety Level 3 bacteria (19). Sequence data from tryptic microbial peptides were obtained and employed for Mascot searches against a database containing concatenated protein sequences derived from microbial genomes. Identification of bacterial species was carried out by summing up matches from unique and degenerate (shared) peptides found per concatenated sequence; a post-culture analysis time of less than 8 hours was reported.
Another alternative based on LC-MS was proposed by Jabbour. Bacterial samples were lysed and subjected to tryptic digestion followed by LC-MS2 (16). Subsequently, peptides were identified and matched against databases. Bacteria were then identified based on the assessment of unique peptides obtained by an algorithm called BACid.
Comparison between microbial protein sequence data obtained by bottom-up tandem MS and reference databases was also performed by Boulund and colleagues. The proposed analysis pipeline (TCUP) not only returned specific genes of reference genomes that matched peptide sequences determined by LC-MS2, but also provided the relative abundances of individual bacteria identified in a given mixed culture (23). In this way, TCUP allowed typing and characterizing pathogenic bacteria from pure cultures and estimating the relative abundances of individual microbial species from mixed microbial samples (23). In the same year, Berendsen and co-workers suggested a generic LC-MS2 method for the identification of microorganisms from positive blood cultures (18). An LC-MS-compatible sample preparation method was developed that enabled accurate identification of bacteria grown in blood culture flasks to the species level based on LC-MS2 bottom-up proteomics, database searches and matching with taxon-specific discriminative peptides.
Advantages of the LC-MS2-based approaches for bacterial identification outlined above are the excellent accuracy of identification, high taxonomic resolution, universal applicability to the ever-growing number of known microbes and the ability to identify bacteria from mixtures, e.g. in polymicrobial infections. On the other hand, the comparatively high computational requirements have to be mentioned. Since the time required for peptide identification correlates with the number of entries contained in sequence databases, computation time can be saved by restricting the size of the database, for example by using genus-specific databases. However, database restrictions contradict the use of shotgun proteomics as an unbiased approach for microbial identification. Another important limitation of LC-MS2-based approaches is the severely reduced accuracy and sensitivity of common search algorithms when extensive protein sequence databases are used. The large search space impedes the identification of true peptide matches within large numbers of similar sequences.
With this proof-of-concept study, we introduce an alternative, easy-to-use and computationally less demanding approach for microbial identification. The proposed method is based on bottom-up proteomics as the analytical technique and involves acquisition of LC-MS data from pure microbial cultures. MS1 data are extracted and tested against a database compiled in-house using public protein databases (UniProtKB) with currently more than 12,000 strain-specific in silico mass profiles. We demonstrate that the MS1 information can be used for rapid and accurate taxonomic identification, at least at the species level, and discuss possibilities to combine the suggested analysis pipeline with known MS2-based analysis methods in microbiology.

Microbial strains:
The performance and accuracy of the proposed method for microbial identification were tested using 19 well-characterized bacterial strains, which were predominantly obtained from established strain collections such as DSM (Deutsche Sammlung von Mikroorganismen), ATCC (American Type Culture Collection) and NCTC (National Collection of Type Cultures). Strains E 125, E 131 and E153 of Burkholderia thailandensis originated from the strain collection at the Robert Koch-Institute (RKI) (24).
An overview of the microbial strains and species utilized is given in Tab

Experimental design and statistical rationale
In this proof-of-concept study we tested the proposed MS1-based identification workflow using proteomic data from 19 different bacterial strains, from which 39 RAW files were collected. The Burkholderia subset (see below and supporting information) included biological and technical replicate spectra. In most cases, LC-MS measurements were shuffled in such a way that technical replicates of the same sample were not measured consecutively.
[...] The Mathworks, Natick, USA). As part of the parseuniprot toolbox (see below), this function supports import of peptide feature text files obtained from LC-MS1 data and performs data pre-processing, including molecular weight (MW) determination by considering charge states, and detection and removal of peak entries originating from oxidized peptides (mass shift +15.99491 Da) as well as from peptides with deamidated glutamine or asparagine residues (mass shift +0.98402 Da). Spectral pre-processing furthermore involved partial removal (underweighting) of low-intensity and low-MW peaks, based on the principle that the relevance of a specific peak for subsequent identification analysis co-varies with its intensity and MW values (see below). As a result of pre-processing, the experimental LC-MS1 data of a given sample are collapsed into a single MS1 mass peak list, which contains the filtered peptide mass data.
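The two pre-processing steps named above (MW determination from charge states and removal of oxidized or deamidated peaks) can be sketched as follows. This is a minimal Python illustration, not the authors' Matlab function: the matching rule (a peak counts as modified if an unmodified partner exists at the mass minus the shift), the 5 ppm tolerance and all function names are assumptions made for this example; only the two mass shifts are taken from the text.

```python
from bisect import bisect_left

PROTON = 1.007276       # mass of a proton in Da
OXIDATION = 15.99491    # mass shift quoted in the text (Da)
DEAMIDATION = 0.98402   # mass shift quoted in the text (Da)


def neutral_mass(mz, charge):
    """Monoisotopic MW of a peptide observed as an [M + zH]z+ ion."""
    return (mz - PROTON) * charge


def _has_mass(sorted_masses, target, tol):
    """True if any entry of the sorted list lies within +/- tol of target."""
    i = bisect_left(sorted_masses, target - tol)
    return i < len(sorted_masses) and sorted_masses[i] <= target + tol


def remove_modified(masses, tol_ppm=5.0):
    """Drop masses lying one modification shift above another mass.

    Assumption of this sketch: the unmodified partner peptide is present
    in the same peak list. The tolerance is a hypothetical parameter.
    """
    ordered = sorted(masses)
    kept = []
    for m in ordered:
        is_modified = False
        for shift in (OXIDATION, DEAMIDATION):
            target = m - shift
            if _has_mass(ordered, target, target * tol_ppm * 1e-6):
                is_modified = True
                break
        if not is_modified:
            kept.append(m)
    return kept
```

For example, `remove_modified([1000.0, 1015.99491, 1500.0, 1500.98402, 2000.0])` keeps only the three masses that have no lighter partner at a distance of one modification shift.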

Sample preparation by suspension trapping (STrap
In the following, such data are referred to as experimental LC-MS1 peak lists or, after conversion into continuous spectra, as LC-MS1 test spectra. For identification analyses, LC-MS1 peak lists (see above) were first imported via the muf data format that is specific to MicrobeMS (30). Inter-spectral distances between experimental LC-MS1 peak lists and in silico peptide mass profiles were then obtained using the function compare with DB of MicrobeMS.

Compilation of the in silico peptide
Bar-coded MS1 test spectra were constructed from LC-MS1 peak lists using MW bins of a relative width of 1.2 ppm. As the distance metric, variance-scaled Pearson's product-moment correlation coefficients (Pareto scaling 0.25) were selected, with data between 2000 and 5500 Da serving as inputs. The calibration range factor was set to a value of 2, giving a total of 125 calibration factor variations; see the MicrobeMS wiki for details (30). In MicrobeMS, correlation coefficient-based inter-spectral distances are converted to score values between 0 and 1000. This is achieved by linear scaling, whereby a score of 1000 can only be achieved if the LC-MS1 test spectrum and a given in silico database profile match entirely (identity). A score of zero, on the other hand, is obtained only when no correlation is present.
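The scoring idea described above can be illustrated with a short Python sketch. It is not the MicrobeMS implementation: variance (Pareto) scaling and the calibration factor variations are left out, the mapping score = 1000 × max(r, 0) is one possible reading of the "linear scaling" mentioned in the text, and all function names are hypothetical. The bin width (1.2 ppm) and the mass window (2000 to 5500 Da) are taken from the text.

```python
import math

MW_MIN, MW_MAX = 2000.0, 5500.0   # mass window quoted in the text (Da)
BIN_PPM = 1.2                     # relative bin width quoted in the text


def n_bins(bin_ppm=BIN_PPM):
    """Number of bins of constant relative width covering the mass window."""
    return int(math.log(MW_MAX / MW_MIN) / math.log1p(bin_ppm * 1e-6)) + 1


def barcode(masses, bin_ppm=BIN_PPM):
    """Sparse bar-coded spectrum: the set of occupied bin indices."""
    step = math.log1p(bin_ppm * 1e-6)
    return {int(math.log(m / MW_MIN) / step)
            for m in masses if MW_MIN <= m < MW_MAX}


def pearson_binary(a, b, n):
    """Pearson correlation of two 0/1 vectors of length n given as index sets."""
    na, nb, nab = len(a), len(b), len(a & b)
    denom = math.sqrt(na * (n - na) * nb * (n - nb))
    return (n * nab - na * nb) / denom if denom else 0.0


def score(test_masses, profile_masses):
    """Assumed mapping of the correlation coefficient to a 0..1000 score."""
    r = pearson_binary(barcode(test_masses), barcode(profile_masses), n_bins())
    return 1000.0 * max(r, 0.0)
```

Because the bar-coded spectra are very sparse (tens of thousands of occupied bins out of several hundred thousand), storing only the occupied bin indices keeps the comparison cheap; the correlation then follows from three set sizes.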
In MicrobeMS, the score values determined between a test spectrum and the strain-specific in silico mass profiles are arranged in a ranking list, and the best-matching database entries are used to determine the taxonomic identity of the strain investigated. This approach is not new and has been used for many years in infrared, Raman, or MALDI-TOF MS identification software solutions such as Bio-Rad's KnowItAll, Bruker's MALDI Biotyper, or bioMérieux's Saramis / VITEK MS. In the current implementation of MicrobeMS, the score ranking list is provided in HTML format, where the 30 best-matching in silico database records are displayed for each LC-MS1 test spectrum analyzed (30); see also the supporting information (SI).
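Compiling such a ranking list is simple once the scores are available. The following Python sketch assumes that scores are held in a dictionary keyed by database entry name; the function name and the data layout are hypothetical and do not reflect the MicrobeMS internals. The default of 30 entries mirrors the list length mentioned in the text.

```python
def ranking_list(scores, top_n=30):
    """Return the top_n (entry, score) pairs, best score first.

    scores: dict mapping an in silico database entry name to its score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def top_hit(scores):
    """Name of the best-matching database entry, or None for empty input."""
    ranked = ranking_list(scores, top_n=1)
    return ranked[0][0] if ranked else None
```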

Overview of the identification analysis workflow:
In this proof-of-concept study we present a computational pipeline which is suitable for identification of pathogenic microorganisms from bottom-up mass spectrometry (MS) data. An overview of the proposed sequence of analysis steps is presented in [...] UniProtKB/Swiss-Prot library with reviewed and manually annotated proteins and the UniProtKB/TrEMBL proteome data (unreviewed, computationally analyzed proteins). The two mutually exclusive databases were merged, and the protein sequence information and scattered metadata were extracted, further processed, re-sorted and stored in a format the microbial identification software can read. The specified data analysis steps were carried out by means of the parseuniprot toolbox, a Matlab-based analysis pipeline specifically developed at RKI. In addition to supplementary command-line Matlab functions, e.g. for merging the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases or for creating and modifying taxonomy white lists (see below), the parseuniprot toolbox comes with three major functions, (i) readdat, (ii) resort and (iii) modfeat, which are executed consecutively and whose analysis steps build upon each other. Fig. 2 schematically illustrates the sequence of the analysis steps readdat, resort and modfeat. The first function, parseuniprot/readdat, was employed for data import from a UniProtKB file format and for generating a first Matlab structure array. In this array each protein is mapped to an array element, whereby each element comprises specific fields that contain protein sequence and metadata as well as protein, proteome and taxonomic identifiers (IDs). In order to reduce the overall amount of data and to exclude entries with unclear taxonomic assignment from further analyses, parseuniprot/readdat also supports "whitelisting" of organisms.
[...] Matlab's Bioinformatics Toolbox, PTM profiling and MW determination routines had to be re-written for performance reasons and, after careful testing, integrated into the parseuniprot toolbox. All calculated MW values of tryptic peptides were then re-indexed according to proteome IDs. As a result, each proteome or taxon endpoint in question was associated with the respective peptide entries, which permitted compiling strain-specific peptide mass lists after sorting, filtering and cleansing. Peak lists are then converted to bar-coded spectra, or profiles, using MW bins of a predefined relative width. Spectra are extracted for all non-empty bins of the first in silico profile. Frequency values are ranked in descending order and MW bin values of the first profile above a certain frequency threshold are set to zero. The procedure has to be repeated for each profile of the in silico database.
Tests revealed that the overall accuracy of identification increased if peptide MW data above the 90th frequency percentile are excluded from further analyses. In this way, ~10% of the MW data with presumably non-specific information are removed from the in silico database. The adjusted in silico spectra and selected metadata were subsequently stored in the pkf file format, see above.
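The frequency-based feature selection can be sketched in Python as follows. The sketch assumes that each in silico profile is represented as a set of occupied MW bin indices and that "frequency" means the number of database profiles in which a bin is occupied; it uses a rank-based cut (the most frequent ~10% of each profile's bins are dropped) with an arbitrary but deterministic tie-break. These choices and all names are assumptions made for illustration, not a description of the modfeat function.

```python
from collections import Counter


def filter_frequent_bins(profiles, percentile=90):
    """Remove, per profile, the bins that are most common across the database.

    profiles: dict mapping a profile name to a set of occupied bin indices.
    Returns a new dict with the filtered sets; the input is left unchanged.
    """
    # Number of profiles in which each bin is occupied.
    freq = Counter(b for bins in profiles.values() for b in bins)

    filtered = {}
    for name, bins in profiles.items():
        # Rank this profile's bins from rare to common (ties broken by index).
        ranked = sorted(bins, key=lambda b: (freq[b], b))
        n_keep = len(ranked) * percentile // 100
        filtered[name] = set(ranked[:n_keep])
    return filtered
```

With realistic profiles of tens of thousands of entries the integer rounding is negligible; for very small toy profiles it can remove a larger share of the bins.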
For the future, it is planned to test and implement more advanced feature selection approaches. Genetic algorithms, for example, could be advantageously applied to identify combinations of taxon-specific peptide markers and could thus help to improve the accuracy of microorganism identification. Fig. S1 shows a screenshot of the graphical user interface of the current version of the parseuniprot toolbox.
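As a rough illustration of how the peptide mass lists underlying the in silico profiles described above can be derived from protein sequences, the following Python sketch performs a simple tryptic digestion and computes monoisotopic peptide masses. It is not the parseuniprot code: the cleavage rule (C-terminal to K or R, not before P), the absence of missed cleavages and modifications, and the function names are assumptions of this example; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added for the free N- and C-terminus


def tryptic_digest(sequence):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (aa in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Monoisotopic MW of an unmodified peptide.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return sum(MONO[aa] for aa in peptide) + WATER
```

Applied to every protein of a proteome, the resulting masses would form one strain-specific peak list, which could then be binned and filtered as described in the text.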
Spectral pre-processing: Pre-processing of raw experimental data aims at increasing the robustness and accuracy of subsequent quantitative or classification analysis (32,33). In the context of the present study, the strategy of pre-processing LC-MS1 spectra was inspired by the following ideas: Firstly, the number of experimentally determined MS1 peaks usually varied between 60,000 and 90,000 per strain. It is reasonable to assume that relevant fractions of these peaks carry non-specific information, with some of them arising from chemically modified peptides. For the sake of simplicity, we assumed that the intensity of such peaks is lower on average than that of peaks from unmodified peptides. Thus, underweighting low-intensity features from MS1 spectra was assumed to have a positive impact on the accuracy of identification. Secondly, high-MW peptides are thought to be somewhat more specific with regard to pathogen identification than short peptides with a lower MW. An important objective of data pre-processing was therefore to eliminate low-intensity peaks in low-MW regions at a higher rate than in high-MW regions. Thirdly, peaks from chemically modified peptides, i.e. from oxidized or deamidated species, are not represented in the in silico database and should thus be identified and removed from experimental data.
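The second idea, removing low-intensity peaks more aggressively at low MW than at high MW, can be illustrated by a simple rule with two intensity cut-offs. The rule and both cut-off values in this Python sketch are purely hypothetical placeholders and do not reproduce the authors' function; only the 2000 Da boundary of the low-MW region is taken from the document.

```python
import math

LOW_MW_LIMIT = 2000.0  # Da; boundary of the low-MW region named in the document


def keep_peak(mw, intensity, strict_log10=6.0, lenient_log10=5.0):
    """Decide whether a peak survives filtering.

    Peaks below LOW_MW_LIMIT must pass the stricter intensity cut-off.
    Both cut-offs (log10 of intensity) are hypothetical example values.
    """
    if intensity <= 0:
        return False
    cutoff = strict_log10 if mw < LOW_MW_LIMIT else lenient_log10
    return math.log10(intensity) >= cutoff


def filter_peaks(peaks, **cutoffs):
    """peaks: iterable of (mw, intensity) pairs; returns the surviving pairs."""
    return [(mw, inten) for mw, inten in peaks if keep_peak(mw, inten, **cutoffs)]
```

In this toy rule a peak of moderate intensity is discarded at 1500 Da but retained at 3000 Da, which mirrors the stated preference for high-MW information.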
The results of pre-processing MS1 data are illustrated by way of example in Fig. 3 [...] Shigella. This observation is supported by taxonomy data. For example, it has been stated that Shigella strains can be viewed from a genetic perspective as subpopulations within E. coli (12,41,42) and some studies even recommend re-classification of Shigella and E. coli, e.g. (42,43).
True misidentification was observed in only a single instance (sample #14). In this example, Yersinia pseudotuberculosis DSM 8992 was identified as Yersinia pestis SCPM-O-DNA-17 (I-2457, top hit); strains of Y. pseudotuberculosis were, however, ranked at positions 2 and 3. As in the previous example, the taxonomy helps to understand this particular test result. At the genomic level, Y. pestis is known to be highly similar to the enteric pathogen Y. pseudotuberculosis (44,45). In fact, Y. pestis can be considered a clone of Y. pseudotuberculosis which has evolved only recently (45,46). Moreover, the top-scored strain of Y. pestis is a member of the subspecies Y. pestis ssp. ulegeica (47). Strains of Y. pestis ssp. ulegeica belong to a branch which is, from a phylogenetic point of view, more closely related to the ancestor Y. pseudotuberculosis than members of other branches of Y. pestis, including recent strains of Y. pestis ssp. pestis (44,48).
[...] Therefore, it can be stated that higher score values were determined for SPEED samples, whereby identification accuracy is thought to benefit only slightly from this advantage. However, it should be pointed out that the data set used is rather small, so a more comprehensive test set would be required to support further conclusions.

DISCUSSION
In this proof-of-concept study, we evaluated the principal applicability of shotgun LC-mass spectrometry (LC-MS1) for microbial identification and explored the taxonomic resolution of the proposed method. To this end, a database of strain-specific in silico peptide MW profiles was constructed from UniProtKB resources and queried with experimental LC-MS1 test spectra obtained from microorganisms grown in pure cultures. The results of these queries are summarized in score ranking lists, which were helpful to obtain insights into the taxonomic identity of the bacteria studied. The suggested approach is generally suitable (38 of 39 cases) for identifying bacteria at the genus and species level, and sometimes even at the strain level. Despite these encouraging results, it should also be noted that in a single instance an ambiguous identification result was found. Targeted screening for combinations of taxon-specific peptide features (feature selection) and a better quality of the underlying protein sequence database, particularly of UniProtKB/TrEMBL (database curation), are suggested as potential starting points to further improve the accuracy of the proposed workflow.
The significantly higher number of signals present in LC-MS data constitutes another important difference to MALDI-TOF MS. While MALDI-TOF MS usually detects limited numbers (< 150) of predominantly high-abundance proteins with housekeeping functions, such as basic ribosomal proteins or nucleic acid-binding proteins (50)(51)(52)(53), usually in the m/z region between 2 and 20 kDa, LC-MS shotgun proteomics usually detects more than 50,000 peptides from a single microbial sample. However, further studies are required to clarify whether the enormous increase of information contained in the LC-MS1 data will indeed lead to a better taxonomic resolution.
Current drawbacks of the LC-MS method are, above all, higher instrument costs and a relatively low dissemination of LC-MS equipment and analysis concepts in clinical or food microbiology laboratories. However, it is to be expected that future technological developments will help to reduce experimental efforts, and it is anticipated that dedicated LC-MS1 systems could help to reduce costs and thus improve dissemination of LC-MS technology.
Compared to published LC-MS2 methods to identify and classify microbial pathogens, the proposed approach offers a number of important advantages. Firstly, collection of high-quality LC-MS1 peptide data takes considerably less time than LC-MS2 proteomics measurements. In the present study, the experimental data admittedly originate from 120 or 160 min gradient LC-MS2 measurements carried out within the context of other project objectives. However, it is known from a large number of published studies and from our own experience that LC-MS1 measurements of microbial samples can be performed within 10 minutes and provide sufficiently large numbers of peptide signals. This fact and the possibility to fully automate sample preparation by the SPEED sample preparation protocol point towards a better sample throughput of the LC-MS1 method compared to LC-MS2 approaches, which could even rival the speed of MALDI biotyping.
Secondly, the computational requirements are significantly reduced because identification of peptides and/or proteins is not necessary. Under our conditions, computational time requirements were negligible (less than 2 minutes). Thirdly, the method does not rely on identifying specific, i.e. discriminating ("unique"), peptides. This is important because the number of such peptides tends to diminish with the ever-growing number of peptides contained in future database versions. Furthermore, the simplicity of the proposed microbial identification method allows defining scores in a straightforward manner and adapting a well-established and even FDA-approved principle in the field of MS-based microbial diagnostics (score ranking lists). An in-depth literature search and internal discussions led us to the conclusion that score definition from LC-MS2 data by using unique peptides constitutes a rather complex problem for which no universally accepted solution has yet been presented.
A major disadvantage compared to LC-MS2-based identification is the need to work with microbial cells from pure cultures, owing to the higher requirements for sample purity. Although we did not test polymicrobial samples, it is reasonable to assume that the presented spectral correlation analysis approach is not applicable to samples other than pure cultures. [...] on the one hand and peptide identification analyses with the help of MS2 data, ideally from order-, family- or genus-specific sequence databases, on the other hand, could be helpful to reduce analysis times and to improve the accuracy of microbial identification as a whole.

CONCLUSIONS
This proof-of-concept study has demonstrated that identification analysis from LC-MS1 data represents a powerful technology that could drive improvements in bacterial identification. The technique utilizes in silico libraries generated from publicly available proteome resources and does not require databases of experimental mass spectra. The proposed pipeline is easy to use, computationally efficient and freely available for both Linux and Windows operating systems. The taxonomic resolution of the method is promising, but improvements, such as well-curated databases, application of feature selection methods, better quality checks as well as rigorously conducted tests with large LC-MS1 data sets, are needed to answer the question whether the suggested approach can be employed in clinical microbiology in a reliable, effective and useful manner.

ACKNOWLEDGEMENTS
The authors are thankful to Max Weydmann for conducting LC-MS measurements of the microbial samples prepared by the STrap method.

DATA AVAILABILITY
The mass spectrometry proteomics data were published under the Creative Commons Attribution Non

Figure 1.
Overview of the proposed LC-MS1 based microbial identification workflow. Pure microbial cultures are prepared and colony material is processed using established sample preparation protocols for shotgun proteomics. Mass spectrometry data are then obtained using LC-MS. MS1 data are extracted and pre-processed for subsequent comparison against a library of in silico mass profiles obtained from UniProtKB/Swiss-Prot and UniProtKB/TrEMBL protein sequence data. This library is composed of MW patterns, or profiles, each representing a characteristic strain-specific combination of peptide masses, whereby peptides may be specific or non-specific in an MS2 context. A ranking list of correlation, or inter-spectral distance, values (i.e. of scores) is established which provides information on the taxonomic identity of the organism studied.

Figure 2.
Schematic workflow for generating an in silico database from UniProtKB/Swiss-Prot and/or UniProtKB/TrEMBL protein sequence data. The Matlab toolbox parseuniprot represents a proteomic pipeline in which three main internal functions, readdat, resort and modfeat, are consecutively executed. The function readdat converts the content from structured text files available from ftp://ftp.uniprot.org into Matlab structure arrays that contain the complete information required to compile the in silico databases. Such arrays are subsequently processed by the functions resort and modfeat; the output of the parseuniprot pipeline is a collection of strain-specific in silico peptide mass profiles suitable for computer-based comparison (pattern matching) with experimental LC-MS1 test spectra.

Figure 3.
Pre-processing and feature selection of LC-MS1 data. MS1 peak data were acquired from a culture of Enterococcus faecalis DSM 20371; sample preparation was carried out according to the SPEED sample preparation protocol (27).
Top row: histogram bar chart of log10-scaled MS1 peak intensities (left) and the molecular weight (MW) distribution (right) of peaks after feature detection by the Minora algorithm (original* data, blue bars) and after pre-processing and feature selection by readlcmstxtfile (processed data, red bars).
Total number of peaks in original / processed MS1 data: 82843 / 42559
Number of oxidized / deamidated peptides found and removed: 389 / 329
Lower row: ratio between the number of peaks present in processed and in original MS1 data as a function of peak intensity (log10-scaled, left), or of the MW (right).
Pre-processing was carried out by readlcmstxtfile, a Matlab function developed in-house. This function has been designed to preferentially remove low-intensity signals in the low-MW region (< 2000 Da). The blue shaded area between 2000 and 5500 Da indicates the MW range used for correlation analysis by MicrobeMS.

Figure 4.
Data analysis workflow for microbial identification based on experimental LC-MS1 data and in silico databases comprising strain-specific peptide mass profiles derived from microbial genomes.