MARMoSET – Extracting Publication-ready Mass Spectrometry Metadata from RAW Files

  • Marina Kiweler
    Affiliations
    Bioinformatics Core Unit, Max Planck Institute for Heart and Lung Research, W.G. Kerckhoff Institute, Ludwigstr. 43, Bad Nauheim, Germany
  • Mario Looso
    Correspondence
    To whom correspondence may be addressed
    Affiliations
    Bioinformatics Core Unit, Max Planck Institute for Heart and Lung Research, W.G. Kerckhoff Institute, Ludwigstr. 43, Bad Nauheim, Germany

    The German Center for Cardiovascular Research (DZHK), Partner Site Rhine-Main
  • Johannes Graumann
    Correspondence
    To whom correspondence may be addressed
    Affiliations
    Scientific Service Group Biomolecular Mass Spectrometry, Max Planck Institute for Heart and Lung Research, W.G. Kerckhoff Institute, Ludwigstr. 43, Bad Nauheim, Germany

    The German Center for Cardiovascular Research (DZHK), Partner Site Rhine-Main
Open Access. Published: May 16, 2019. DOI: https://doi.org/10.1074/mcp.TIR119.001505
      In the context of publishing data sets acquired by mass spectrometry or works based on such molecular screens, metadata documenting the instrument settings are of central importance to the evaluation and reproduction of results. A single experiment may be linked to hundreds of data acquisitions, which are frequently stored in proprietary file formats. Together with community-, repository-, and publisher-specific reporting standards, this state of affairs frequently leads to manual (and thus error-prone) metadata extraction and formatting. Data extracted from a single file also often stand in for an entire file set, implying a risk of unreported parameter divergence. To support quality control and data reporting, the C# application MARMoSET extracts and reduces publication-relevant metadata from Thermo Fisher Scientific RAW files. It is integrated with an R package for easy reporting. The tool is expected to be particularly useful in high-throughput environments such as service facilities with large project numbers and/or sizes.

      Aiming for evaluability and reproducibility of mass spectrometry-based research, and in parallel to the maturation of the field toward the acquisition of ever larger data sets, initiatives to standardize the reporting of instrument settings and other relevant metadata have arisen from the community itself (Taylor et al., 2008; Taylor et al., 2007), from the deposition requirements of public data repositories (Vizcaíno et al., 2014; Jones and Côté, 2008), and from publishers and editors involved in the dissemination of mass spectrometric experiments (Bradshaw et al., 2006; American Chemical Society, 2019). The extraction and reporting of the required metadata remains a tedious process, especially given that OMICS experiments frequently involve hundreds of data files and that data often reside in binary and/or proprietary file formats, which offer an excellent information-to-storage-space ratio yet limit ease of access.
      One such example is the RAW file format produced by Thermo Fisher Scientific's (Bremen, Germany) mass spectrometers. Beyond the acquired spectral data, RAW files also contain instrument settings as metadata, which are required to evaluate and reproduce the results.
      An obvious and common approach to extracting these data is to manually open individual RAW files using the vendor-specific Xcalibur software and copy the required information. Because manual interaction with individual files is tedious and error-prone, however, metadata describing an entire data set are frequently extracted from a single file. When the data set encompasses hundreds of files acquired over a potentially long period, this implies a potential for undetected parameter drift, with implications for laboratory-internal quality control, reporting, and publication. In this context core facility laboratories carry a particularly large burden, as the sheer number of projects they handle further compounds the data access problem. Because publication, and with it metadata reporting, is often years removed from data delivery to customers, the difficulty of extracting parameters frequently implies “data archeology” in deep archives.
      To the best of our knowledge, no software exists to date that addresses the need for both simple reporting from large numbers of RAW files and metadata reduction to a consensus set of parameters. Using the vendor-provided application programming interface (API) RawFileReader (Thermo Fisher, accessed Jan 22, 2019), we created such a tool along with R-based (R Development Core Team, 2008) infrastructure for the generation of tabular representations suitable for intra-laboratory quality control, data reporting, and supplemental material in publications.

      MATERIALS AND METHODS

      System Requirements

      The C# (Microsoft Corporation, 2001) command line tool MARMoSET runs as 64-bit code on Microsoft Windows only. It was compiled in Visual Studio Community (Version 15.8.7, .NET 4.7.03056; https://visualstudio.microsoft.com/vs/, accessed Jan 31, 2019) for the .NET Framework 4.6.1. The accompanying R package is agnostic with respect to operating system and only requires a functional R installation, as well as the package dependencies assertive (Cotton, 2016), jsonlite (Ooms et al., 2018), pathological (Cotton and Thyson, 2017), rlist (Ren, 2016), stringi (Gagolewski and Tartanus, 2018), and magrittr (Bache and Wickham, 2014).
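      A quick way to check for these dependencies from within R is sketched below. The variable names are illustrative only, and whether each package can currently be installed from CRAN is an assumption that should be verified for the R version in use.

```r
# Package dependencies named in the text above.
deps <- c("assertive", "jsonlite", "pathological", "rlist", "stringi", "magrittr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)

# Names of the dependencies that still need to be installed (may be empty).
missing_deps <- deps[!installed]
print(missing_deps)
```

      Any packages reported as missing could then be installed, for example with “install.packages(missing_deps)”, provided they are available from the configured repository.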

      Implementation

      The C# (Microsoft Corporation, 2001) application MARMoSET (https://github.molgen.mpg.de/loosolab/MARMoSET_C) extracts publication-relevant metadata from Thermo Fisher Scientific RAW files as a JSON (Bray, 2017) file.
      The RAW file format as accessed through the RawFileReader API provides multiple levels of metadata, dependent on the system the file was acquired on. A fixed header contains information such as date, original filename, and sample information. The header is followed by a list containing the instrument modules used and their respective methods as strings. The API additionally provides separate entry points for detector-associated data (such as ultraviolet spectrophotometry or mass spectrometry). MARMoSET currently only implements access to MS data, which beyond the acquired spectral data includes further instrument parameters and logs. The structure, string formatting, and location of relevant metadata within the data structure are specific to instrument classes. Using the “IRawDataPlus” interface from the RawFileReader API, MARMoSET collects all relevant metadata from the divergent data structures.
      In the context of liquid chromatography/mass spectrometry (LC/MS) using EASY-nLC ultra high-pressure liquid chromatography instruments (Thermo Fisher Scientific), LC parameters are available in the method strings and are extracted and parsed by MARMoSET. Where parameters of chromatography instruments beyond the EASY-nLC series are retrievable from the RAW file, their inclusion in the reporting infrastructure is expected to be straightforward, and support will be added as such instruments are encountered.
      Depending on whether it is provided with the path to a single RAW file or a directory, MARMoSET either acts on a single file or iterates over a collection of RAW files in a directory making use of parallelization dependent on the hardware resources available. In a first step, the metadata is gathered for each RAW file individually. In order to reduce data from many files into a minimal set of parameters describing the entire collection, the resulting data structures are hash code evaluated and sorted in a dictionary. This information is then used to sort RAW files into groups that share all relevant parameters. Finally, a JSON file is written, comprising a minimal set of parameter groups linked to the corresponding RAW file names (see Fig. 1).
      Fig. 1. Schematic representation of data processing by MARMoSET including excerpts of the data representations involved.
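      The reduction of per-file metadata into parameter groups can be illustrated with a minimal R sketch. It is not the C# implementation: the file names, parameter names, and values are invented for illustration, and a plain string key stands in for the hash code used by MARMoSET.

```r
# Hypothetical per-file metadata; names and values are invented for illustration.
file_metadata <- list(
  "run_01.raw" = list(resolution = 70000, agc_target = 3e6, max_it_ms = 20),
  "run_02.raw" = list(resolution = 70000, agc_target = 3e6, max_it_ms = 20),
  "run_03.raw" = list(resolution = 35000, agc_target = 3e6, max_it_ms = 20)
)

# Build one key per file from its parameter names (sorted) and values.
# A plain string key stands in for the hash code used by the C# tool.
parameter_key <- function(params) {
  params <- params[order(names(params))]
  values <- vapply(params, format, character(1))
  paste(names(params), values, sep = "=", collapse = ";")
}
keys <- vapply(file_metadata, parameter_key, character(1))

# Files sharing a key share all parameters and therefore form one group.
files_by_key <- split(names(file_metadata), keys)

# Keep each parameter set once, linked to the files it describes.
groups <- lapply(unname(files_by_key), function(files) {
  list(files = files, parameters = file_metadata[[files[1]]])
})
str(groups)
```

      In this toy input, the two runs with identical settings collapse into one group, while the run acquired at a different resolution forms a group of its own.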
      To allow easy and intuitive handling of the structured data in the JSON file, we provide an R (R Development Core Team, 2008) package also named MARMoSET (https://github.molgen.mpg.de/loosolab/MARMoSET). It is designed to create compact and clear tables following predefined reporting requirements such as MIAPE or JPR. Moreover, it supports individual selections of parameter sets in a few easy steps.
      On Microsoft Windows, the included function “generate_json()” directly runs the C# command line tool from within R and captures its output in memory. Alternatively, externally generated JSON files may be read as well. In-memory data are reformatted by list flattening into a “data.frame” (function “flatten_json()”). Based on a term-matching table included in the package, the function “match_terms()” extracts and arranges a subset of parameters from the total metadata set in the flattened JSON data for all given parameter groups and generates a table reduced to journal-specific requirements. These tables may then be exported, ready for upload, as tab-delimited text files and Microsoft Excel tables using the “save_all_groups()” function. The included term-matching table can easily be adapted by the user to include further metadata entities or to design individual reporting styles. It is noteworthy that the same toolkit employed here has also been chosen by Trachsel et al. (2018) to facilitate analysis of spectrum-level metadata.
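      The idea behind list flattening and term matching can be sketched in base R. The nested list, the term-matching table, and all column names below are assumptions made for illustration; they do not reproduce the package's internal data structures or the exact behaviour of “flatten_json()” and “match_terms()”.

```r
# Hypothetical nested metadata for one parameter group (invented values).
group <- list(
  instrument = list(name = "Q Exactive", software = "2.9"),
  ms_method  = list(full_scan = list(resolution = 70000, scan_range = "300-1650"))
)

# Flatten the nested list: unlist() joins the nesting levels into dotted names.
flat <- unlist(group)
flat_table <- data.frame(term = names(flat), value = unname(flat),
                         stringsAsFactors = FALSE)

# Hypothetical term-matching table: which flattened terms a guideline asks for
# and how they should be labelled in the report.
term_matching <- data.frame(
  term  = c("instrument.name", "ms_method.full_scan.resolution"),
  label = c("Instrument", "MS1 resolution"),
  stringsAsFactors = FALSE
)

# Keep only the requested terms, in the order given by the matching table.
idx <- match(term_matching$term, flat_table$term)
report <- data.frame(parameter = term_matching$label,
                     value = flat_table$value[idx],
                     stringsAsFactors = FALSE)
print(report)
```

      Adapting the reporting style then amounts to editing rows of the matching table rather than changing code, which mirrors the design choice described above.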

      Exemplary Workflow (Windows)

      In a first step and from within a functional R environment (see https://cloud.r-project.org/ for installation instructions), the MARMoSET R package is installed from the GitHub repository using the package remotes (Hester et al., 2019), which may be achieved by calling “install.packages("remotes"); remotes::install_github("loosolab/MARMoSET", host = "github.molgen.mpg.de/api/v3")”. After loading MARMoSET with “library(MARMoSET)”, a JSON object containing the metadata of grouped RAW files may be created by executing “json <- generate_json("<PATH-TO-RAW-FILES-DIRECTORY>")” and prepared for downstream processing using “flat_json <- flatten_json(json)”. A reporting guideline- (“MIAPE” in this example) and instrument-specific filter is generated by calling “term_matching_table <- create_term_matching_table(instrument_list = c("Thermo EASY-nLC", "Q Exactive - Orbitrap_MS"), origin_key = "miape")” and applied to the JSON object using “vector_of_group_tables <- match_terms(flat_json, term_matching_table)”. Finally, a tab-delimited text file representation may be saved using “save_all_groups(vector_of_group_tables, "<OUTPUT-PATH>")”.
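      The calls above can be collected into a single helper, sketched here. The wrapper name “run_marmoset_report” and its two arguments are hypothetical; the MARMoSET calls themselves are taken from the workflow described above and require Microsoft Windows together with an installed MARMoSET package.

```r
# Hypothetical convenience wrapper around the workflow steps described above.
# raw_dir:     path to a directory containing RAW files
# output_path: location to which the group tables are written
run_marmoset_report <- function(raw_dir, output_path) {
  if (!requireNamespace("MARMoSET", quietly = TRUE)) {
    stop("Package 'MARMoSET' is not installed.")
  }
  # Extract and group the metadata (runs the C# command line tool).
  json <- MARMoSET::generate_json(raw_dir)
  # Flatten the nested JSON structure for downstream processing.
  flat_json <- MARMoSET::flatten_json(json)
  # Build the guideline- and instrument-specific filter (MIAPE in this example).
  term_matching_table <- MARMoSET::create_term_matching_table(
    instrument_list = c("Thermo EASY-nLC", "Q Exactive - Orbitrap_MS"),
    origin_key = "miape"
  )
  # Apply the filter and save one table per parameter group.
  vector_of_group_tables <- MARMoSET::match_terms(flat_json, term_matching_table)
  MARMoSET::save_all_groups(vector_of_group_tables, output_path)
  invisible(vector_of_group_tables)
}
```

      The helper would be called as “run_marmoset_report("<PATH-TO-RAW-FILES-DIRECTORY>", "<OUTPUT-PATH>")”; the instrument list and guideline key would need to be adapted to the instruments and reporting standard at hand.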
      Further use cases and more detailed instructions may be found on the top-level README page of the GitHub repository (https://github.molgen.mpg.de/loosolab/MARMoSET), as well as through R's help system (e.g. by calling “?generate_json”) (Hester et al., 2019).

      RESULTS AND DISCUSSION

      A combination of an ever-increasing number of raw data files per mass spectrometry-based experiment and a strong push for the standardized reporting of metadata for evaluation and reproducibility has made metadata extraction, and its reduction to the smallest common parameter set, a common need within the community. To the best of our knowledge, however, no tool simplifying metadata extraction currently exists. With the MARMoSET suite of tools presented here we fill that gap for mass spectrometric data acquired by Thermo Fisher Scientific's instruments, combining outputs geared toward machine readability (JSON) with outputs for human consumption (tab-delimited text and Microsoft Excel). The resulting information is suited for documentation and reporting, publication, and operations oversight, and is expected to be particularly helpful in environments with high-throughput acquisition of mass spectrometric data, such as core facility laboratories.
      In conclusion, MARMoSET offers tools for the simple extraction of metadata from RAW files. With the intent to particularly serve high-throughput data acquisition environments, the tool enables the straightforward generation of small and clear tables containing just the metadata or parameter information needed. MARMoSET is designed for flexible adaptation to an individual laboratory's needs.

      REFERENCES

        • Taylor C.F., Binz P.-A., Aebersold R., Affolter M., Barkovich R., Deutsch E.W., Horn D.M., Hühmer A., Kussmann M., Lilley K., Macht M., Mann M., Müller D., Neubert T.A., Nickson J., Patterson S.D., Raso R., Resing K., Seymour S.L., Tsugita A., Xenarios I., Zeng R., Julian Jr, R.K. Guidelines for reporting the use of mass spectrometry in proteomics. Nat. Biotechnol. 2008; 26: 860
        • Taylor C.F., Paton N.W., Lilley K.S., Binz P.-A., Julian Jr, R.K., Jones A.R., Zhu W., Apweiler R., Aebersold R., Deutsch E.W., Dunn M.J., Heck A.J.R., Leitner A., Macht M., Mann M., Martens L., Neubert T.A., Patterson S.D., Ping P., Seymour S.L., Souda P., Tsugita A., Vandekerckhove J., Vondriska T.M., Whitelegge J.P., Wilkins M.R., Xenarios I., Yates III, J.R., Hermjakob H. The minimum information about a proteomics experiment (MIAPE). Nat. Biotechnol. 2007; 25: 887
        • Vizcaíno J.A., Deutsch E.W., Wang R., Csordas A., Reisinger F., Ríos D., Dianes J.A., Sun Z., Farrah T., Bandeira N., Binz P.-A., Xenarios I., Eisenacher M., Mayer G., Gatto L., Campos A., Chalkley R.J., Kraus H.-J., Albar J.P., Martinez-Bartolomé S., Apweiler R., Omenn G.S., Martens L., Jones A.R., Hermjakob H. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 2014; 32: 223
        • Jones P., Côté R. In: Thompson J.D., Ueffing M., Schaeffer-Reiss C. (Eds.), Functional Proteomics: Methods and Protocols, Methods in Molecular Biology. Humana Press, Totowa, NJ, 2008: 287-303
        • Bradshaw R.A., Burlingame A.L., Carr S., Aebersold R. Reporting protein identification data: the next generation of guidelines. Mol. Cell. Proteomics. 2006; 5: 787-788
        • American Chemical Society. Guidelines for Reporting Proteomic Experiments Using Mass Spectrometry. American Chemical Society, Washington, D.C., 2019
        • Thermo Fisher :: Orbitrap :: RawFileReader (accessed Jan 22, 2019)
        • R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2008
        • Microsoft Corporation (Ed.). Microsoft C# language specifications. Microsoft Press, Redmond, WA, 2001
        • Visual Studio IDE. Visual Studio, https://visualstudio.microsoft.com/vs/ (accessed Jan 31, 2019)
        • Cotton R. assertive: Readable Check Functions to Ensure Code Integrity. 2016
        • Ooms J., Lang D.T., Hilaiel L. jsonlite: A Robust, High Performance JSON Parser and Generator for R. 2018
        • Cotton R., Thyson J. pathological: Path Manipulation Utilities. 2017
        • Ren K. rlist: A Toolbox for Non-Tabular Data Manipulation. 2016
        • Gagolewski M., Tartanus B. stringi: Character String Processing Facilities. 2018
        • Bache S.M., Wickham H. magrittr: A Forward-Pipe Operator for R. 2014 (R package version 1.5). https://CRAN.R-project.org/package=magrittr
        • Trachsel C., Panse C., Kockmann T., Wolski W.E., Grossmann J., Schlapbach R. rawDiag: An R Package Supporting Rational LC-MS Method Optimization for Bottom-up Proteomics. J Proteome Res. 2018; 17: 2908-2914
        • Bray T. The JavaScript Object Notation (JSON) Data Interchange Format. 2017
        • Hester J., Csárdi G., Wickham H., Chang W., Morgan M., Tenenbaum D. remotes: R Package Installation from Remote Repositories, Including ‘GitHub’. 2019