Numerical compression schemes for proteomics mass spectrometry data

The open XML format mzML, used for representation of MS data, is pivotal for the development of platform-independent MS analysis software. Although conversion from vendor formats to mzML must take place on a platform on which the vendor libraries are available (i.e. Windows), once mzML files have been generated, they can be used on any platform. However, the mzML format has turned out to be less efficient than vendor formats. In many cases, the naïve mzML representation is fourfold or even up to 18-fold larger compared with the original vendor file. In disk I/O limited setups, a larger data file also leads to longer processing times, which is a problem given the data production rates of modern mass spectrometers. In an attempt to reduce this problem, we here present a family of numerical compression algorithms called MS-Numpress, intended for efficient compression of MS data. To facilitate ease of adoption, the algorithms target the binary data in the mzML standard, and support in main proteomics tools is already available. Using a test set of 10 representative MS data files we demonstrate typical file size decreases of 90% when combined with traditional compression, as well as read time decreases of up to 50%. It is envisaged that these improvements will be beneficial for data handling within the MS community.

INTRODUCTION
Open XML formats for representation of MS data have been developed by the proteomics community to facilitate exchange and vendor-neutral analysis of mass spectrometry data. Initially two formats, mzXML (1) and mzData (http://psidev.info/), existed in parallel, until these formats were merged into the single standard format mzML (2). The mzML format has been adopted widely by the proteomics community and is supported by many data processing tools. However, although successfully used in many pipelines, the mzML format has not reached its full usage potential, mainly due to large file sizes in comparison to the raw vendor formats. The file size problem has become more marked with the introduction of recent high-resolution high-frequency mass spectrometers. As an example, a raw data file from a data-independent acquisition experiment using an AB SCIEX TripleTOF resulted in a vendor format data file of 2.5 GB. Conversion of this file to standard mzML resulted in a 46.7 GB file, with a conversion time of about 12 min on a desktop computer (later called high-end I, Fig. 1A) dedicated to the conversion process. If the file is compressed using gzip to lower the storage footprint, the size drops to 21.6 GB, but the conversion now takes 2 hours instead. This [...] (5), where known non-metabolite data points are discarded. Outside the MS community, potential benefits could come from recent work in the numerical computation field (6, 7), where many data types are similar to MS data in terms of precision and smoothness. Nevertheless, perhaps the most relevant recent advance is the emergence of the mz5 format (8), which yields performance increases via a binary representation and optimized libraries, as well as some regular data compression.
While this format, based on the open HDF5 standard (The HDF Group, Champaign, IL), is an efficient representation of mzML files, it suffers from the fact that the files are not readable without native libraries or specialized software, which, to some extent, has hampered its uptake.
Also, while mz5 can be "lossless", default compression implies removal of zero intensity scans, which means the original data cannot be reconstructed, and some algorithms require zero intensity scans for correct functioning.
The standard XML representation used in mzML can be easily viewed as text on any operating system, and it is relatively easy to write a parser in any programming language. We thus sought to overcome the mzML efficiency shortcomings by introducing better compression of the binary data found in mzML files while still leaving the metadata in XML format, and propose such an extension to the format here. Furthermore, we envisage that decompression of this binary data should be easy to incorporate into software tools via permissively licensed stand-alone source code files for C++ and Java, which do not require any external dependencies. We here also exemplify the ease of use by implementing support in several popular tools for proteomics data analysis.

EXPERIMENTAL PROCEDURES AND RESULTS
To efficiently compress the three main types of binary data present in mzML files, i) mass-to-charge ratios, ii) ion counts and iii) retention times, we have developed three new near-lossless compression algorithms, while ensuring for each data type that precision losses are well below the precision of the most advanced mass spectrometers of today. The Numpress Linear Prediction Compression algorithm (hereafter called numLin, relative error < 2e-9) takes advantage of the linearly increasing values in m/z and retention time data, and is optimized for high-resolution m/z data. Ion count data does not linearly increase but requires less stored precision because of the lower instrument precision, and Numpress Short Logged Float (numSlof, relative error < 2e-4) is optimized for this data type. We also developed a second ion count compression (Numpress Positive Integer Count, numPic) and a lossless transformation (Numpress Linear Prediction Transformation, numSafe), which are not used further here, but are presented in the supplementary materials.
Although the least significant of the 16 double-precision decimals are lost in the first conversion to the compressed format for all the algorithms, compression and decompression after this does not incur further losses. To maximize speed, the algorithms are highly local in memory and only need a single traversal of the data. For a complete description of the algorithms we refer to Supplemental Methods, and to the reference implementations in Java and C++, found at https://github.com/ms-numpress/ms-numpress under the Apache 2.0 license.
To compare MS-Numpress to current alternatives for storing mzML data, we extensively evaluated size, write time and read time of available compression schemes on a varied set of data files using different computers (Fig. 1). For this we constructed a test set of 10 MS data files from different vendors, instruments, and experiment types (Fig. 1D and Suppl. Table 1). The test set files included data-dependent acquisition (DDA), selected reaction monitoring (SRM) and data-independent (DIA/SWATH) acquisition modes, and both simple and complex samples, giving a heterogeneous set of distributions of MS1 and MS2 spectrum data and chromatogram binary data arrays of different lengths (Suppl. Fig. 1). These files were converted to mzML, imzML (9) and mz5 (8), without compression, with zlib compression, and with gzip of the entire file. Files were also compressed in multiple different setups using MS-Numpress compressions, resulting in a total of 18 tested compression schemes (Fig. 1C and Suppl. Table 2). To avoid clutter, minor results are left out here, and readers interested in imzML data or individual Numpress results are referred to the supplementary material. The different compression schemes were compared based on file size, read time and write time (Fig. 1B). Benchmarking was performed on four dedicated desktop computers of varying capacity (Fig. 1A), using a custom msconvert (10) build, and timed using a script written in Python. Write time was measured as the total time for an msconvert conversion from the vendor raw format. Since this includes the vendor read time it gives a constant offset, but this constant is in general small compared to the write time. For read benchmarking we wrote a custom program that reads files using the ProteoWizard (10) API. To ensure that all data is read, this program explicitly reads all binary values in the spectra and chromatograms found in the file.
Test files, results, and program binaries are available at the Swestore repository (http://webdav.swegrid.se/snic/bils/lu_proteomics/pub).
The use of near-lossless compression introduces the question of whether one can be sure that no analytically relevant data is lost. We measured the relative errors for the compressed versions of all the files in the test set (Suppl. Table 3 [...] We found that the smallest files were obtained by combining numLin for m/z or retention time data with numSlof for ion count data (numAll), followed by gzipping of the entire file. This yielded an average file size reduction of 87% compared to standard mzML across all 10 test set files (Fig. 2A), with 138% longer write times (Fig. 2B) but 21% shorter read times (Fig. 2C) on average across all tested computers and files (Suppl. Table 7). This format is also half the size of the binary mz5 with zlib compression (mz5zlib), and also smaller than all vendor formats except for AB SCIEX's .wiff files (Suppl. Fig. 2, Suppl. Table 7). In our read speed tests, the text-based mzML formats cannot quite compete with the binary mz5, although the difference is small (20%) in the largest files (Fig. 2C). The effects of minimizing disk I/O through compression are most visible in the largest files on the lower-performance computers (Suppl. Fig. 3 and 4), where the numAll alternatives catch up to mz5zlib. Overall, read times ranged over 3 orders of magnitude (Suppl. Fig. 5), and write times over 4 orders of magnitude (Suppl. Fig. 6).
Two of the test computers were equipped with solid-state drives (SSDs), which are fast enough to remove the disk I/O bottleneck and reveal the next bottleneck: processor speed. On these machines the cost of gzipping becomes apparent, increasing with file size (Fig. 2B), and the largest gains from the fast disk I/O are seen for the processor-light schemes mzML, mz5zlib and numAll (Suppl. Fig. 7 and 8). Zlib compression of individual data arrays also shows a minor slow-down of the write operation, whereas the numAll scheme does not affect write times at all (Fig. 2C).
Even though numAll decreased read times by on average 36% compared to standard mzML (Suppl. [...] file, text-based format, both 1) and 2) could be improved in optimized reader implementations.
As the MS-Numpress compression techniques showed high degrees of compression, which was our primary goal, we set out to implement the technique as part of several proteomics pipelines in order to ensure easy adoption by the proteomics community. The initial implementation in ProteoWizard (10) enables conversion to the format from all major mass spectrometer raw data formats, and provides read access to tools that use the ProteoWizard API for reading files, for example MyriMatch (12) and Skyline (13).
Compression should be especially effective for workflows that use high-resolution profile data for quantitation, because of the large data files this implies, and we therefore implemented support for reading of mzML files with numpress-compressed binaries in OpenMS (14), msInspect (11) and the Proteios Software Environment (15,16). The implementation in OpenMS also means that the complete MS-Numpress compression and decompression algorithms are directly available in the Python scripting language due to the recent Python wrapping of the complete OpenMS library (17).
The jmzML (18) [...] Our results demonstrate the power of some very simple techniques to improve the mzML format with respect to disk space and handling time. There are undoubtedly other algorithms that could further improve on the degree of compression or handling times, but for standard formats we believe it is crucial to provide simple and robust solutions, to minimize both the cost of implementation in tools, and the risk of mistakes in the algorithm. We further provide implemented support for the new algorithms in several tools, and thus give immediate access to the proteomics community. MS-Numpress will also be evaluated through the Proteomics Standards Initiative (PSI) process for formal inclusion in the next mzML release. We hope that this work may also stimulate additional data compression algorithm ideas, which, in the end, will lead to an amendment to the mzML standard to improve data handling for all mzML users.

Figure 2: File size, read and write time compared to standard mzML. A) Standard box plot of file size relative to mzML for 7 data formats. B-C) log2 write time (B) and read time (C) minus log2 mzML write and read time, respectively, for the 10 test files and 7 data formats. Test files are sorted in descending mzML size, and dots represent measurements on individual data files on one of two SSD-equipped computers. Writing of vendor formats could not be tested with this setup and is thus missing in B. Missing read times are due to errors in execution (mzML.gz for file 5 and mz5zlib for file 6) as discussed in the supplementary material, where global statistics for all compression schemes (Suppl. Table 7) and absolute timing figures (Suppl. Figure 5 and 6) are also provided.

Algorithm descriptions
The library provides implementations of four different algorithms: one designed to compress first-order smooth data such as retention time or m/z arrays, one designed to transform first-order smooth data for more efficient zlib compression, and two for compressing non-smooth data with lower precision requirements, such as ion count arrays.
Implementations and unit tests are provided in C++ and Java.

MS Numpress positive integer compression (numPic)
Intended for ion count data, this compression simply rounds values to the nearest integer, and stores these integers in a truncated form which is effective for values close to zero.

MS Numpress short logged float compression (numSlof)
Also targeting ion count data, this compression takes the natural logarithm of values, multiplies by a scaling factor and rounds to the nearest integer. For a typical ion count dynamic range these values fit into two-byte integers, so only the two least significant bytes of the integer are stored.
The scaling factor can be chosen manually, but the library also contains a function for retrieving the optimal numSlof scaling factor for a given data array. In this case optimal refers to the greatest precision storable for this data in 2 bytes. Since the scaling factor is variable, it is stored as a regular double precision float first in the encoding, and automatically parsed during decoding.
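The numSlof steps described above can be sketched as follows. This is a minimal illustration and not the reference implementation: the function names are hypothetical, the little-endian byte layout is an assumption, and the sketch applies ln(x + 1) rather than ln(x) so that zero counts can be represented, which is also an assumption not stated above.

```python
import math
import struct

MAX_U16 = 0xFFFF  # largest integer that fits in two bytes


def slof_optimal_scale(values):
    """Largest scaling factor for which every logged value still fits in two bytes."""
    max_log = max(math.log(v + 1.0) for v in values)  # ln(x + 1): assumption of this sketch
    return float(math.floor(MAX_U16 / max_log)) if max_log > 0 else 1.0


def slof_encode(values, scale):
    """Write the scaling factor as a double, then one two-byte integer per value."""
    out = bytearray(struct.pack("<d", scale))  # byte order is an assumption
    for v in values:
        q = int(round(math.log(v + 1.0) * scale))
        if not 0 <= q <= MAX_U16:
            raise ValueError("value does not fit in two bytes at this scale")
        out += struct.pack("<H", q)
    return bytes(out)


def slof_decode(blob):
    """Read the scaling factor back from the first 8 bytes, then undo log and scaling."""
    (scale,) = struct.unpack_from("<d", blob, 0)
    count = (len(blob) - 8) // 2
    quantized = struct.unpack_from("<%dH" % count, blob, 8)
    return [math.exp(q / scale) - 1.0 for q in quantized]
```

With this layout an array of n values occupies 8 + 2n bytes, compared with 8n bytes for uncompressed double precision.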

MS Numpress linear prediction compression (numLin)
This compression uses a fixed point representation, achieved by multiplication by a scaling factor and rounding to the nearest integer. To exploit the frequent linearity of the data, linear prediction is then used in the following way.
The first two values are stored without compression as 4-byte integers. For each following value a linear prediction is made from the two previous values:
    Xpred = (X(n) - X(n-1)) + X(n)
    Xres = Xpred - X(n+1)
The residual Xres is then stored, using the same truncated integer representation as in numPic.
The scaling factor can be chosen manually, but the library also contains a function for retrieving the optimal numLin scaling factor for a given data array. Again, optimal here refers to the greatest precision for the fixed point byte size. Since the scaling factor is variable, it is stored as a regular double precision float first in the encoding, and automatically parsed during decoding.
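As a rough illustration of the fixed-point and linear prediction steps, the sketch below computes the integers that numLin would go on to store. The function names are hypothetical, the residual sign follows the definition given above, and both the storage of the scaling factor and the truncated integer packing of the residuals are left out.

```python
def lin_residuals(values, scale):
    """Fixed-point conversion followed by linear prediction residuals."""
    ints = [int(round(v * scale)) for v in values]
    out = ints[:2]  # the first two values are kept as they are
    for n in range(1, len(ints) - 1):
        pred = (ints[n] - ints[n - 1]) + ints[n]
        out.append(pred - ints[n + 1])  # Xres = Xpred - X(n+1)
    return out


def lin_restore(residuals, scale):
    """Invert lin_residuals by replaying the same predictions."""
    ints = list(residuals[:2])
    for res in residuals[2:]:
        pred = (ints[-1] - ints[-2]) + ints[-1]
        ints.append(pred - res)
    return [i / scale for i in ints]
```

For evenly spaced input the residuals are zero, which is the case where the truncated integer representation is most compact.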

MS Numpress safe linear prediction (numSafe)
This transformation uses the same linear prediction as numLin, but without the fixed point representation or integer truncation. This means that no compression is achieved and the resulting binary array will be exactly the same size as the input array. Note that some minimal degradation will still occur due to rounding errors in the double-precision operations, but since repeated compression and decompression is rarely performed, the transformation should still be practically lossless.

Truncated integer representation
This encoding works on a 4-byte integer, by truncating initial zeros or ones. If the initial (most significant) halfbyte is 0x0 or 0xf, the number of such halfbytes starting from the most significant is stored in a count halfbyte. This initial count is then followed by the rest of the int's halfbytes, in little-endian order. A count halfbyte c of
    0 <= c <= 8 is interpreted as an initial c 0x0 halfbytes
    9 <= c <= 15 is interpreted as an initial (c-8) 0xf halfbytes
Examples:
    int    c      rest
    0   => 0x8
    -1  => 0xf    0xf
    23  => 0x6    0x7 0x1
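The encoding can be sketched in a few lines. The functions below are hypothetical illustrations that work on lists of halfbytes; how halfbytes are packed into bytes is not described above and is left out. Capping the number of truncated 0xf halfbytes at seven is an assumption inferred from the count range and from the -1 example, where one 0xf halfbyte is still written out.

```python
def encode_int(x):
    """Return [count, rest...] as a list of halfbytes for a 32-bit integer."""
    x &= 0xFFFFFFFF  # two's complement bit pattern
    # Halfbytes from most to least significant.
    halfbytes = [(x >> shift) & 0xF for shift in range(28, -4, -4)]
    lead = halfbytes[0]
    if lead == 0x0:
        limit, offset = 8, 0   # up to eight leading 0x0 halfbytes, c in 0..8
    elif lead == 0xF:
        limit, offset = 7, 8   # up to seven leading 0xf halfbytes, c in 9..15
    else:
        return [0] + halfbytes[::-1]  # nothing to truncate
    n = 0
    while n < limit and halfbytes[n] == lead:
        n += 1
    # The remaining halfbytes follow in little-endian order.
    return [offset + n] + halfbytes[n:][::-1]


def decode_int(code):
    """Inverse of encode_int, returning a signed 32-bit value."""
    count, rest = code[0], code[1:]
    n, fill = (count, 0x0) if count <= 8 else (count - 8, 0xF)
    x = 0
    for i in range(n):
        x |= fill << (28 - 4 * i)  # restore the truncated leading halfbytes
    for i, hb in enumerate(rest):
        x |= hb << (4 * i)         # least significant halfbyte first
    return x - (1 << 32) if x & 0x80000000 else x
```

Called on 0, -1 and 23, encode_int reproduces the three examples listed above.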

Implementation notes
We recommend simply embedding the numpress library source files in your source tree when implementing numpress support in new tools. At the time of writing, all implementations so far are open source, which means there are many reference implementations, especially for C++. For the time being, we also recommend that numpress writer implementations produce mzML 1.1 compliant files, meaning amongst other things that only one compression per binary should be allowed (uncompressed, zlib, numPic, numSlof or numLin), and that the 32/64-bit tag is written out even though it is unnecessary with numpress compressions.
During this project we also implemented support for reading and writing imzML with ProteoWizard, in an attempt to investigate possible gains from not having to encode all data in base64 binary data. While there is the obvious gain in file size from the reduced redundancy, we did not see any conclusive improvements in handling speed using our implementation. Nevertheless, we are in contact with the ProteoWizard team to eventually include this imzML support in ProteoWizard. It should also be noted that imzML was simply used as a means for storing binary data in an external binary file, and we have not added any handling of imaging-relevant information.

Tools and scripts used for testing
For testing we have extended ProteoWizard's msconvert. Added abilities include supporting numpress compressions, allowing both numpress and zlib compression on the same binary data array, writing and reading imzML, as well as explicitly setting the write buffer size. At the time of writing, numpress support and double compression are already included in the official msconvert, and inclusion of the other amendments is under discussion. Access to exact binaries and scripts used for testing will be provided upon request.

Missing values
We were unable to obtain a few measurements. Because of their proprietary nature we cannot write custom vendor files. Reading the largest SWATH DIA file in the mzML.gz format crashed on all computers in a failure to allocate memory, even on the 24 GB machine. Reading of the extracted chromatograms (file 6) crashed for unknown reasons when trying to read the mz5 and mz5zlib formats. Unpromising and likely incorrect preliminary results for the imzML-based formats stopped completion of read and write timing for these formats.
1) The large size difference comes from the very different samples, where 4) is an information-sparse dilution of stable isotope peptides in water, and 5) is an information-dense yeast extract (Suppl. Fig. 1).
2) Chromatograms were extracted using a custom extraction tool, by deconvolution using a top-hat filter of 10 ppm total width. This means peaks within 10 ppm of a known fragment m/z, in MS2 spectra of the swath corresponding to the known precursor m/z, were summed with weights decreasing linearly with distance from the exact fragment mass.
3) This is in mzML format since we cannot write custom Thermo raw files.