Improving Statistical Certainty of Glycosylation Similarity between Influenza A Virus Variants Using Data-Independent Acquisition Mass Spectrometry

Amino acid sequences of immunodominant domains of hemagglutinin (HA) on the surface of influenza A virus (IAV) evolve rapidly, producing viral variants. HA mediates receptor recognition, binding and cell entry, and serves as the target for IAV vaccines. Glycosylation, a post-translational modification that places large branched polysaccharide molecules on proteins, can modulate the function of HA and shield antigenic regions allowing for viral evasion from immune responses. Our previous work showed that subtle changes in the HA protein sequence can have a measurable change in glycosylation. Thus, being able to quantitatively measure glycosylation changes in variants is critical for understanding how HA function may change throughout viral evolution. Moreover, understanding quantitatively how the choice of viral expression systems affects glycosylation can help in the process of vaccine design and manufacture. Although IAV vaccines are most commonly expressed in chicken eggs, cell-based vaccines have many advantages, and the adoption of more cell-based vaccines would be an important step in mitigating seasonal influenza and protecting against future pandemics. Here, we have investigated the use of data-independent acquisition (DIA) mass spectrometry for quantitative glycoproteomics. We found that DIA improved the sensitivity of glycopeptide detection for four variants of A/Switzerland/9715293/2013 (H3N2): WT and mutant, each expressed in embryonated chicken eggs and Madin–Darby canine kidney cells. We used the Tanimoto similarity metric to quantify changes in glycosylation between WT and mutant and between egg-expressed and cell-expressed virus. Our DIA site-specific glycosylation similarity comparison of WT and mutant expressed in eggs confirmed our previous analysis while achieving greater depth of coverage. We found that sequence variations and changing viral expression systems affected distinct glycosylation sites of HA. Our methods can be applied to track glycosylation changes in circulating IAV variants to bolster genomic surveillance already being done, for a more complete understanding of IAV evolution.


Introduction
Influenza A virus (IAV) undergoes antigenic drift and shift, respectively bringing about evasion of immunity by seasonal strains and the emergence of novel strains against which humans have little immunity (1,2). The need for pandemic preparedness has never been clearer, and global surveillance of seasonal and emerging IAV strains continues to be a major public health concern. Glycosylation of the envelope protein hemagglutinin (HA) of IAV has been shown to affect viral antigenicity and immunogenicity (3,4). HA is the main avenue for interaction between IAV and host cells, and HA glycans may modulate receptor binding and allow for evasion of immune responses (5). Therefore, the global surveillance of IAV, which currently only involves genome sequencing, should include detailed and quantitative assessment of glycosylation changes on IAV glycoproteins as well.
It is important to maintain and improve annual vaccine readiness for the seasonal flu, especially because of the potential for the emergence of new epidemic strains. Most IAV vaccine production uses embryonated chicken eggs, but this greater than 70-year-old technology (6) has many drawbacks. The manufacturing process requires a large supply of high-quality eggs. Each dose of an inactivated vaccine requires three or four eggs for tri-and quadrivalent vaccines, respectively. Scaling up production in a short amount of time, such as at the onset of an influenza pandemic, requires 6-12 months lead time to build up chicken flocks. Extensive manufacturing protocols are needed to test for the microbial burden in eggs, and potential exposure of chicken flocks to avian influenza could further restrict egg supply (7). Furthermore, egg-expressed viruses exhibit adaptive mutations in the HA receptor-binding domain to enable binding to avian cellular receptors and infect chicken cells. The changes in receptor binding and antigenicity of these egg-adapted viruses have resulted in decreases vaccine effectiveness (8).
The World Health Organization has highlighted the urgent need for pursuing cell culture as an alternative means of production to increase the availability of influenza vaccine (9). The one mammalian cell-based inactivated influenza vaccine licensed for use by Food and Drug Administration virus grown in Madin-Darby canine kidney (MDCK) cells (10). Such mammalian cell lines, hold advantages over egg-based production. Large stocks of MDCK cells may be frozen and banked in anticipation of pandemic strain emergence. MDCK cells circumvent the problem of egg adaptation. Amino acid sequence of receptor binding and antigenic regions of virus passaged in MDCK culture were found to be identical to those of clinical isolates, while egg-grown virus exhibited key mutations in those regions (11). In the J o u r n a l P r e -p r o o f 2017-2018 influenza season, cell culture-derived vaccine was observed to have a higher relative vaccine effectiveness than the egg-based vaccine among individuals aged 65 years (12).
Furthermore, as noted above, glycosylation plays an important role in antigenic shielding and could affect the immunodominance of antigenic sites. Therefore, quantitative comparisons of glycosylation between the two expression systems would benefit the efforts to improve global surveillance efforts vaccine efficacy.
Our previous work (13) using data-dependent (DDA) mass spectrometry showed measurable glycosylation changes among IAV variants, even when the amino acid substitutions did not affect glycosylation sequons. DDA is a popular method that is capable of quantitative, site-specific glycoprotein characterization. However, precursor ion selection in DDA is stochastic, compromising run-to-run reproducibility, and is biased towards higher abundance glycopeptides, leaving lower abundance ones under-sampled. Data-independent acquisition is an alternative method that has the potential to reduce bias and increase sensitivity and selectivity compared to DDA. Recent work has shown that DIA can improve sensitivity and confident quantification for intact glycopeptides (14)(15)(16); however, due to the unique complexities of the characterization and quantification of glycopeptides, there remain challenges to overcome. Nlinked glycopeptides generally have much lower abundances than unmodified peptides, with a potentially vast hetereogeneity of glycan structures at each site. In addition, glycopeptides that share the same peptide backbone sequence but have different glycans often overlap in reversedphase chromatography, leading to coeluting precursors that make MS/MS spectra interpretation difficult. MS/MS spectra interpretation is further complicated by the ubiquitous presence of highly abundant oxonium fragment ions that are common to many glycopeptides. In this work, we enriched viral samples for glycopeptides prior to mass spectrometry analysis to maximize the J o u r n a l P r e -p r o o f glycopeptide signal, we used a relatively small DIA scanning window size to minimize coeluting precursors, and we used a relatively high collision energy to encourage the appearance of more information-rich fragment ions. We determined that DIA performed better than DDA using a standard glycoprotein alpha-1-acid glycoprotein (AGP). We showed in the A/Philippines/2/1982 (Phil82) IAV sample that there were some specific glycosylation sites with low abundance glycopeptides for which DIA was not as sensitive as DDA. Using DIA, we quantified the sitespecific glycosylation of HA from wild-type strain A/Switzerland/9715293/2013 (H3N2) (SWZ13) and a mutated variant of SWZ13, comparing chicken egg and MDCK cell expression systems. We recapitulated our previous work showing a site of low similarity between the WT and mutant egg-expressed virus in the 165-NVTM glycosylation site (13). We determined that the difference at this site is less pronounced between the WT and mutant cell-expressed virus.
We also observed in the egg versus cell comparisons, the 285-NGSI and 483-NGTY sites are highly similar and highly dissimilar, respectively.

Experimental procedures
We used purified human alpha-1-acid glycoprotein (AGP) (Sigma-Aldrich (St. Louis, MO; cat # G9885) and the A/Philippines/2/1982 (Phil82) strain of IAV as glycoprotein standards. Phil82 was expressed in chicken eggs, and were generously provided to us by Dr. DIA acquisition. MS1 spectra were acquired with resolution 60,000 at m/z 400, 1 microscan per spectrum, scan range m/z 350-1800, 32 ms maximum injection time, and AGC target of 3e6. For MS2, we used 50 non-overlapping isolation windows, 16 m/z wide, to cover the range of m/z 800-1600. Spectra were acquired with resolution 30,000 at m/z 400, 2 microscans, 32 ms maximum injection time, AGC target 1e6, and NCE 35. Profile and centroid data were recorded for MS1 and MS2, respectively.

Data analysis
All raw files were first converted to mzML format (17) using MSConvert (18) (22). DDA preprocessed used default GlycReSoft parameters. For DIA, we first performed deisotoping and charge state deconvolution, then using DIA-Umpire methodology (23), we used retention time to limit MS2 spectra with co-isolating precursor ions. Only precursors with an apex retention time within 0.6 min of the product apex time were used, to account for retention time shifts due to co-isolated precursors with shared product ions. Using a Pearson correlation, we correlated the precursors with the product ions, retaining up to the top 20 co-isolating precursors. For these precursors, MS2 spectra were duplicated to contain information for a single precursor while retaining all of the product ion information. After the preprocessing step, subsequent GlycReSoft searching steps were the same for DDA and DIA files.
The glycopeptide search space was built from the proteomics mzIdentML file and a glycan search space which was generated combinatorially following these glycan composition rules and restrictions. The number of Hex residues was 3-10, the number of HexNAc residues was 2-9, the number of Fuc residues was 0-5, and the number of Neu5Ac residues was 0-4. The number of Fuc was required to be less than the number of HexNAc, and the number of HexNAc was required to be greater than the number of Neu5Ac + 1. In addition, ≤1 sulfation and ≤1 ammonium adduct were considered. The same glycomics search space was used for AGP and IAV samples. GlycReSoft weighted and reported the score towards the localization with the fewest missing ions using a parsimony filter.
We used a false discovery threshold of 0.05. GlycReSoft used a peptide-centric method for scoring glycopeptide-spectrum matches, building upon the previously used coverageweighted binomial model (22). Because we used a relatively high single collision energy for fragmentation without calibration for specific precursor m/z and charge states, we did not J o u r n a l P r e -p r o o f observe peptide+Y ions consistently for all glycopeptides, producing low scores for spectra that were otherwise well matched. Therefore, we did not allow any peaks matching theoretical peptide+Y ions to be included in scoring.
The abundances of glycopeptides that passed the FDR threshold of 0.05 were quantified by aggregating the all MS1 peak areas associated with each glycopeptide identity. Peptides with varying cleavages or PTMs that had the same glycoform were summed. If a glycosite was identified in both tryptic and chymotryptic digestions, only we retained only the set with the highest average abundance across replicates. Only glycan compositions that were observed in at least two replicates were considered positive identifications.

Tanimoto similarity analysis
A detailed discussion of the Tanimoto similarity metric in the context of glycoprotein glycosylation has been previously published (13). We used an R toolkit called RAMZIS (24)  the false positive (α) and false negative (β, or equivalently, 1power) rates. Thus, if we require a significance level of 0.05 (α = 5% of the joint area of both the null and test distributions) and 80% power (β = 20% of the joint area), then an overlap area of ≤25% indicates that we have statistically significant evidence to distinguish false positives and that the data had sufficient power to make a statistically confident conclusion about the similarity of the two samples (13).
The internal quality plots (Figures S7-S8) display internal distributions, composed of Tanimoto coefficients calculated from combinations of replicates within a single sample. RAMZIS uses a few key metrics to evaluate the dataset. The internal quality of the dataset is measured by the confidence score of the internal similarity distributions. This score is relative to the comparison and should be above 2 to reliably distinguish changes in the glycosylation patterns. The observed similarity (i.e., the test similarity between the original, unbootstrapped samples) should be within the central quartiles (25-75%) of the test similarity distribution. This ensures that RAMZIS is accurately simulating the comparison. As a secondary fail safe to prevent spurious differentiation, the overlap between the internal and test distributions should be less than 25% of their total joint area for any differentiable glycosite. Finally, a glycosylation distribution is only differentiable if it has passed the above quality tests and it sees that the test and null similarity distributions have minimal overlap; there should be less than 5% of area overlap in the false positive region, in grey, where the null is greater than the test, and less than 20% of the false negative region, in black, where the test is greater than the null (13,25).

Experimental design and statistical rationale
We acquired three technical replicates for each enzyme-digested sample of AGP, and four technical replicates for all Phil82 and SWZ13 enzymatic digestions. AGP and Phil82 were used J o u r n a l P r e -p r o o f as control samples. All replicates were randomized in the acquisition queue to minimize batch effect bias.

Comparisons of DDA and DIA using glycoprotein standard samples, AGP and Phil82
Glycoforms from all five sites of the A1AG1 isoform of AGP were observed using both DDA and DIA. The number of glycoforms at each site that were observed in at least two out of three replicates are shown in Table 1 and Figure 1. In all five glycosylation sites, DIA identified more glycoforms. Phil82 data are shown in Table 1  The number of glycoforms observed in at least two replicates for these four sites, identified by each acquisition method, is shown in Table 2. (Table S1 shows the data for all other glycosylation sites.) Figure S3C-F displays density plots for the MS1 abundance for all precursors meeting the GlycReSoft FDR threshold of 0.05, from all replicates and enzymatic digestions. Just as in aAGP and Phil82 HA, the median abundances of DIA precursors were lower than those of DDA precursors.
In Supplemental File 3, we compare annotated spectra for glycopeptides that were identified by DDA, but missed by DIA, showing the MS2 score of the matched DDA spectra to the expected score for the DIA spectra where the glycopeptide would have been found. In general, the DIA spectra had fewer b-and y-ions matching the expected fragment ions, resulting in lower MS2 scores than for DDA.

J o u r n a l P r e -p r o o f
We compared the four SWZ13 variants for the four glycosylation sites that were consistently in the following combinations: WT egg vs mutant egg ( Figure 3A), WT cell vs mutant cell ( Figure 3B), WT egg vs WT cell ( Figure 4A), and mutant egg vs mutant cell ( Figure   4B). The bar plots display the aggregated MS1 abundances for the glycoforms at each site by data acquisition method. The Tanimoto similarity plots were made using DIA-acquired data.
Internal quality plots for each comparison can be found in Figures S7 and S8. As described elsewhere (13), the internal quality plots are a quality control measure that provides guidance for how confident one can be in the data set. In general, the green internal distributions in these figures, represent the quality of the data set for each experimental group; a tall and narrow internal distribution is considered better quality than a broad one. For example, site 8-NSTA had the broadest internal distributions, correlating to the fewest number of identifications as well. A summary of the Tanimoto similarities for all combinations is shown in Table 3.

Comparing DIA and DDA performance
Although, we found that DIA was overall more sensitive than DDA, this was not always the case, and we learned that weighing the merits of both methods requires some nuanced thought. DIA is dependent on fast scanning speeds of instrument hardware, which are often not fast enough to achieve the sensitivity needed to identify all glycopeptides in a mixture, especially when the sample mixture is very complex and abundant. In cases where the sample is too complex and abundant, DIA fails to achieve its purported sensitivity. We found this to be especially true for Phil82, which had the most amount of glycopeptides injected on column J o u r n a l P r e -p r o o f compared to all the other samples. We had attempted to inject the same amount of material for all samples, but it is clear from Figure S9, which shows total ion chromatograms (TIC) from selected tryptic digestions of each sample type, that Phil82 had the most complex TIC and highest TIC signal compared to the other sample types. AGP had approximately the same signal intensity, but AGP is less complex, and does not have contaminating host proteins like an IAV sample does. The SWZ13 WT and mutant samples are potential more complex than Phil82, having more glycosylation sites; however, it is clear that IAV protein concentrations that we started with in the digestion protocol was much smaller in all SWZ13 sample compared to Phil82. The TICs of the four SWZ13 samples had lower overall signal intensity and did not the many lower abundance glycopeptides peaks eluting as Phil82 did. This also explains why the SWZ13 samples had fewer overall glycopeptides identified than Phil82 did (Tables 1 and 2).
To better understand why DIA underperformed in Phil82, we parsed the raw data file from one DDA replicate of the tryptic Phil82 sample to extract instrument cycle information using RawTools (26). To calculate the time spent per scan, we divided the duty cycle by the number of MS2 fragmentation events triggered per cycle plus 1 (for the MS1 full scan). We obtained the same information for one DIA replicate of the tryptic Phil82 sample using the ms_deisotope tool from GlycReSoft. The time spent per scan for both DDA and DIA are shown in Figure S4. Across the entire retention time range, the duty cycle, or time between MS1 scans, is longer for DIA than for DDA because DIA had more MS2 scans per cycle, 50 for DIA compared to ≤ 20 for DDA. The time spent per scan for DDA ranged from 160 ms to 250 ms.
DIA is a fully deterministic method, with every cycle consisting of a full MS1 scan followed by 50 MS2 scans, so the time spent per scan stayed constant at approximately 155 ms across the time range. Thus, unlike for DDA, low abundance precursors did not have extra accumulation time, which would result in less intense fragment ions in the tandem mass spectra. Not only was DIA accessing precursors of lower abundance than DDA was ( Figure S3B), but the precursors that were identified by DDA but missed by DIA in the Phil82 samples had the lowest median abundance ( Figure S5). This is consistent with the product ion abundances of the glycopeptides from the four sites in which DIA underperformed. We compared the number and abundance of the b-and y-ions for DDA MS2 spectra that were positively identified versus those ions in the DIA MS2 spectra where they would have been identified but were missed ( Figure S6). We found that the b-and y-ions for these sites were fewer in number and less in abundance for the DIA spectra than for DDA. Our glycopeptide scoring algorithm was set to use solely b-and y-ion matches and, therefore, these sites had fewer identifications. In other words, the precursors that were missed by DIA had the lowest overall abundance, but the acquisition method could not adjust the accumulation time in the mass spectrometer, which explains why the fragment ions might be fewer and smaller.
AGP was a much less complex system, with fewer glycosylation sites, and the SWZ13 samples were lower in concentration at the outset, and for these samples, the sample complexity was sufficiently simple for DIA perform better than DDA. Thus, with the current hardware available to us, DIA would be a good choice to maximize sensitivity and quantification provided the sample matrix is not too complex. For relatively high concentration, high complexity samples with a background of host proteins, DIA alone would not be a good choice.
Supplemental File 3 displays annotated spectra for glycopeptides that were positively identified compared to the DIA spectra in which those same glycopeptide precursors would have appear, if they had passed the scoring threshold.

SWZ13 IAV comparisons
The WT egg vs mutant egg comparison provides a confirmation of the conclusions from our previous work (13). This previous paper, which used only DDA acquisition showed that in egg-expressed viruses, there was a measurable difference in glycosylation between the WT and 5B8 mutant. The present work recapitulates the difference in quantification of the Man9GlcNAc2 glycoform at site 165-NVTM, with Man9GlcNAc2 having lower abundance in the mutant compared to WT. Also reproducible is the high similarity between WT and mutant for site 285-NGSI. Although the data for site 483-NGTY are less confident in the present work, nonetheless, we observed similar quantification for the most highly abundant glycoforms for this site. All of these provide reassuring confirmation that that our analysis is reproducible.
Due to the sparseness of data for site 8-NSTA, we cannot conclude that this site is similar or dissimilar in any of the comparisons we made. However, the other three sites provide data with enough confidence for us to begin to understand how protein sequence changes or different Although the glycosylation state for this site is highly dissimilar in the two expression systems, it is likely that this site would not be involved in immune evasion or receptor binding. The three sites we compared (165-NVTM moderately high similarity, 285-NGSI high similarity, and 483-NGTY low similarity) are preliminary evidence that glycosylation would not affect the binding of antibodies against the HA head group but could conceivably alter binding of antibodies against the stalk.

Conclusion
The glycoproteomics community strives for improving coverage and confidence in glycopeptide assignments, and DIA mass spectrometry holds much promise in this area. Our work demonstrated that DIA improved the sensitivity of site-specific glycopeptide characterization in most cases, allowing for rigorous quantitative comparison of glycosylation similarity However, more work is needed to reliably achieve the DIA promise of complete J o u r n a l P r e -p r o o f p. 20 coverage and high confidence quantification for low abundance glycoforms. In this work, we used a relatively narrow DIA scanning window of 16 m/z windows but still found that spectra had contaminating fragment ions from coeluting precursors. Smaller scanning windows would reduce coelution at the expense of sensitivity. However, with instruments capable of faster scanning speeds, a better balance between specificity and sensitivity may be possible, especially for less complex samples. We also employed a single, relatively high collision energy which did not always produce high-abundance peptide+Y fragment ions in the tandem mass spectra, producing artificially low scores in what otherwise would have been well-matched spectra.
Optimizing collision energies based on the characteristics of the glycopeptide analytes as well as developing new scoring algorithms should help to produce higher quality tandem mass spectra and to capture information for all glycopeptide fragment ion types, respectively. RAMZIS and GlycReSoft may be freely downloaded at https://www.bumc.bu.edu/msr/software/.