Measuring Site-specific Glycosylation Similarity between Influenza a Virus Variants with Statistical Certainty

Antigenic drift in influenza A virus results in poor vaccine effectiveness. Accumulating N-linked glycosylation sites on the envelope protein hemagglutinin shield antigenic regions from adaptive immune responses. Quantitatively understanding how glycosylation similarity correlates with changes in protein sequence variation or viral expression platforms is necessary for improving vaccine design. These results presented demonstrate that a mutant strain of influenza has measurably distinct glycosylation compared to its wild-type counterpart.

Influenza A virus (IAV) mutates rapidly, resulting in antigenic drift and poor year-to-year vaccine effectiveness. One challenge in designing effective vaccines is that genetic mutations frequently cause amino acid variations in IAV envelope protein hemagglutinin (HA) that create new N-glycosylation sequons; resulting N-glycans cause antigenic shielding, allowing viral escape from adaptive immune responses. Vaccine candidate strain selection currently involves correlating antigenicity with HA protein sequence among circulating strains, but quantitative comparison of site-specific glycosylation information may likely improve the ability to design vaccines with broader effectiveness against evolving strains. However, there is poor understanding of the influence of glycosylation on immunodominance, antigenicity, and immunogenicity of HA, and there are no well-tested methods for comparing glycosylation similarity among virus samples. Here, we present a method for statistically rigorous quantification of similarity between two related virus strains that considers the presence and abundance of glycopeptide glycoforms. We demonstrate the strength of our approach by determining that there was a quantifiable difference in glycosylation at the protein level between WT IAV HA from A/Switzerland/9715293/2013 (SWZ13) and a mutant strain of SWZ13, even though no N-glycosylation sequons were changed. We determined site-specifically that WT and mutant HA have varying similarity at the glycosylation sites of the head domain, reflecting competing pressures to evade host immune response while retaining viral fitness. To our knowledge, our results are the first to quantify changes in glycosylation state that occur in related proteins of considerable glycan heterogeneity. Our results provide a method for understanding how changes in glycosylation state are correlated with variations in protein sequence, which is necessary for improving IAV vaccine strain selection.
Understanding glycosylation will be especially important as we find new expression vectors for vaccine production, as glycosylation state depends greatly on the host species.
In the 2017-2018 influenza season, nearly 80,000 influenza-related deaths were reported in the U.S., the highest death toll in 40 years (1). Effective vaccines help to mitigate disease, but influenza A virus (IAV) mutates rapidly, resulting in antigenic drift that causes the need for frequent updates to the influenza seasonal vaccine strains. IAV mutates during viral evolution in two ways. First, antigenic drift causes changes in the protein sequence to accumulate over time, altering the antigenic reactivity of the virus. Such changes may add N-glycosylation sequons that cause antigenic shielding, alter receptor binding avidity, and change the interactions with innate immune system lectins (2). Second, antigenic shift occurs when IAV RNA segments recombine with zoonotic genes to create new pandemic strains (3). New pandemic strains start out with very low glycosylation in the head group, reflecting different selective pressures on IAV in animal hosts than in humans.
The antigenic character of the abundant IAV envelope protein hemagglutinin (HA) depends on protein sequence (4,5) and glycosylation state (6,7). There are defined antigenic regions of HA protein sequence that evolve rapidly (8). Antigenic regions must balance their ability to evade host adaptive immune response with maintaining receptor-binding function (9). HA antigenicity can be assessed through hemagglutination inhibition (HI) data. Correlating HI with protein sequence information allows prediction of antigenicity based on protein sequence (10)(11)(12). HA surface antigenicity is affected strongly by the presence and types of N-glycosylation, which cannot be predicted from genetic sequence or 3D protein coordinates (13). N-Glycosylation sequons accumulate in the protein sequence, giving rise to increasing glycosylation as the strain evolves, shielding of antigenic sites (8,14,15), influencing receptor binding (16,17) and binding to innate immune system molecules (18,19).
Previous research demonstrated the power of MS to identify HA N-glycans site-specifically (20)(21)(22)(23). However, most glycosylated HA models in the literature were constructed using unoptimized, generic glycans without defined information about the glycan antennae. Our recent MCP article is an exception (24). N-Glycosylation is microheterogeneousthere is a distribution of possible glycans at each site-and certain sites may be inaccessible to processing by ER mannosidases compared with other sites (24). Varying degrees of glycan processing affect the lectin-binding and other host interaction properties at these sites. HA is macroheterogeneous; a population of HA proteoforms exists. The glycosylated proteoforms that are most immunogenic are not known. To correlate antigenicity and immunogenicity with glycosylation, rigorous comparison of glycosylation similarities among strains is required (25). Further, the viral expression systems that can produce the most immunodominant glycans are also unknown.
Year-to-year influenza vaccine effectiveness is low, ranging from 10-60% in the years 2004-2018 (26). For effective vaccine design, antigenicity of the candidate vaccine virus must be somewhat similar to current circulating strains, but not too similar (27). Quantifying the similarity of HA strains by correlating protein sequence with antigenicity can be achieved using antigenic cartography, which is a method for mapping the antigenic distance of related protein sequences in a two-dimensional space (10,28). For rigorous quantification of antigenicity, we assert that the similarity of glycosylation state of the candidate virus HA to circulating strains must be included as well. However, a quantitative representation of glycosylation similarity between two strains is not a trivial task because of the heterogeneities of glycoforms and occupation state. Further, because IAV uses host glycosylation machinery, the specific expression platform used to manufacture the vaccine greatly affects the resulting glycosylation of the vaccine virus.
Cao et al. published a method for semi-quantification of viral glycoprotein glycosylation using differential glycosidase digestion and proteomics (29,30). This method determines percent site occupancy and determines the degree to which a glycosite is occupied by high mannose or complex-type Nglycans. For comparison of glycosylation similarity, however, it is necessary to quantify each glycosite glycoform as rigorously as possible. Currently, there are no well-tested methods for making such rigorous glycosylation similarity compar-isons among virus samples. The challenge is that the ability to both identify and quantify sample glycopeptide glycoforms is limited by mass spectrometer instrument speed.
Here, we present a method, using the Tanimoto similarity metric (31,32), for the quantification of statistically rigorous similarity between two related virus strains that considers the presence and abundance of glycopeptide glycoforms. We describe the importance of liquid chromatography-mass spectrometry (LC-MS) data quality on the ability to make rigorous statistical similarity comparisons. Using the Tanimoto metric, we demonstrate our approach to estimate similarity between two sources of IAV, both site-specifically and at the whole protein level. The work presented here was performed using chicken egg-derived IAV, and therefore, the HA glycosylation would not likely be the same as those from human hosts. However, because the majority of IAV vaccines are manufactured using chicken eggs, it is still crucial to quantitatively compare the glycosylation changes between related strains in this system.

Generation of Influenza Viruses Using Reverse Genetics-
The full-length cDNA for HA or a mutated gene and neuraminidase genes of influenza A/Switzerland/9715293/2013 (H3N2) virus were amplified by using SuperScript One-Step RT-PCR (Invitrogen, Grand Island, NY), and the 6:2 recombinant viruses with six internal genes of influenza A/PR/8/1934(H1N1) virus were generated by using reverse genetics as described elsewhere (33). The mutated HA gene has three mutations Q132H, Y219S, and D225N, compared with the WT HA gene, and we name the mutant as 5B8.
Viral Glycopeptide Sample Preparation-A/Switzerland/9715293/ 2013 (H3N2) virus and the mutant 5B8 virus were propagated in specific pathogen free embryonated chicken eggs at 37°C for 72 h. A/Philippines/2/1982 (Phil82) samples, expressed in embryonated chicken eggs, were a generous gift from Dr. Kevan Hartshorn of Boston University School of Medicine. The viruses were purified using ultra-centrifuge as described elsewhere (34). Viral samples were sonicated for 15-20 min in methanol to lyse virion membranes, and then evaporated to dryness in a centrifugal evaporator. Glycoprotein solutions were denatured and reduced with 100 mM ammonium bicarbonate (J.T. Baker, Phillipsburg, NJ), 2-2-2 trifluoroethanol (TFE) (Sigma-Aldrich, St. Louis, MO, #T63002), and 200 mM dithiothreitol (DTT) (Sigma-Aldrich, St. Louis, MO), incubating for 1 h at 65°C. Cysteine residues were alkylated by incubating with 200 mM iodoacetamide (Bio-Rad, Hercules, CA) for 1 h at room temperature in the dark. After alkylation, excess iodoacetamide was quenched with additional DTT, again incubating for 1 h at room temperature in the dark. The amount of TFE in each mixture was diluted to at most 5% with the addition of a 3:1 solution of water/100 mM ammonium bicarbonate. Sequencing-grade trypsin or chymotrypsin (Promega, Madison, WI) was added at an enzyme:substrate ratio of 1:20, and the solutions were incubated overnight at 37°C. After incubation, the enzymes were inactivated by heating the samples for 10 min at 95°C. A small aliquot from each sample was reserved for deglycosylation, and the remainders were enriched using iSPE-HILIC solid-phase extraction columns (HILICON, Umeå, Sweden), as follows. Samples were evaporated to dryness, and the HILIC columns were conditioned with water, then acetonitrile, and equilibrated with 80% acetonitrile/1% trifluoroacetic acid (TFA). Immediately before loading onto HILIC SPE columns, samples were reconstituted in 80% acetonitrile/1% TFA. Columns were washed in sample loading buffer, and then eluted with 1% acetonitrile/0.1% formic acid. Samples were evaporated to dryness, and then resuspended in 1% acetonitrile/0.1% formic acid for LC-MS injection.
The reserved aliquots were incubated with 500 units PNGase F (New England Biolabs, Ipswich, MA) for every 20 mg of glycoprotein overnight at 37°C. After incubation, de-glycosylated samples were cleaned using PepClean C18 spin columns, (Pierce Biotechnology, Rockford, IL), the eluate containing de-glycosylated peptides.
Exoglycosidase Modification of AGP and a/Philippines/2/ 1982 Samples-Alpha-1-acid glycoprotein (AGP) purified from human serum (Sigma #G9885) and A/Philippines/2/1982 (Phil82) were digested in parallel with trypsin and chymotrypsin, as above. Before HILIC enrichment, the digested AGP glycopeptides were incubated with 200 units of a2-3,6,8 Neuraminidase (New England Biolabs # P0720S) for every 20 mg of glycoprotein overnight at 37°C. This was to remove all terminal sialic acid residues from AGP glycans. The following day, the asialo AGP glycopeptides and digested Phil82 glycopeptides were divided into two equal aliquots. The first aliquot was HILIC-enriched, as above. The second aliquot was incubated overnight at 37°C with b1-4 galactosidase (New England Biolabs #P0745) using the same reaction conditions as the neuraminidase above, and then HILIC-enriched. The aliquots with terminal galactose residues removed were denoted AGPgal and Phil82gal.
Data Analysis-All raw files were first converted to mzML format (35) using MSConvert (36) from ProteoWizard version 3.0.11252 with no additional filters.
Proteomics data from de-glycosylated samples were processed using the "Peaks PTM Search" function of the Peaks Studio 8.5 software (Bioinformatics Solutions, Waterloo, ON) (37). AGP was searched against the entire Uniprot Swiss-Prot (38) database containing 555,085 valid entries (release 2017_04) with the species taxonomy set to Homosapiens. Databases for Phil82 and the SWZ13 variants consisted of HA and neuraminidase (NA) sequences from each specific strain and internal IAV protein sequences from A/Puerto Rico/8/1934 (PR8) appended to the host proteome, Gallus gallus (UP000000539 9031; release 2017_06). The Phil82 and SWZ13 databases had 29,551 and 29,628 entries actually searched, respectively. Relevant FASTA files can be found in supplemental Figs. S1 and S2. Technical replicates were searched together. Trypsin or chymotrypsin was specified as the proteolytic enzyme, with maximum of 3 missed cleavages and one nonspecific cleavage were allowed. Cysteine carbamidomethylation and methionine oxidation were specified as fixed and variable modifications, respectively. The precursor ion (MS1) mass error tolerance was specified at 10 ppm and a fragment ion (MS/MS) error tolerance at 0.02 Da. Protein identifications required a minimum of two unique peptides. Default settings were used for setting the threshold score for accepting individual spectra. We used a target-decoy false discovery threshold of 1% at the peptide level. All peptides identified by Peaks Studio are listed in supplemental File 1, including precursor charge, m/z, all modifications observed, peptide identification score, accession numbers, and protein groups of the proteins corresponding to the identified peptides. Protein abundances were calculated by aggregating the MS1 peak areas for all the peptides deconvoluted and identified for a specific protein; supplemental File 2 shows the aggregated areas and the number of distinct peptides assigned for each protein. Peaks Studio search results were exported in mzIdentML (24) format.
LC-MS/MS raw files for all intact HILIC-enriched glycopeptides were processed using the GlycReSoft search engine, a complete open-source software package that assigns glycopeptides from tandem mass spectra (39). GlycReSoft requires the input of a peptide hypothesis, obtained in this work from the Peaks Studio proteomics search described above, and a glycan hypothesis. We constructed the glycan hypothesis combinatorially, following these glycan composition rules: #Hex 3-10, #HexNAc 2-9, #Fuc 0-5, #Neu5Ac 0-4, with #Fuc , #HexNAc and #HexNAc . #Neu5Ac 1 1. The same glycan hypothesis was used for AGP and IAV samples, alike. GlycReSoft default values were used for modification localization thresholds. Glycopeptide abundances were quantified by aggregating the MS1 peak areas of the glycoforms deconvoluted and identified by GlycReSoft, see supplemental Fig. S3. All replicate runs for the glycopeptide assignments of a particular sample were combined into one table, displaying glycopeptide abundances for each replicate (supplemental File S3). Tables and annotated spectra for all glycopeptides, including those with ambiguous localizations are provided in supplemental Files S4-S12.
Bioinformatics Similarity Analysis-The glycopeptide identifications from the GlycReSoft searches included variations in peptide backbone resulting from missed cleavages, or variable oxidation of methionine or deamidation of asparagines. Before performing the bioinformatics analyses, the abundances of all peptide variants from a given sequon with the same glycan composition were summed together so that for each sequon, glycan compositions were not duplicated. In addition, glycoforms with only a single observation across replicates were discarded.
We used a modified Tanimoto coefficient, T, to calculate the similarity of glycosylation between two systems: A and B are vectors assigned to the experimental groups being compared, containing the abundance values of each glycopeptide in the set. For each glycopeptide, the abundances are log scaled to give low weight to low abundance peaks, standardized to be in range [0,1], and then averaged across technical replicates, not including missing values. P A and P B are vectors containing the proportion of observed values to the total possible observations across technical Glycopeptide Similarity replicates. K is a distance scaling term equal to ½11meanðP A ; P B Þ, and d(A, B) is the Manhattan distance between A and B. First, a "null" distribution composed of Tanimoto coefficients is calculated by comparing randomly selected combinations (with replacement) of replicates from A and B groups combined. The null distribution should be near 1 (very high similarity) because the groups are sampled equally. A "test" distribution is made by calculating the Tanimoto coefficients of randomly selected combinations of replicates from group A with those from group B. The observed Tanimoto similarity is the Tanimoto coefficient calculated from all replicates from A to all those of B. The degree of overlap of the null and test distributions reflects the confidence with which we can quantify the glycosylation similarity between the two groups.
The software used in the bioinformatics analysis is freely available upon request.
Experimental Design and Statistical Rationale-For each enzyme-digested sample of AGP, AGPgal, Phil82, and Phil82gal, at least three technical replicates were acquired (supplemental Table  S1). Because of malfunction of the LC unit, one replicate for Phil82 (chymo) and mutant 5B8 IAV HA (chymo) were run on a different Aquity LC unit, but with the same mobile phases and method. As a result, we expected that there would be more differences in glycopeptide identifications and abundances for these two samples between replicates.
AGP and Phil82 were used as control samples. All samples and technical replicates were acquired by LC-MS in randomized order.

RESULTS
Tanimoto Similarity Coefficients-The Tanimoto similarity metric comparing AGP versus AGPgal, Phil82 versus Phil82gal, and mutant 5B8 versus WT SWZ13 HA, were calculated for the proteins as a whole and site specifically. Mutant 5B8 HA had three amino acid substitutions from the WT HA, Q132H, Y219S, and D225N. Similarity distributions are presented in Fig. 2 and Table I for the whole protein comparisons, and in Figs. 3, 4, and 5, and Table II for the site-specific comparisons.
Tanimoto Coefficient Plots-We used the Tanimoto coefficient to estimate glycosylation similarity between two strains of HA, taking into account the abundances of the distribution of glycan compositions, including measurement variabilities. As shown in the example plot in Fig. 1A, we represented the data visually by plotting null and test distributions of glycans on a similarity scale. The x axis shows the Tanimoto similar-ity. The y axis is the density of the distributions (analogous to frequency in a histogram plot).
The null distribution (blue area in Fig. 1A) was made by calculating the Tanimoto coefficient between randomly selected combinations (with replacement) of replicates from sample 1 and sample 2 combined. Because these combinations of replicates were randomly selected from the same combined data set, the distribution of Tanimoto coefficients should be close to 1, or perfectly similar. The test distribution (red area), was composed of Tanimoto coefficients comparing randomly selected combinations (with replacement) of replicates from sample 1 to those of sample 2. The observed Tanimoto similarity (the thick vertical line) is the Tanimoto coefficient calculated from all the replicates of sample 1 to all those of sample 2.
The null and test distributions and their overlap are analogous to null and alternative hypotheses and the test statistic in statistical inference. That is, if the null and test distributions do not overlap significantly, we then reject the null hypothesis and conclude that the two distributions are significantly dissimilar. Else, we fail to reject the null hypothesis because we do not have statistically significant evidence to conclude that the distributions are dissimilar.
The area of overlap to the left of the point of intersection between the two distributions (grey area) is analogous to the Type I Error (a), or the false positive rate, and the area to the right (black area), the Type II Error (b), or false negative rate (equivalently, 1power). A significance level of a = 0.05 (i.e. the grey area is 5% of the joint area of both the null and test distributions) means that for 5% of Tanimoto coefficients in the joint test and null distributions, we do not have statistically significant evidence to determine whether they are false positives or not, and therefore, we conservatively conclude they are false positive. A power of 80% (i.e. b = 0.2) means that the black area is equal to 20% of the joint area of the null and test distributions. A black overlap area of greater than 20% indicates that our data are not sufficiently powered   1. Example Tanimoto plots. Tanimoto similarity is represented on the x axis, with perfectly dissimilar at 0 and perfectly similar at 1. The observed similarity is represented by the vertical black line. The a and b values are the Type I and II errors, respectively. A confidence score of 77 indicates good confidence. A, Ideal plot with narrow null and test distributions and good resolution between the two. B, Poorly resolved null and test distributions, but the distributions are narrow and the confidence score is high, indicating that the two groups are highly similar. C, Poorly resolved null and test distributions, but distributions are wide, indicating poor data quality, and the confidence score is , 77. D, Inconsistencies among sample replicates results in multimodal distributions.
to make a confident conclusion about the similarity between samples. Fig. 1A is an example of an ideal plot with narrow null and test distributions and good resolution between the two. The overlap areas were a = 0.03 and b = 0.00, indicating acceptable levels of Type I and II errors, respectively. Fig. 1B-1D are examples of poorly resolved distributions. The distributions in Fig. 1B overlapped substantially (a = 0.43 and b = 0.30; however, each individual distribution is narrow, indicating that these samples are highly similar. Fig. 1C had very wide distributions because of large variances within both distributions. This may be an indication of poor data quality, and without further experimental validation, we cannot conclude that these samples are dissimilar. Fig. 1D is an example of samples that have one or more replicates that were very distinct from the others, resulting in clustering of some comparisons within the distribution that shows up in the plot as multimodal distributions, an indication that there may have been problems with some sample replicates.
As a data quality control measure, we can overlay internal distributions, composed of Tanimoto coefficients from randomly selected combinations of replicates from a single sample, over the null and test distributions. The plots in supplemental Fig. S4 are the internal quality plots corresponding to the four plots in Fig. 1. For both samples 1 and 2 in supplemental Fig. S4A, the internal distributions were tall and narrow, an indication of good data quality. We can measure the tallness and narrowness of the internal distribution by calculating the ratio of height to variance. In supplemental Fig.  S4A, the distribution for sample A had a height of 4.37 and a variance of 0.00054. Therefore, the distribution ratio for sample 1 was 8135; that for sample 2 was 570. For our purposes, an acceptable internal distribution ratio is 100, that is, the distribution should be at least 100 times as tall as it is wide to be considered good quality. All internal distribution ratio information can be found in supplemental Table S2.
Our confidence of the observed similarity depends on the a and b overlap areas in addition to the internal distribution ratios. For our data, we combined these pieces of information to obtain one confidence score, defined as e 2ða1bÞ multiplied by the lower of the two internal distribution ratios. The minimum acceptable confidence would be when a 1 b = 0.25, a minimum internal ratio of 100, or equivalently, e 20:25 3 100 ¼ 77: The example plots in Fig. 1A and 1B both had scores greater than 77. Those in Fig. 1C and 1D were both less than 77, and therefore, we do not have enough evidence to make a conclusion about their similarities.
Similarity of Exoglycosidase-Generated Pairs-To evaluate the performance of the Tanimoto coefficient, we used wellcharacterized standard glycoproteins including AGP. The A1AG1 isoform of AGP has five N-glycosylation sites, each containing a distribution of complex glycans (40). If we modify those glycans by removing all terminal sialic acid and galactose residues, then the glycosylation of the modified pro-tein, which we denote as AGPgal, is expected to have zero similarity to the unmodified protein. This turned out to be the case ( Fig. 2A and 2B). Because AGP has only complex glycans, nearly all were modified by the combination of sialidase and galactosidase. The Tanimoto distributions for both tryptic and chymotryptic digestions for the A1AG1 protein ( Fig. 2A and 2B) showed a similarity of , 0.01 and , 0.0001 with confidence scores of 466,987 and 6146, respectively. (Internal distributions corresponding to chymotryptic and tryptic AGP versus AGPgal, respectively, are shown in supplemental Fig. S5A and S5B). Site-specific distributions for chymotryptic and tryptic digestions are shown in Fig. 3 and supplemental Fig. S7, respectively. Bar plots showing the standardized mean abundance for glycoforms at each sequon are also shown in these figures. Site 56-NKSV has low coverage in a tryptic digestion compared with the chymotryptic digestion. We identified only three tryptic 56-NKSV glycoforms from AGP and AGPgal, resulting in a very broad null distribution as well as broad internal distributions (supplemental Fig. S8), a confirmation that the data quality was poor for this site. The confidence score for tryptic 56-NKSV was 6. Internal distributions for chymotryptic AGP and AGPgal comparisons are shown in supplemental Fig. S6.
A/Philippines/2/1982 (Phil82) is a more complex and biologically relevant example with which we demonstrate the robustness of the Tanimoto metric. IAV glycans are largely asialo because the neuraminidase (NA) envelope protein cleaves sialic acid residues so that newly formed virions can bud off from the host cell's plasma membrane. Again, we modified the Phil82 HA glycans with b-gal to remove any terminal galactose residues; we denote this sample as Phil82gal. We know from our previous work on site-specific characterization of Phil82 (24) that some Phil82 glycosites have primarily high-mannose N-glycans. These high-mannose glycans would not be cleaved, and therefore, the sites with high-mannose glycans would retain some degree of similarity before and after b-gal treatment, depending on the abundance of the high-mannose glycans. By contrast, Phil82 glycosites occupied by complex N-glycans that have terminal galactose residues would be cleaved and show low similarity after cleavage by b-gal. The Tanimoto plot comparing trypsin-digested Phil82 and Phil82gal data sets for the HA protein as a whole showed a moderately high similarity of 0.62, with confidence score 7459 (Fig. 2D). However, the chymotryptic comparison showed a wide null distribution, which suggests data quality issues. In this particular case, because of poor LC performance, we were not able to acquire three replicates for chymotryptic Phil82gal using the same LC unit. Because our Tanimoto algorithm required a minimum of three replicates, the third technical replicate was acquired using a different LC unit. Although the confidence score was still above 100 (Fig. 2C), we opted to discard the entire Phil82 versus Phil82gal chymotryptic data set because of the known inconsistency of LC in the data collection. Consequently, we  performed all subsequent analyses using only the tryptic data set. Internal distributions for Phil82 v Phil82gal are shown in supplemental Fig. S5C and S5D. Total ion chromatograms for this and all other comparisons were made using Thermo Xcalibur Qual Browse 2.2 and are shown in supplemental Figs. S9, S10, and S11.
Similarities for the six glycosylation sites that were observed in the tryptic Phil82 versus Phil82gal comparison and their confidence scores can be found in Fig. 4, along with bar plots comparing the standardized mean abundance for all identified glycoforms. These comparisons reflected the proportion of high-mannose N-glycans that occupy these sites. Sites 144-NNSF and 483-NGTY had high proportions of complex N-glycans in Phil82, and the similarities to their corresponding sites in Phil82gal were , 0.01 and 0.03, respectively. Sites 165-NVTM and 285-NGSI had high proportions of high-mannose N-glycans, and consequently, the similarities were high for these two sites, at 0.98 and 0.76, respectively. Site 38-NATE, having a mixture of both high-mannose and complex N-glycans, had a similarity in the middle at 0.69. Site 246-NSTG had high-mannose glycans and an observed similarity of 0.70, but poor data quality resulted in very broad distributions. As shown in supplemental Fig. S12, the internal quality plot for Phil82 at the same site could not be made because there was only one glycoform identified in the Phil82 sample for this site, and as a result, the confidence score for this comparison was 0. Additional experimental validation is needed to make a confident assessment of the similarity at this site.
Similarity between IAV Pairs-A/Switzerland/9715293/2013 (SWZ13) is a highly-glycosylated H3N2 strain used in the 2015-2016 influenza vaccine (41,42). It has 12 putative Nglycosylation sites. We examined the WT SWZ13 and a mutant (denoted 5B8), both expressed in embryonated chicken eggs, for this paper. Mutant 5B8 HA has three amino acid substitutions from the WT, Q132H, Y219S, and D225N. None of these substitutions adds or deletes a glycosylation site. To address the question of the extent to which these changes Glycopeptide Similarity affect the conformation of the HA trimer in changing the accessibility of glycan-processing enzymes, we compared the Tanimoto similarity plots for trypsin and chymotrypsin digested virus, respectively. However, like the chymotryptic Phil82 data discussed above, one replicate from the chymotryptic WT SWZ13 HA set was run on a different LC unit, causing data quality issues. The null and test distributions for this comparison overlapped almost completely and were multimodal, indicating clustering of replicates ( Fig. 2C and supplemental Fig. S4C). Therefore, only the tryptic comparisons will be further discussed. Taking together all observed glycosites for the trypsin-digested mutant 5B8 versus WT HA comparison, the Tanimoto similarity between the two was relatively high with a value of 0.77 (Fig. 2F). The confidence score was 553, with well-resolved null and test distributions as well as sharp internal distributions, demonstrating that even variations that do not change any of the glycosylation sites can still result in a measurably distinct population of glycans and glycan abundances throughout the protein. Internal distributions for the mutant 5B8 versus WT HA comparison are shown in supplemental Fig. S5F.
We examined which particular sites contributed most to the overall similarity between mutant 5B8 and WT SWZ13 HA glycoprotein. Among the individual glycosites and the corresponding internal quality plots ( Fig. 5 and supplemental Fig.  S13), 133-NGTS had a confidence score of 0 because there was only one glycoform identified for WT HA and the internal quality plot could not be made for this site. For all other sites, internal quality plots and confidence scores indicated sufficient data quality, and the null and test distributions were well resolved, with the exception of 165-NVTM. The low confidence score for 165-NVTM was because of the broad internal distribution for mutant 5B8 HA (ratio = 48); however, the overlap between the test and null distributions was 0. For this site, therefore, we conclude that although the data quality can be better, the near-perfect resolution between the null and test distributions means that we can be confident in this observed similarity. Four sites had relatively high similarity (144-NSSF, 246-NSTG, 285-NGSI and 483-NGTY), whereas 165-NVTM had relatively low similarity (Table II).
These data confirmed our expectation that the glycosylation site on the stalk region of HA (285-NGSI and 483-NGTY) exhibit a higher degree of glycosylation similarity and those on the globular head domain of HA (144-NSSF, 165-NVTM, and 246-NSTG) exhibit variable similarity because the head receives more immune pressure. Antibodies specific to the antigenic regions of the head domain drive rapid genetic mutation, amino acid substitution, accumulation of new sequons in those regions, whereas the stalk remains relatively stable (43). We measured a similarity of 0.83 and 0.86 at 285-NGSI and 483-NGTY, respectively, both in the stalk region, which confirmed our expectations. However, the three sites in the head domain present a mixed range of similarities: 144-NSSF and 246-NSTG have high similarity (0.89 and 0.84, respectively), whereas 165-NVTM has a similarity of 0.47. Receptor binding sites are also located on the head region, and these sites are conserved across all strains of H3N2 IAV to preserve viral recognition of host receptors. These three head domain glycosylation sites are all located within or near the receptor binding site and known antigenic sites (43), which means that there is competing pressure at these sites to retain the ability to recognize host cell receptors while promoting variation of its amino acid sequence to evade host cell immunity. Our data suggest that these competing pressures are reflected in the degree of similarity of glycosylation, in addition to whether the amino acid sequence changes or not. We observed that 144-NSSF and 246-NSTG had high glycosylation similarity, which suggests that these sites may be contributing to the architectural stability of the receptor-binding site, whereas 165-NVTM, which showed low glycosylation similarity between our two strains, may be exposed to antibody pressure. 165-NVTM has also been shown to bind to surfactant protein D, a collectin molecule that is part of the innate immune system (24,44). Exposure to innate immune pressure may also be driving glycosylation variation at this site. As shown in the barplot for this site in Fig. 5Cii, the main difference that contributes most to the low similarity was in the abundances of the Man8-GlcNAc2 and Man9GlcNAc2 glycoforms. The mutant 5B8 HA has lower abundance of Man9GlcNAc2 compared with WT, and higher of Man8GlcNAc2, suggesting that the 5B8 mutations changed the conformation of the HA trimer enough to increase the accessibility of the glycan processing enzyme ER class I a-mannosidase (ERManI) (24,45).
Although we cannot comment further on the chymotryptic results because of inconsistency between replicates, as discussed above, it is worth noting that the chymotrypsin digestion resulted in identification of glycosylation sites 8-NSTA and 38-NATE in addition to the six that were observed with trypsin. In future experiments, it may be worth considering that certain glycosylation sites are only accessible by chymotrypsin digestion Improving Acquisition Methods-We explored various MS methods tailored for glycoproteomics to achieve the highest reproducibility of data to inform most confidence in similarity calculations. Higher energy collisional dissociation (HCD) suffices for assignment of singly-glycosylated peptides (39).
TopN data-dependent acquisition (DDA) is a common strategy for analysis of peptides. Although this strategy is good for discovery of glycopeptide glycoforms, the stochastic nature of precursor selection by intensity results in replicates with a high number of missing values, limiting our ability to calculate similarity with high confidence. Furthermore, because glycopeptides are low in abundance compared with the unglycosylated peptides in the mixture, glycopeptide precursors often go unselected, further contributing to the missing value problem. Low reproducibility of DDA, therefore, necessitates exploring other acquisition options. We attempted parallel reaction monitoring (PRM) to maximize technical reproducibility, albeit at the expense of limited coverage of glycopeptide precursor ions in a given time window. PRM precursor ion selection is strictly deterministic. Precursors selected for fragmentation are programed by the user before acquisition and are the same for every replicate of the sample, unlike DDA whose precursors are selected stochastically by intensity. The quantification for PRM is done at the MS2 level, using highly informative peptide-Y fragment ion series, imbuing higher confidence to the assignments and to the quantification. However, the capability to discover previously unidentified glycoforms is lost, and there is a limit to how many PRM targets can be selected because of the duty cycle of the mass spectrometer. As a result, fewer glycopeptides were subjected to tandem MS using PRM than DDA, resulting in unsatisfactory ability to compare complete glycosite glycoform distributions among biological samples, which is why we relied on aggregating deconvoluted MS1 precursor ions to determine glycopeptide abundances for this article. CONCLUSIONS Because the composition of the distribution of glycans that occupy a glycosite cannot be predicted from genomic data, quantification of glycopeptides from MS2 fragment ions is not straightforward and requires specialized bioinformatics tools. Quantitative proteomics methods including Skyline (46) allow a range of fixed chemical modifications to proteins, including phosphorylation, acetylation, methylation and acylation, among others (47)(48)(49)(50)(51). The landscape for quantitative glycoproteomics software, which must be able to handle a distribution of glycoforms at each glycosite, is far more limited.
We show that rigorous comparison of glycosite similarity requires careful consideration of glycoproteome complexity. The use of a measured glycome is considered preferable as a means of minimizing the number of glycan compositions used to calculate glycopeptide null distributions relative to use of combinatorial lists or entire glycan databases because the signal would not be split for individual glycans across multiple peptides, the the PPM error tolerance would be tighter for glycans than for glycopeptides with larger absolute mass, and the glycomics experiments could be accomplished with relatively short gradients because there would not as many analytes to separate. In the absence of Glycopeptide Similarity glycomics data, the list of glycans identified from the glycoproteomic data may be used as an alternative. Data confidence would be highest if it were possible to quantify all glycopeptide compositions using tandem MS. However, both DDA and PRM lack the speed necessary to quantify all such glycopeptide compositions for moderately complex glycoproteins such as HA, meaning that it is necessary to use MS1 peak abundances for which tandem MS confirmation of identity is not available. As the fraction of glycopeptides identified from MS1 peaks lacking supporting MS2 identifications in-creases, the overall confidence in the similarity decreases. We are therefore interested in the development of data-independent acquisition (DIA) methods for glycopeptides, which would in theory maximize reproducibility while allowing for the discovery of unknown glycopeptides, and providing more accurate quantification. In principle, the use of DIA for moderately complex glycoproteins such as HA would provide an advantage for similarity comparisons only if there were sufficient ability to select among the several related glycopeptides that elute in a narrow chromatographic retention time window. This is a topic for future work.

DATA AVAILABILITY
The MS data have been deposited to the ProteomXchange Consortium via the PRIDE partner repository with the data set identifier PXD018920 and DOI 10.6019/PXD018920.
The software used in the Tanimoto bioinformatics analysis is freely available upon request.