Glycomics, Glycoproteomics, and Glycogenomics: An Inter-Taxa Evolutionary Perspective

Abstract Glycosylation is a highly diverse set of co- and posttranslational modifications of proteins. For mammalian glycoproteins, glycosylation is often site-, tissue-, and species-specific and diversified by microheterogeneity. Multitudinous biochemical, cellular, physiological, and organismic effects of their glycans have been revealed, either intrinsic to the carrier proteins or mediated by endogenous reader proteins with carbohydrate recognition domains. Furthermore, glycans frequently form the first line of access by or defense from foreign invaders, and new roles for nucleocytoplasmic glycosylation are blossoming. We now know enough to conclude that the same general principles apply in invertebrate animals and unicellular eukaryotes—different branches of which spawned the plants or fungi and animals. The two major driving forces for exploring the glycomes of invertebrates and protists are (i) to understand the biochemical basis of glycan-driven biology in these organisms, especially of pathogens, and (ii) to uncover the evolutionary relationships between glycans, their biosynthetic enzyme genes, and biological functions for new glycobiological insights. With an emphasis on emerging areas of protist glycobiology, here we offer an overview of glycan diversity and evolution, to promote future access to this treasure trove of glycobiological processes.


In Brief
This review article i) assesses the utility of current glycomic, glycoproteomic, and glycogenomic methods to characterize protein glycosylation in less-well-studied eukaryotes; ii) assembles a plausible evolutionary lineage of eukaryotic glycan-protein linkages from the last eukaryotic common ancestor through protists to multicellular plants, invertebrates, and vertebrates; and iii) highlights the diversity of peripheral glycan specializations and modifications with an emphasis on available information from diverse protist kingdoms and invertebrate animals.

Christopher M. West 1,*, Daniel Malzl 2, Alba Hykollari 2,3, and Iain B. H. Wilson 2
Up to half of all eukaryotic proteins are expected to be glycosylated, and while most are secreted or membrane-bound, many are also nucleocytosolic or even mitochondrial. Over the past decades, a range of linkages between sugars and proteins have been discovered. The initial transfer of sugar is determined by primary, secondary, or sometimes even tertiary determinants in the polypeptide structure as well as by the "glycogenomic" capacity of an organism. The glycomes of unicellular, fungal, plant, and animal species diverge highly, although many common elements are found. Several classes of protein glycosylation, defined by the linkage of the reducing terminus to the amino acid, can be traced throughout the protist kingdom and are therefore inferred to have occurred in the last eukaryotic common ancestor (LECA). Of these, N-glycans, GPI anchors, O-GlcNAc, and O-Fuc were retained by higher plants, and N-glycans, GPI anchors, mucin-type O-glycans, and O-GlcNAc were retained in metazoa. Others had their origins within a protist clade, some of which persisted into plants or animals; still others, such as glycosaminoglycans linked to Xyl, appear to be a metazoan "invention." Here we probe into current knowledge of the evolution of these glycans, except for glycosaminoglycans and GPI anchors, which have been examined elsewhere (1, 2). In Figure 1 we map different linkages to a phylogenetic tree; ongoing refinements of the protist tree of life are converging on a cladogram whose two major branches gave rise to the plants and animals (3).

NATURE SHUFFLES THE PACK
In mammals, there are nine basic monosaccharide building blocks, and many of these "cards" are also constituents of nonmammalian glycans. Generally, invertebrates lack sialic acid, except for minimal presence in arthropods and wider occurrence in echinoderms (4); in protists, its presence on parasite glycans is due to scavenging of this sugar from host structures (5). In plants, N-glycans containing core L-fucose and bisecting xylose are highly conserved, while their O-glycans and cell wall polysaccharides are highly diverse (6). Other than Galf in some species (7,8), there are few reports on "noncanonical" monosaccharides in the protein-linked glycans of invertebrates, fungi, or protists, but nonsugar substituents add to the variety, as summarized in Figure 2 (9–25). Examples include sulfation, phosphorylation, methylation, pyruvylation, or zwitterionic moieties (4), many of which are not found in mammalian glycans. There are many holes in our knowledge about protein-linked glycans (we will not even start with glycolipids, lipoglycans, or glycolipid anchors!), but our own and other groups have collectively analyzed glycomes from various protists, fungi, and invertebrates, including model organisms, parasites, hosts for parasites, or species of biotechnological relevance.
Our knowledge of glycomes is only as complete as the methodology to define them. Mass spectrometry is currently the major approach and results in sets of m/z ratios whose structural basis must be interpreted (26,27). Unlike peptide chains, which are normally linear and whose components generally have different masses, the building blocks of glycoconjugates often have the same mass. Furthermore, the ability of sugars to form branched structures leads to a high theoretical complexity, although conserved biosynthetic pathways of, e.g., N-glycans mean some aspects of the structures are accepted without question.
The result is the presence of isomeric and isobaric glycans that are difficult to distinguish: some combinations have exactly the same mass because their atomic composition is identical, whereas others require a highly accurate instrument to be distinguished; in other cases, isotopic peak picking or charge state definition must be done carefully. In an era of increasing use of computational annotations, there is the danger that glycan structures can be completely misassigned. Thus, it is useful to consider which combinations of sugar and nonsugar modifications may yield ambiguous masses.
Years ago, it was difficult to distinguish methylated fucose (+160 Da), hexose (+162 Da), and phosphorylcholine (+165 Da) on a linear MALDI-TOF MS. Thankfully, the isotopic resolution of reflectron instruments enables such combinations to be better resolved, but the pattern of isotopic m/z values should be critically assessed to see whether two glycans of slightly different mass are present in a given sample, never mind that a hexose can be a mannose, a glucose, a galactose, or a galactofuranose. Only the highest-resolution instruments can differentiate phosphate from sulfate (Δm/z of 79.9663 and 79.9569 Da, respectively); methylated hexose and GlcA are also very similar (Δm/z 176.0685 versus 176.0321 Da).
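The near-isobaric increments quoted above can be checked from elemental compositions. The following is a minimal sketch, not a validated tool: the isotope masses are standard rounded values, the elemental formulas for each increment are our assumption of what the quoted masses refer to, and `mono_mass` and `resolving_power` are hypothetical helper names. The resolving-power estimate uses the nominal criterion m/Δm at an arbitrarily chosen m/z of 1500 and ignores peak shape and charge state.

```python
# Monoisotopic masses (Da) of the most abundant isotopes, rounded.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
        "O": 15.99491462, "P": 30.97376200, "S": 31.97207117}

# Assumed elemental composition of each mass increment named in the text.
INCREMENTS = {
    "phosphate": {"H": 1, "P": 1, "O": 3},      # HPO3
    "sulfate": {"S": 1, "O": 3},                # SO3
    "methylhexose": {"C": 7, "H": 12, "O": 5},  # C7H12O5 residue
    "GlcA": {"C": 6, "H": 8, "O": 6},           # C6H8O6 residue
}


def mono_mass(formula):
    """Sum the monoisotopic masses of all atoms in an elemental formula."""
    return sum(MONO[element] * count for element, count in formula.items())


def resolving_power(mz, delta):
    """Nominal m/(delta m) needed to separate two peaks `delta` apart at `mz`."""
    return mz / abs(delta)


if __name__ == "__main__":
    for a, b in [("phosphate", "sulfate"), ("methylhexose", "GlcA")]:
        delta = mono_mass(INCREMENTS[a]) - mono_mass(INCREMENTS[b])
        print(f"{a} vs {b}: delta = {delta:.4f} Da, "
              f"R at m/z 1500 = {resolving_power(1500.0, delta):,.0f}")
```

Under these assumptions, the phosphate/sulfate pair differs by roughly 0.01 Da, which is why only high-resolution instruments separate them at typical glycan masses.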
When combinations of two building blocks or sialic acids are considered, the numbers game becomes more complicated: a difference of 291 or 307 Da is indicative of Neu5Ac or Neu5Gc, respectively. However, 292 or 308 Da can be, respectively, due to two fucoses or a fucose and a hexose (more bizarre combinations are also possible). Even the "humble" addition of 324 Da may not be due to two hexoses (2 × 162 Da), but to the presence of an N-acetylglucosamine and a methyl-2-aminoethylphosphonate (i.e., 203 + 121 Da). The 16 Da difference between sodium and potassium adducts is occasionally unhelpful, as it is also the Δm/z between deoxyhexose and hexose.
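The "numbers game" can be made explicit by enumerating which combinations of building blocks fit an observed mass difference. The sketch below is illustrative only: it uses the nominal (integer) masses quoted in the text, the short labels (`PC` for phosphorylcholine, `MAEP` for methyl-2-aminoethylphosphonate, `MeFuc` for methylated fucose) and the function name `candidates` are our own, and a real annotation would use accurate masses and a larger residue set.

```python
from itertools import combinations_with_replacement

# Nominal mass increments (Da) as quoted in the text.
NOMINAL = {
    "Fuc": 146, "MeFuc": 160, "Hex": 162, "PC": 165,
    "HexNAc": 203, "MAEP": 121, "Neu5Ac": 291, "Neu5Gc": 307,
}


def candidates(delta, tol=0, max_units=2):
    """Return all combinations of up to `max_units` building blocks whose
    summed nominal mass lies within `tol` Da of the observed difference."""
    hits = set()
    names = sorted(NOMINAL)
    for size in range(1, max_units + 1):
        for combo in combinations_with_replacement(names, size):
            total = sum(NOMINAL[name] for name in combo)
            if abs(total - delta) <= tol:
                hits.add(combo)
    return hits


if __name__ == "__main__":
    # Exact nominal match: two hexoses versus HexNAc plus MAEP.
    print(candidates(324))
    # With a 1 Da tolerance, a sialic acid and a fucose pair both fit.
    print(candidates(292, tol=1))
    print(candidates(308, tol=1))
```

Widening `tol` mimics a low-resolution instrument or an incorrectly picked isotopic peak, which is when a sialic acid and a fucose-containing pair become indistinguishable by mass alone.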
Naturally, at higher masses, more combinations become possible. Consider Hex5HexNAc4Fuc1 as found in a typical mammalian biantennary N-glycan with two terminal β1,4-galactose and one core α1,6-fucose residues. The corresponding mass, 1786 Da, is identical to that of a biantennary glycan with terminal β1,3-galactose residues or a hybrid glycan with one LacdiNAc-modified antenna. It can also correspond to a Man5GlcNAc2 modified with a bisecting and intersecting GlcNAc and a core α1,3-fucose if isolated from Dictyostelium. We have previously described other examples of unusual isobaric glycans containing methyl-2-aminoethylphosphonate, methylated hexose, or β-mannose on the reducing GlcNAc (28).
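To illustrate why such structures cannot be told apart by mass, the sketch below reduces two of the structures named above to their compositions and computes the neutral monoisotopic mass of the underivatized glycan. It is a simplified model under stated assumptions: residue masses are standard rounded values, linkage and branching are deliberately ignored, and the residue lists and function names are our own shorthand rather than a curated structure encoding.

```python
from collections import Counter

# Monoisotopic residue masses (Da) and the mass of one water for the free
# reducing end; rounded standard values.
RESIDUE = {"Hex": 162.05282, "HexNAc": 203.07937, "dHex": 146.05791}
WATER = 18.01056

# Map each named monosaccharide to its mass class.
MASS_CLASS = {"Man": "Hex", "Gal": "Hex", "GlcNAc": "HexNAc", "Fuc": "dHex"}


def composition(residues):
    """Collapse a list of monosaccharides to a Hex/HexNAc/dHex composition."""
    return Counter(MASS_CLASS[r] for r in residues)


def neutral_mass(comp):
    """Neutral monoisotopic mass of a free, underivatized glycan."""
    return sum(RESIDUE[k] * n for k, n in comp.items()) + WATER


# Mammalian biantennary N-glycan: Man3GlcNAc2 core, two GlcNAc-Gal antennae,
# one core fucose.
biantennary = ["GlcNAc"] * 2 + ["Man"] * 3 + ["GlcNAc"] * 2 + ["Gal"] * 2 + ["Fuc"]
# Dictyostelium-type glycan: Man5GlcNAc2 plus bisecting and intersecting
# GlcNAc and one core fucose.
dictyostelium = ["GlcNAc"] * 2 + ["Man"] * 5 + ["GlcNAc"] * 2 + ["Fuc"]

if __name__ == "__main__":
    for name, glycan in [("biantennary", biantennary),
                         ("Dictyostelium-type", dictyostelium)]:
        comp = composition(glycan)
        print(name, dict(comp), f"{neutral_mass(comp):.2f} Da")
```

Both lists collapse to Hex5HexNAc4dHex1, so any mass-only annotation is blind to the difference; distinguishing them requires the fragmentation, enzymatic, or chromatographic evidence discussed in the text.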
Some of these structures can be distinguished by analysis in both positive and negative modes or employment of chemical or enzymatic treatments. For instance, a glycan with GlcA is not only sensitive to some glucuronidases, but is often detectable in both positive and negative ion modes, as is a phosphorylated glycan. Sulfated glycans are often only detected as their molecular ions in negative mode, but tend to suffer from source loss especially in positive ion mode (4). Glycans with a phosphate, phosphonate, phosphodiester, α1,2/3/4-linked fucose, or Galf will lose these units to a greater or lesser extent upon treatment with hydrofluoric acid, which will also lactonize α2,3-linked sialic acid (4,29). Fractionation of native glycans by porous graphitized carbon (PGC) or HILIC and/or fluorescent labeling combined with RP- or NP (normal phase)-HPLC prior to MS enables enrichment of less abundant structures or separation of many isobaric/isomeric forms (28,30). Similarly, RP-HPLC separation of permethylated glycans is useful for analyzing isomers (31), but permethylation must involve sulfation-friendly purification methods (32) and may result in loss of zwitterionic glycans. Linkage information via cross-ring fragmentation (33), GC-MS, or NMR data also aids structural definition. Certainly, even a composition based on high-resolution mass spectrometry should be backed up by orthogonal proofs (34,35), especially if novel.

N-GLYCOSYLATION
The GlcNAcβ1-Asn linkage is conserved throughout eukaryotes; the relevant rER oligosaccharyltransferases (OSTs) have their origins in prokaryotes and archaea, though different sugars are transferred. The canonical eukaryotic Glc3Man9GlcNAc2 donor structure appears to have been present in the LECA, but truncated versions are found in various protist taxa, evidently the result of gene loss (36).
Thereby, a reduction in the number of mannose residues, and the addition of novel mannose residues, influences the variety of α-mannosidase-processed derivatives that appear. In some branches, terminal glucosylation associated with chaperone-assisted folding is diminished or absent (37).
Fig. 1 legend (beginning missing): ... which there is a lack of experimental data are not shown. Organisms that branched from the lineage that gave rise to the higher plants are referred to as Group 1, whereas those that gave rise to animals are classified as Group 2. Linkages are inferred to occur in the LECA if they are found in both groups of protists, but the absence of a linkage might be a result of incomplete information. The origin of linkages inferred to originate after the LECA is shown at the relevant branch point. Notes adjacent to linkages indicate names by which they are commonly referred. GPI anchors, specialized glycolipids linked to protein C termini typically via a phosphoethanolamine linker to a nonreducing terminal mannose, were likely present in the LECA (not shown). SNFG symbols for sugars are summarized at the bottom. See text for explanations.

N-glycan modifications
It is now clear that "lower" animals are capable of synthesizing rather complex N-glycan structures (38). Fucosylation, sulfation, glucuronylation, methylation, or phosphodiesters are rather widespread as antennal modifications, but galactose or GalNAc can also be present. For example, mollusks often have methylated glycans, but species-specific differences include antennal methylated blood group A motifs or disubstituted fucose residues (39–41). In nematodes, cestodes, and lepidoptera, phosphorylcholine occurs, but in the honeybee, phosphoethanolamine is found (38,42). On the other hand, schistosomes rather "play" with fucose (43). Some marine organisms are rich in sulfated glycans, whether they be mollusks or echinoderms (39,44). Sulfation is found in insects too, but the positions vary, and we have found sulfate on mannose, fucose, GlcNAc, and galactose residues in different invertebrates (39,44,45). GlcA may be the nearest equivalent to sialic acid in many insects (42) and is also found on the N-glycan antennae of (at least) one nematode (46). Sialic acid is undoubtedly present on N-glycans of Drosophila (47) and echinoderms (29).
The core regions of N-glycans can be highly decorated: beyond core α1,6-fucosylation, there can be galactosylation or sulfation of the α1,6-fucose, core α1,3-fucosylation of either GlcNAc of the chitobiose unit, or β-mannosylation of the reducing terminus (38,42). Indeed, difucosylation of the reducing-terminal GlcNAc is a rather common feature of invertebrates (38). Additionally, the core mannose in schistosomes and some mollusks can carry a xylose as in plants (48). Caenorhabditis elegans intriguingly possesses the most extremely modified N-glycan cores found to date, with multiple fucose and galactose residues (49).
In the protist realm, the diversity is no less bewildering, with smaller versions all the way down to the GlcNAc2 core being transferred from the lipid-linked donor in various protist taxa (50,51). Glc residues that are typically removed after transfer to protein in animals are sometimes retained (52,53), suggesting the absence of a role in quality control (37). Although the metazoan enzymes for processing hybrid and complex forms appear absent, GlcNAc-based antennae reminiscent of complex forms are found in various protists and fungi (Fig. 2), and more examples are likely (e.g., 54–56). Atypical GlcNAcTs have been identified in Trypanosoma brucei (57,58), a finding that shows that GT families can include "reassigned" regio-specificities, which complicates glycophylogenetic comparisons based on genome analyses. Further antennal modifications include long and even branched polymannose chains, atypically linked neutral sugars, poly-LacNAc chains, Galp, Galf, Xyl, and a variety of anionic substituents including methylphosphate, sulfate, and pyruvate (Fig. 2A). On the other hand, in some protist species, there are modifications of the core with xylose and fucose as in other taxa. The detection of many of these structures historically involved release by endoglycosidases, radiolabeling, exoglycosidase digestion, co-chromatography, and, when possible, NMR. The advent of mass spectrometry, coupled with prior knowledge of structures and of the specificities of the enzymes that assemble them, has allowed us to go a very long way toward defining structures, but correlation with the presence or involvement of glycosylation genes is often necessary for explication, as illustrated by a recent study in Toxoplasma gondii (53). However, the diversity of structures that exist is still underestimated because of limitations in enzymatic release or in their ionization and fragmentation in the mass spectrometer. Furthermore, when applied to whole-cell homogenates, these methods tend to favor detection of the most abundant glycans, leaving open the question of rarer protein-specific variants.
The orthogonal approach of analyzing peptides by mass spectrometry can help address these limitations. Typical proteomics workflows, involving proteases such as trypsin, pepsin, Pronase, or proteinase K and nano-RP-HPLC-MS, will measure the mass of N-glycosylated peptides and present a preliminary structural assessment based on prior glycomic knowledge. This of course also provides site-specific information, and the near-universal use of the N-sequon (N-x-S/T, sometimes C; x not P) is a valuable predictor. Notably, T. brucei has two OSTs with distinct preferences for different lipid-linked donors and flanking residues preceding the N-sequon (59). In practice, suppression of glycopeptides during ionization and low abundance of isoforms owing to glycan microheterogeneity usually militate against success. These limitations are partially overcome by scanning for sugar fragment ions (e.g., characteristic oxonium ions) or pre-enrichment of glycopeptides by HILIC or lectins; however, caution is needed as these introduce bias due to selectivity for glycan types. Prepurification of a target protein, if sufficient natural material is available, and top-down approaches enable a more comprehensive profiling of glycan types, attachment sites, and site-specific heterogeneity.
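The N-sequon rule lends itself to a simple sequence scan. The following sketch, with the hypothetical function name `find_sequons`, lists candidate sites in a protein sequence; it treats the sequon as a predictor only, since a matching motif does not establish occupancy, and the optional Cys variant is switched off by default because the text describes it as occasional.

```python
import re


def find_sequons(sequence, allow_cys=False):
    """Return 1-based positions of Asn residues in N-x-S/T motifs (x not P).

    With allow_cys=True, the rarer N-x-C variant is also reported.
    A zero-width lookahead is used so that overlapping motifs
    (e.g., N-N-T-S) are all found.
    """
    third = "STC" if allow_cys else "ST"
    pattern = re.compile(rf"(?=N[^P][{third}])")
    return [match.start() + 1 for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    demo = "ANVTQNPSKNAC"  # made-up peptide for illustration
    print(find_sequons(demo))                  # canonical sequons only
    print(find_sequons(demo, allow_cys=True))  # include N-x-C
```

In the made-up peptide, the Asn followed by Pro is skipped, and the C-terminal N-A-C motif is reported only when the Cys variant is enabled.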
Glycoproteomic studies of nonmammalian eukaryotes are certainly underdeveloped, with the most work directed toward plants such as Arabidopsis and model organisms such as C. elegans (60, 61), but also some data on invertebrate and protist parasites. Using lectin pre-enrichment of glycoproteins or glycopeptides, 141 N-glycan sites were characterized in the parasite T. brucei (59), more than a thousand N-glycan sites were more recently identified on many hundreds of proteins in Trypanosoma cruzi (62), and over 100 were described in an older study in T. gondii (63). In a single Dictyostelium discoideum cell surface glycoprotein, 14 N-glycan types detected by MALDI-TOF-MS were mapped to 15 of 18 predicted N-sequons in HILIC- or ConA-enriched peptides, and notably, substantial but differential microheterogeneity was detected at all 15 sites (64). Lectin enrichment was also employed to explore the N-glycoproteome of an algal species (65). Because of the sheer magnitude of the glycoproteomic inventory, it is often prudent to pursue glycan features based on a biochemical or cellular function of interest. This can be explored in various ways, including mutating the biosynthetic enzyme gene or the attachment site, use of glycan hapten inhibitors, selective enzymatic or chemical perturbation of target glycans, selective steric blockade of glycans with lectins or carbohydrate binding proteins, etc. Orthogonal approaches are typically required to circumvent non-specific effects of these methods toward the glycan of interest. However, as the N-glycans of protists are generally sufficiently similar to those of animals, yeast, or plants, lessons learned from multicellular organisms can be applied. With improvements in the ability to predict GTs and processing enzymes, genomic predictions can also be brought to bear on the challenge (37,53,65–68), with the caveat that many genes, especially in protists, have unknown functions.

O-GLYCOSYLATION
As in mammals, O-glycans occur in many different "flavours" in invertebrates and protists. Typical are the mucin-type O-glycans with αHexNAc attached to serine or threonine. These cell surface and extracellular structures often conserve negative charges with different sugars or substituents. More specialized are the short, usually neutral glycans of folded epidermal growth factor (EGF) and thrombospondin repeat (TSR) domains, whereas very widespread are monosaccharide modifications of nucleocytosolic proteins. As O-glycans are less easy to release and recover than N-glycans, we have far less knowledge about their structural diversity, but the origin of some can be traced back to extant members of the holozoan and choanoflagellate predecessors of invertebrates and others through various protist lineages all the way back to the LECA (69). The sugar-amino acid linkages are the most primal and can be mapped onto the evolutionary tree (Fig. 1), whereas the peripheral sugars in protists show such diversity that we simply summarize their nature (Fig. 2B).

MUCIN-TYPE O-GLYCOSYLATION
In animals, including invertebrates, mucin-type O-glycosylation is initiated by a family of up to 20 Golgi-associated CAZy GT27 polypeptide (pp) αGalNAcTs (70), often resulting in high-density arrays on mucin domains rich in Ser and Thr residues. In protists, these are typically initiated with GlcNAc rather than GalNAc, a difference that may correlate with the evolutionary emergence of a UDP-GalNAc epimerase activity (71). Based on studies in Dictyostelium and T. cruzi, a new family of related pp-αGlcNAcTs (CAZy GT60) was recognized that differs mainly in lacking the characteristic C-terminal ricin-like targeting domain found in almost all metazoan pp-αGalNAcTs (72–74). Up to four pp-αGlcNAcT genes are found in the genomes of both Group 1 (diatom, oomycete, and algae) and Group 2 protists (amoebozoa, T. cruzi), indicating their occurrence in the LECA and suggesting target specialization as occurs in the metazoan pp-αGalNAcT family. An exception occurs in apicomplexans, where up to five pp-αGalNAcTs are found (75,76); owing to their absence in other protist genomes (74), these are likely the result of lateral gene transfer (74), perhaps facilitated by residence in animal host cells. Consistent with this hypothesis, the Toxoplasma O-glycans are short, consisting of either a single αGalNAc (known as the Tn antigen in humans) or a HexNAc-αGalNAc (53).
Mucin-type protist O-glycans have been particularly well characterized in T. cruzi, where they are found in extensive arrays on cell surface GPI-anchored mucins. They are extended by α- or β-linked Galp and Galf residues in linear and branched chains, with a prevalence of 4-, 2-, and 6-linkages (77), as compared with the β3- and β6-linkages typical of animal O-glycans. The βGalp termini are frequently capped with α3-linked Gal, or sialic acid derived from host glycans by parasite trans-sialidases (5), creating O-glycans with general resemblance to host glycans but distinct in detail. Dozens of isoforms were defined by NMR (77), with recent MS profiling (31,62) confirming the high variation among T. cruzi strains. Interestingly, related GT genes are present in other trypanosomatids but the O-glycans have not been detected biochemically (73), indicating a limitation of global glycomic approaches. The amoebozoan Dictyostelium also extensively modifies its mucin-type domains with O-αGlcNAc, contributing to the modB-epitope (31), with evidence that some are capped with αFuc (78). The mucin-type O-glycans of protists in Group 1 remain to be characterized.

O-MAN
Another sugar that initiates O-glycans on Ser and Thr residues is α-mannose. In fungi, this linkage replaces the common mucin-type HexNAc linkage (79) and is linearly extended by further mannoses or Galf. Related protein O-mannosyltransferases (POMTs) are present in most animals, where the Man can be extended by GlcNAc, Gal, and sialic acid in vertebrate mucin domains. A specialized phosphorylated subset carries matriglycan, which modifies α-dystroglycan in mammalian muscle and brain (80). A glycoproteomics study discovered nonextended O-Man on mammalian adhesion proteins, transferred by a separate family of CAZy GT105 TMTC mannosyltransferases in the rER (81). Bioinformatics searches suggest that the POMTs originated in fungi, though a distant homolog in bacteria suggests a more primitive origin (82), while TMTC-like genes are rather widely distributed in both Group 1 and 2 protists and were thus probably present in the LECA; they potentially explain the origin of a dimannose species detected in T. cruzi (56).
The glycans on EGF and TSR domains of animal proteins are short, often neutral, and comprise one to four sugars. O-fucose was originally detected by mass spectrometry of human urine samples and later on EGF and TSR repeats of Notch and thrombospondin (83). O-fucosylation is initiated on folded EGF domains in the rER by POFUT1 (CAZy GT65) and can be extended out to a tetrasaccharide capped by sialic acid. EGF repeats are also modified by a small family of GlcTs (84) that can also use UDP-Xyl as a donor (85), and the Glc can be extended by xylose in animals (86). Extracellular β-linked O-GlcNAc is another modification of animal EGF domains, but does not appear to be extended. It is assembled by the CAZy GT61 family member EOGT (87), which is unrelated to the αGlcNAcT and βGlcNAcT enzymes described above and below. Some related enzyme sequences in protists and plants have distinct activities.

The O-Fuc found on folded TSRs is applied by POFUT2 and can be capped by a β3-linked Glc assembled by a CAZy GT31-family GT (53). This disaccharide and both GTs are also found in T. gondii and Plasmodium (the agent of malaria) (88), but the genes are otherwise not found in protists. C-Man is a monosaccharide modification of Trp residues typically associated with TSR repeats in metazoan proteins (89). The C-ManT resides in the rER, utilizes Dol-P-Man as the donor, and is also present in Toxoplasma and Plasmodium. C-ManT-like sequences, which belong to CAZy GT98 and the larger GT-C superfamily of Dol-P-sugar-dependent GTs (90), are seen in the broader group of alveolates, and in metazoan progenitors (choanoflagellates and holozoans), but not elsewhere. Thus, C-ManT, POFUT2, and the β3GlcT were potentially, like the pp-αGalNAcTs, acquired by lateral gene transfer (Fig. 1).
O-Fuc has also been reported on secreted proteins in Dictyostelium (78,91,92). O-Glc has been found in Trichomonas vaginalis, where it can be extended with predominantly Glc (93), whereas Glcα-Ser, sometimes extended with additional hexoses (94), was recently discovered on T. brucei VSG surface proteins by crystallography and confirmed by mass spectrometry. The enzymatic basis for these protist versions is unknown.
[...] (95). Two major types are found on extensins: a monosaccharide modification consisting of αGal and oligoarabinosyl modifications on adjacent Hyp residues initiated with β-L-Ara. In addition, Gal-Ser and pentose-Hyp linkages have also been detected by mass spectrometry (96). O-glycosylation of arabinogalactan proteins is much more complex, being initiated by a βGalT and consisting generally of a β3-galactan backbone to which are appended β6-linked Gal chains modified by a variety of monosaccharides including GlcA, resulting in immense structures that dwarf the carrier protein (95). The initiating GTs and some of the extending GTs have been recently identified, and their sequences will help determine the evolutionary origins of these modifications.
Hyp-linked glycans are also found in the green alga Chlamydomonas as short linear oligosaccharides of α-L-Araf, as determined by NMR and genomics analyses (97,98). Sequences related to the CAZy GT31 Hyp:βGalT are found in other Group 1 protists including a Cryptophyte and Stramenopiles, supporting evidence for arabinogalactans in Fucus (99). Similarly, the initiating Hyp:βAraT and Ser:αGalT are related to CAZy GT8 and GT96 family members in other protists. Because Hyp-linked O-glycans (as well as hydroxylysine (Hyl) O-glycans as in collagen) resist release by β-elimination (100), searches for their phylogenetic distribution will likely require glycoproteomic approaches aided by glycogenomics.

SUGAR PHOSPHO-LINKAGES
Phosphodiester-linked glycans originate by transfer of a sugar phosphate, rather than a sugar alone, from sugar nucleotides to Ser or Thr, and are subject to extension in the Golgi by various GTs or other phosphoglycosyltransferases. They can be released as anionic glycans by nonreductive β-elimination and analyzed by negative ion mode mass spectrometry, though sugars with a 2-OH α-manno configuration are susceptible to release as free sugars; alternatively, they can be released by aqueous HF, which preferentially cleaves the sugar-phosphate linkage (28). The simple structures GlcNAcα1-PO4-, Fucα1,3GlcNAc-PO4-, and Fucβ1-PO4- have been identified in the amoebozoan Dictyostelium (91,92). Another amoebozoan, Entamoeba histolytica, assembles a phosphodiester-linked Glcα1,6Galα1-PO4- onto Ser/Thr residues, which is elongated by additional α6-linked Glc residues (101). A phosphodiester-linked tri-α-mannoside has been described in T. cruzi on NETNES (102).
Possibly the most complex protein-linked oligosaccharide has been described on T. cruzi Gp72, which is a repeating phosphodiester-linked tridecasaccharide containing L-Rhap, L-Fucp, D-Galp, D-Galf, D-Xylp, and D-GlcNAcp linked to Thr/Ser via αGlcNAc-PO4 (34). In another trypanosomatid, Leishmania, O-glycosylation occurs in the form of long, diverse, and complex (linear or branched) proteophosphoglycans anchored via αMan-PO4-Ser/Thr that form extensive networks contributing to a type of biofilm (103). The challenge of detecting and characterizing such complex structures invites speculation that they are more common than assumed. These phosphodiester-linked structures have not been described in metazoa and presumably represent species-specific adaptations that ensure, in the absence of sialylation, a high negative charge density on their cell surfaces. In general, the evolution of phosphosugar structures and of the relevant phosphoGTs is highly underexplored.
Phosphodiester-linked sugars also occur as peripheral linkages (Fig. 2). For the pathogenic yeast Cryptococcus neoformans, the phosphoGT that assembles a Xyl-PO4-Man linkage has been identified (104), while in Saccharomyces cerevisiae, Mnn6p is a proven mannosylphosphate transferase toward α2-linked mannose and is related to conventional α-mannosyltransferases (105). In Dictyostelium, the enzyme that phosphorylates lysosomal enzymes is orthologous to animal examples (106), and extracts of Euglena showed a similar enzymatic activity (107).

O-βGlcNAc, O-αFuc, AND O-αGlc (NUCLEOCYTOPLASMIC)
O-βGlcNAc is a monosaccharide modification of Ser or Thr residues of thousands of proteins that reside in the cytoplasm and nucleus. Originally discovered in mammals by labeling with β4-galactosyltransferase in conjunction with β-elimination, its distribution can now be mapped directly using ETD mass spectrometry to preserve its sensitive linkage during fragmentation (108). O-βGlcNAc addition is mediated by the action of a CAZy GT41 family member referred to as OGT or Secret Agent (SEC). O-βGlcNAc has a variety of functions in metabolic sensing and regulation, and key to its actions in animals is its removal, and therefore cycling, by the action of a single O-GlcNAcase known as OGA. Outside of animals, O-βGlcNAc has been most studied in higher plants, where genetic studies have elucidated complex roles in modulating plant hormone signaling (109), but the lack of evidence for an OGA suggests that fundamental mechanisms may diverge. OGT-like sequences apparently originated in prokaryotes and are distributed in many Group 1 and 2 protists (110). Their absence in some major groups is most consistent with gene loss, which might be compensated by other monosaccharide modifications. Given its evident occurrence in the LECA, O-βGlcNAc action may be widespread in protists.
An ancient gene duplication, probably in prokaryotes (110), resulted in a closely related enzyme now known to generate O-αFuc on Ser and Thr. This modification was first characterized on AAL-enriched nucleocytoplasmic proteins of T. gondii using mass spectrometry (111). The O-αFuc modification has been detected on dozens of T. gondii proteins, multiple Dictyostelium proteins, and two FG-repeat nuclear pore proteins of Cryptosporidium ((112), unpublished). Known as Spy or OFT, the enzymatic activity was first described in Arabidopsis (113) where prior genetic analyses had demonstrated roles in modulating hormone signaling (110). Subsequent studies demonstrated OFT activity in T. gondii and a role in promoting growth (53). Plants express both OGT and OFT, but neither OFT in animals nor the coexistence of the two enzymes in protists has yet been described.
Finally, yeast and animal glycogen often exists as an O-glycan linked at its reducing terminus to a critical Tyr residue within glycogenin, an autocatalytic α-glucosyltransferase that primes glycogen synthesis in the cytoplasm (114). As for Hyp- and Hyl-linked O-glycans described above, Tyr-linked glycans are not released by β-elimination. Currently thought to occur only in metazoa and fungi, a close homolog and potential evolutionary progenitor of glycogenin contributes to Skp1 glycosylation in Group 1 protists (see below). Distinct nucleocytoplasmic glycans might also exist elsewhere, as suggested by a report of O-Man mono- and oligosaccharides in yeast (115).

HYP (NUCLEOCYTOPLASMIC)
Numerous protists express a complex O-glycan linked to the side chain of 4(trans)-hydroxyproline (Hyp) of a nucleocytoplasmic protein, Skp1, which is an essential subunit of the eukaryote-wide SCF class of E3 polyubiquitin ligases, but is glycosylated only in Group 1 and 2 protists. Skp1 glycosylation appears to promote assembly of SCF subcomplexes (116) by a local conformational control mechanism (35,117,118). Where the structure has been studied, the glycan consists of a linear pentasaccharide composed of GlcNAc, βGal, αFuc, αGlc (in some species), and αGal. The glycan was initially detected in Dictyostelium by metabolic labeling with [3H]Fuc, but structural studies required traditional glycoproteomic approaches as well as analysis of the glycopeptide after permethylation (35), a method not generally recognized for its applicability to glycopeptides (119). Exoglycosidase treatments, characterization of the specificity of the GTs on model substrates, and ultimately NMR were required to establish the structures (35,120). Like many sugar nucleotide-dependent GTs that target proteins in prokaryotes, the Skp1 GTs exist as noncomplexed soluble enzymes in the cytosol and are evolutionarily related to familiar animal Golgi GTs. While the initiating GT for Dictyostelium Skp1 is a pp-αGlcNAcT related to its Golgi counterpart that initiates mucin-type O-glycosylation (72), the next two GT activities of Dictyostelium, Toxoplasma, and Pythium (a β3GalT and an α2FucT) (121) are in the same bifunctional protein and belong respectively to the CAZy GT2 and GT74 families whose members are prevalent in prokaryotes (74). In Toxoplasma, addition of the fourth sugar is catalyzed by a CAZy GT31 family α3GlcT (Glt1) that is related to Golgi αManTs that elongate mannans in fungi and yeasts (122), and the final sugar is applied by an α3GalT (Gat1) whose CAZy GT8 relatives are often found in the cytoplasm. Gat1 is closely related to the aforementioned glycogenin (120).
In Dictyostelium, Glt1 and Gat1 are replaced by an unrelated CAZy GT77 α3GalT (AgtA) that operates twice on the reducing end terminus to also generate a pentasaccharide (123). AgtA is related to an uncharacterized Golgi enzyme in T. cruzi and pectin synthesizing enzymes in plants, but in Dictyostelium, a C-terminal WD40-repeat β-propeller domain both modulates Skp1 activity independent of its glycosylation status and facilitates addition of the second sugar (123). Interestingly, odd-even enzyme pairs of the 6-enzyme Skp1 modification pathway are frequently expressed as fusion proteins in different protist taxa (121,124), possibly to support processive modification. The Skp1 enzymes were evidently present in the LECA and might have contributed to the newly emerging secretory pathway compartments via the simple exigency of gene duplication and acquisition of N-terminal signal anchor sequences. Indeed, the Skp1 and Golgi pp-αGlcNAcTs are each encoded by two-exon genes with an intron separating a short N-terminal sequence from the catalytic domain (72,125).
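The two Skp1 pentasaccharides described above differ in their fourth sugar (αGlc in Toxoplasma, αGal in Dictyostelium) yet share the same monosaccharide composition, which illustrates why mass alone could not establish the structures and why exoglycosidase, GT specificity, and NMR data were needed. The following Python sketch makes that point numerically. It is an illustration under stated assumptions rather than part of any published workflow: the sugar lists are transcribed from the text, the residue masses are standard monoisotopic values, and all function and variable names are hypothetical.

```python
# Standard monoisotopic residue masses (Da) for sugars within a glycan chain.
RESIDUE_MASS = {
    "Hex": 162.052824,     # Gal, Glc
    "HexNAc": 203.079373,  # GlcNAc
    "dHex": 146.057909,    # Fuc
}
# Mass gained when Pro is hydroxylated to Hyp (one oxygen atom).
HYDROXYLATION = 15.994915

# Mass class of each named sugar (anomeric configuration does not affect mass).
SUGAR_CLASS = {"GlcNAc": "HexNAc", "Gal": "Hex", "Glc": "Hex", "Fuc": "dHex"}

# Sugars listed from the Hyp-linked GlcNAc outward, as described in the text.
SKP1_GLYCANS = {
    "Toxoplasma": ["GlcNAc", "Gal", "Fuc", "Glc", "Gal"],
    "Dictyostelium": ["GlcNAc", "Gal", "Fuc", "Gal", "Gal"],
}


def composition(sugars):
    """Count residues per mass class, e.g. {'HexNAc': 1, 'Hex': 3, 'dHex': 1}."""
    counts = {}
    for sugar in sugars:
        cls = SUGAR_CLASS[sugar]
        counts[cls] = counts.get(cls, 0) + 1
    return counts


def glycan_mass_shift(sugars):
    """Mass added to a Hyp-containing peptide by the attached glycan (Da)."""
    return sum(RESIDUE_MASS[SUGAR_CLASS[s]] for s in sugars)


if __name__ == "__main__":
    for species, sugars in SKP1_GLYCANS.items():
        shift = glycan_mass_shift(sugars)
        print(species, composition(sugars),
              f"glycan: {shift:.4f} Da,",
              f"vs. unmodified Pro: {shift + HYDROXYLATION:.4f} Da")
```

Both entries give HexNAc1Hex3dHex1 and the same mass shift, so a precursor mass measurement alone cannot tell the two glycans apart.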
The unusual linkage of the glycan to a Hyp is related to its function in O2-sensing in Dictyostelium and Toxoplasma (126). Glycosylation is contingent upon the presence of sufficient ambient O2 as a substrate for a dedicated prolyl 4-hydroxylase (PhyA) related to the O2-sensing PHD2 that regulates HIFα in animals. The evidence suggests that O2-sensing had its origins in protists before the enzyme redirected its target from an E3 Ub ligase to the HIFα transcriptional cofactor, whose hydroxylation renders it a target for an evolutionarily related E3 Ub ligase. The genes for this pathway are found in many, but not all, aerobic protists including several human and crop pathogens.

THE BOTTOM LINE
A primary motto in nonmammalian glycomics is "expect the unexpected." Therefore, if you are more used to analyzing mammalian glycans, a database-centered approach to annotation can yield misleading results. Keeping an open mind, one can begin with separating different classes of glycans using different PNGases and/or fractionation into neutral, anionic, or hydrophobic pools, whereas O-glycan release must be done chemically or at the peptide level for linkages to Hyp, Hyl, or Tyr. Chromatography is highly useful for separating isomeric and isobaric structures before mass spectrometry, whereby chemical and exoglycosidase treatments as well as defined standards ease interpretation. Often, orthogonal information from lectin or antibody recognition or dependence on a GT with characterized specificity forms the basis for inferring a structural element. Evolutionary and glycophylogenetic considerations also come into play, based on the presence of linkage types and/or GTs in related, predecessor, or derivative groups, potentially tracing back to the LECA. Despite the availability of all predicted glycogenes for a few protists (53,66,67), we are still a long way from being able to infer the biochemical function of many glycogene paralogs and, thereby, the glycome based solely on genetic information. If your "knowledgebase" is firmly established, then your own computer- and/or brain-based database can be a start for comparisons with other organisms or for glycoproteomics.
Together with the nascent technology of natural glycan arrays, we can then start to think about the functions of all this glycobiodiversity!

Acknowledgments - Hanke van der Wel, Katharina Paschnger, and Donovan Cantrell are thanked for their suggestions.

Conflicts of interest - The authors declare no competing interests.