Proteomics Reveals N-Linked Glycoprotein Diversity in Caenorhabditis elegans and Suggests an Atypical Translocation Mechanism for Integral Membrane Proteins

Protein glycosylation is one of the most common post-translational modifications in eukaryotes and affects various aspects of protein structure and function. To facilitate studies of protein glycosylation, we paired glycosylation site-specific stable isotope tagging of lectin affinity-captured N-linked glycopeptides with mass spectrometry and determined 1,465 N-glycosylated sites on 829 proteins expressed in Caenorhabditis elegans. The analysis shows the diversity of protein glycosylation in eukaryotes in terms of glycosylation sites and oligosaccharide structures attached to polypeptide chains and suggests the substrate specificity of oligosaccharyltransferase, a single multienzyme complex in C. elegans that incorporates an oligosaccharide moiety en bloc to newly synthesized polypeptides. In addition, topological analysis of 257 N-glycosylated proteins containing a putative single transmembrane segment that were identified based on the relative positions of glycosylation sites and transmembrane segments suggests that an atypical non-cotranslational mechanism translocates large N-terminal segments from the cytosol to the endoplasmic reticulum lumen in the absence of signal sequence function.

Molecular & Cellular Proteomics 6:2100-2109, 2007.
Protein post-translational modifications (PTMs) such as proteolysis or addition of a chemical group to one or more amino acid residues may change the properties of a protein. A large body of evidence suggests that PTMs are critical for various cell regulatory and signaling processes (1), and thus the analysis of the status of PTMs on proteins is a major objective of proteomics research. The development of certain new technologies for mapping PTMs on a proteomic scale has begun to yield fruitful results (2-8). Because all PTMs are accompanied by a change in the molecular mass of a protein, MS-based analysis is often selected for large scale PTM analyses.
Among the approximately 200 different known PTMs (9), protein glycosylation is one of the most common in eukaryotes: on average there are potential targets in more than half of the genes encoded in eukaryotic genomes (10). Protein glycosylation plays a role in protein folding, subcellular localization, turnover, activity, protein-protein interactions, etc. and contributes significantly to physiology as evidenced by the growing number of human diseases with defects in glycoconjugate assembly and processing (11,12). Thus, the analysis of protein glycosylation is important for both basic biology and clinical applications, including the discovery of protein biomarkers for diagnosis and drug discovery. Previous studies show that protein glycosylation is quite diverse because the oligosaccharide structure may vary widely between different proteins. In addition, a single protein can be glycosylated at multiple sites, and subsequent processing may differentially or partially modify an oligosaccharide attached at each site. These factors generate the observed complexity of glycoprotein structure and cause difficulties in characterizing protein glycosylation on a proteomic scale. At present, little is known about the final structure of most glycoproteins; however, the specific structure of each oligosaccharide and the rate of the modification(s) are often critical to individual glycoprotein function, and defects in these processes may cause disease (13). Thus, the mechanisms by which protein glycosylation is regulated remain a challenging problem for proteomics research.
Currently two methods allow large scale glycoprotein analysis directly from a complex biological mixture; both utilize MS-based shotgun technology but differ in the way glycopeptides are collected. One of the methods captures glycopeptides, regardless of the glycan structure, on a solid support by chemical coupling between the cis-diol group of the glycan and hydrazide on the support, and then N-linked glycopeptides are released specifically from the support by peptide-N-glycanase (PNGase) digestion (14,15). Another method captures a subset of glycopeptides by lectin affinity chromatography (16-18). The type of glycopeptides captured by this method depends on the specificity of the lectin used; however, comprehensive analysis of glycoproteins can be achieved by using multiple lectin columns with distinct binding specificity (e.g. non-reducing end oligosaccharides). This approach, termed isotope-coded glycosylation site-specific tagging (IGOT), includes a step to remove the glycan moiety of glycopeptides with PNGase in ¹⁸O-labeled water (16). When the enzyme releases N-linked glycans in H₂¹⁸O, the glycosylated Asn residue (in the consensus tripeptide sequence for N-linked glycosylation, Asn-Xaa-(Ser/Thr) where Xaa is any amino acid except Pro) is converted to Asp with concomitant incorporation of ¹⁸O from water (19). This PNGase-mediated incorporation of the ¹⁸O tag distinguishes glycosylated peptides from non-glycosylated peptides that contain Asp residues formed by non-enzymatic deamidation of Asn. The conversion of Asn to Asp via ¹⁸O incorporation in the glycosylation consensus sequence strongly indicates that the peptide was formerly N-glycosylated.
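The consensus rule and the mass tag described above can be illustrated with a short Python sketch. It is not the software used in this study: the function name `find_sequons` and the example sequence are hypothetical, and the monoisotopic masses are standard values rather than numbers taken from the text.

```python
import re

# Standard monoisotopic masses (Da); these constants are not from the paper.
MASS_N = 14.003074
MASS_H = 1.007825
MASS_O16 = 15.994915
MASS_O18 = 17.999160

# Asn -> Asp replaces the side-chain NH2 with OH: net loss of NH, gain of O.
DEAMIDATION_16O = MASS_O16 - (MASS_N + MASS_H)  # ordinary deamidation
DEAMIDATION_18O = MASS_O18 - (MASS_N + MASS_H)  # PNGase reaction in 18O water

# A lookahead is used so that overlapping sequons (e.g. N-N-T-S) are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn in Asn-Xaa-(Ser/Thr), Xaa != Pro."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence)]


if __name__ == "__main__":
    demo = "MKNGTAANPSLLNNTSK"  # hypothetical sequence
    print(find_sequons(demo))
    print(round(DEAMIDATION_16O, 3), round(DEAMIDATION_18O, 3))
```

The difference of roughly 2 Da between the two shifts is what separates an ¹⁸O-tagged, formerly glycosylated site from an Asp produced by ordinary deamidation.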
In this study, we paired IGOT with automated multidimensional liquid chromatography-MS technology and identified 1,465 N-glycosylated sites on 829 proteins expressed in Caenorhabditis elegans. We report here the diversity of protein glycosylation and the specificity of the oligosaccharyltransferase of C. elegans that incorporates an oligosaccharide moiety en bloc into nascent polypeptide chains. Based on the analysis of the relative positions of N-glycosylation sites and putative transmembrane segments of 257 potential integral membrane glycoproteins identified in this study, we also suggest that an atypical, non-cotranslational mechanism determines the topology of integral membrane glycoproteins.

EXPERIMENTAL PROCEDURES
Preparation of a Galectin 6 Column-The coding sequence of the C. elegans galectin 6 (GaL6) cDNA (provided by Dr. Hirabayashi, National Institute of Advanced Industrial Science and Technology (AIST), Ibaraki, Japan) was inserted into the Escherichia coli expression vector pET and introduced into E. coli BL21(DE3)pLysS (20). The transformant was cultured in M9CA medium containing 0.2 mg/ml ampicillin at 37°C, and gene expression was induced with 1 mM isopropyl 1-thio-β-D-galactopyranoside at a midlog phase of growth (A₆₀₀ = 0.6-0.8). After further cultivation for 3 h, E. coli cells were lysed by sonication at 4°C in 50 mM sodium phosphate buffer, pH 7.5, and centrifuged at 10,000 × g for 30 min. The supernatant was then applied to an asialofetuin column (Toyopearl 650M, 2.5-cm inner diameter × 5 cm) equilibrated with 50 mM sodium phosphate buffer, pH 7.5, at a flow rate of 0.5 ml/min. After washing the column with the equilibration buffer, the adsorbed GaL6 was eluted with the same buffer containing 0.2 M lactose. The purified GaL6 (20 mg) was immobilized on TSK-GEL Tresyl-5PW (2 ml; TOSOH) according to the protocol provided by the supplier and was packed into a 4.6-mm inner diameter × 10-cm column.
Preparation of Tryptic Digests of Soluble and Insoluble Protein Fractions of C. elegans-C. elegans strain N2 was cultured in liquid medium at 20°C as described previously (16,21). A mixed growth phase culture of the worm (5-20 g, wet weight) was lysed by sonication in 5 volumes of TBS (50 mM Tris-HCl, pH 7.5, 150 mM NaCl) containing a protease inhibitor mixture (Sigma), and the homogenate was centrifuged at 1,000 × g for 10 min at 4°C to remove cell debris. The soluble extract was then centrifuged at 100,000 × g for 30 min at 4°C to separate the soluble and insoluble protein fractions. Each fraction was solubilized in 7 M guanidine HCl in 0.5 M Tris-HCl, pH 8.6, containing 50 mM EDTA, and the proteins were reduced with dithiothreitol and S-carbamoylmethylated with iodoacetamide (22). The S-carbamoylmethylated proteins were dialyzed against 10 mM HEPES-NaOH, pH 7.5, and digested with Nα-tosylphenylalanyl chloromethyl ketone-treated trypsin (Pierce) at an enzyme:substrate ratio of 1:50 at 37°C. After 18 h, an aliquot of protease inhibitor mixture (Sigma) was added to the mixture to stop digestion and to protect the lectin columns.
Preparation of Lectin-bound Glycopeptides-In our earlier attempts, we prepared an N-glycosylated protein fraction by lectin affinity chromatography of C. elegans crude extract and then obtained N-glycosylated peptides from a tryptic digest of the glycosylated protein fraction by a second round of lectin affinity chromatography (16). In this study, however, we modified the procedure to more efficiently identify the integral membrane glycoproteins; the crude protein extract was first digested with trypsin after S-carbamoylmethylation in 7 M guanidine HCl, and then the N-glycosylated peptides were recovered by lectin affinity chromatography. To increase the purity of glycopeptides, we incorporated an additional "hydrophilic interaction" chromatography step (23) before PNGase-mediated ¹⁸O labeling (described later).
To collect N-glycosylated peptides, the tryptic digests of soluble and insoluble protein fractions of C. elegans were subjected to affinity chromatography on three lectin columns, concanavalin A (Con A) (LA-Con A; 4.6-mm inner diameter × 15 cm; Seikagaku Corp., Tokyo, Japan), wheat germ agglutinin (WGA) (LA-WGA; 4.6-mm inner diameter × 15 cm; Seikagaku Corp.), or GaL6 (4.6-mm inner diameter × 10 cm). Approximately 50-200 mg of peptide mixture was applied to each column equilibrated with 10 mM HEPES-NaOH, pH 7.5. After washing the column with the equilibration buffer, adsorbed glycopeptides were recovered by elution with the buffer containing a cognate sugar: 0.2 M α-methyl mannopyranoside for the Con A column, 0.2 M N-acetyl-D-glucosamine (GlcNAc) for the WGA column, or 0.2 M lactose for the GaL6 column. To maximize the recovery of glycosylated peptides, the flow-through fraction of the first chromatography was applied again to the same lectin column, and the chromatography was repeated as described above. The glycopeptide fractions from individual lectin columns of the first and second rounds of chromatography were combined for subsequent steps.
Purification of Glycopeptides by Hydrophilic Interaction Chromatography-The N-glycosylated peptide mixture recovered by lectin affinity chromatography (10-20 ml containing 200-500 μg of peptides) was added to an equal volume of ethanol (EtOH) and 4 volumes of 1-butanol (BuOH) and was applied immediately to a Sepharose CL-4B column (5-mm inner diameter × 50 mm) equilibrated with the solvent H₂O:EtOH:BuOH = 1:1:4 (v/v/v). After washing the column with the same solvent, adsorbed glycopeptides were eluted with H₂O:EtOH, 1:1 (v/v). The column eluent was monitored at 220 nm, and the recovered glycopeptides were quantitated fluorometrically after reaction with o-phthalaldehyde (24).
PNGase-mediated ¹⁸O Labeling of Glycopeptides-N-Glycosylated peptides were labeled specifically with ¹⁸O by IGOT as described previously (16). Briefly the sample glycopeptides were dried under vacuum to remove solvent containing H₂¹⁶O and then redissolved in 0.1 M Tris base prepared in H₂¹⁸O (≥99 atom % ¹⁸O; Taiyo Nippon Sanso Corp., Tokyo, Japan). The peptide solution was then adjusted to pH 8-9 with a minimal volume of acetic acid, and then PNGase-A (lyophilized; Seikagaku Corp.), dissolved in H₂¹⁸O, was added to a final concentration of 1 milliunit/10 μg of peptide. The reaction was incubated overnight at 37°C in a sealed polypropylene tube.
Automated 2D Nano-LC-MS/MS Analysis of ¹⁸O-Tagged Peptides-The deglycosylated ¹⁸O-tagged peptide mixture (approximately 5-10 μg) was analyzed by automated 2D LC-MS/MS. The instrument used was a miniaturized version of that described previously (25,26) and was equipped with a first dimensional microscale cation-exchange column (1-mm inner diameter × 50 mm) of Bioassist-S (7-μm particles; TOSOH) and a second-dimensional direct nanoflow spray tip reversed phase column (150-μm inner diameter × 50 mm) of Mightysil-C18 (3-μm particles; Kanto Chemicals) connected in tandem through an electric column switching valve and an automated solvent desalting device. The chromatography was performed automatically under the time-dependent control program, and the eluate was directly sprayed into a high resolution Q-TOF hybrid mass spectrometer (Q-TOF Ultima; Waters-Micromass) at a flow rate of 100 nl/min. The spectrometer was operated in a data-dependent MS/MS mode where a full MS scan (1 s, m/z 400-1500) was followed by two MS/MS scans (1 s each, m/z 100-1500). The two most intense precursor ions with a charge state (z) of +2 or +3 were dynamically selected and subjected to collision-induced dissociation with a collision energy as recommended by the manufacturer and a dynamic exclusion duration of 30 s. The total analysis time for a single 2D nano-LC-MS/MS process was 24 h.
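The data-dependent selection rule described above (the two most intense precursors with a charge of +2 or +3, each excluded for 30 s after selection) can be sketched as follows. This only illustrates the logic; the instrument control software is not described in the text, and the `Peak` record, the function name, and the m/z matching tolerance `mz_tol` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float
    intensity: float
    charge: int


def select_precursors(peaks, now, excluded_until, top_n=2,
                      allowed_charges=(2, 3), exclusion_s=30.0, mz_tol=0.1):
    """Pick the top_n most intense eligible peaks from one full MS scan.

    excluded_until maps a previously selected m/z to the time (s) at which
    its dynamic exclusion ends; it is updated in place.
    mz_tol is an assumed matching tolerance, not a value from the paper.
    """
    def is_excluded(mz):
        return any(abs(mz - ex_mz) <= mz_tol and now < until
                   for ex_mz, until in excluded_until.items())

    eligible = [p for p in peaks
                if p.charge in allowed_charges and not is_excluded(p.mz)]
    chosen = sorted(eligible, key=lambda p: p.intensity, reverse=True)[:top_n]
    for p in chosen:
        excluded_until[p.mz] = now + exclusion_s
    return chosen
```

Calling the function once per full MS scan with a shared `excluded_until` dictionary lets less abundant precursors be fragmented while the dominant ions are temporarily excluded.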
Protein Identification by Database Search-The large volume of MS/MS data generated by the 2D nano-LC-MS/MS analysis was converted to text files using MassLynx software (version 4.0, Micromass). The peak list files were then created with smoothing by the Savitzky-Golay method (window channels, ±3) using the same software and processed by the Mascot algorithm (version 1.9, Matrix Science, Ltd.) to assign peptides on the C. elegans Wormpep 124 protein sequence database (22,259 entries, www.sanger.ac.uk/Projects/C_elegans/WORMBASE/current/wormpep.shtml). The database search was performed with the parameters as described previously (16,21) except that we defined a custom modification, "deamidation with ¹⁸O (asparagine + 3 Da)," for the deamidation of Asn incorporating ¹⁸O. We first screened the candidate peptides with probability-based Mowse scores that exceeded their thresholds (p < 0.05) and with MS/MS signals for y- or b-ions >3; finally we selected "identified peptides" that contained one or more aspartic acid tagged with ¹⁸O atoms on the basis of their MS/MS spectra. If a prospective "identified peptide" did not contain the consensus tripeptide sequence for N-linked glycosylation (Asn-Xaa-(Ser/Thr)), the data were eliminated regardless of the match score. The resulting dataset was finally evaluated by in-house software STEM (27) to remove unreliable Mascot peptide identifications and redundant assignments and to integrate the results with key parameters of the experiment.
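The screening criteria above can be written as a simple filter. The sketch below is an interpretation, not the STEM or Mascot implementation: the `PeptideHit` fields are hypothetical, ">3" is read as more than three matched y- or b-ions, and a peptide is accepted when at least one tagged Asp occupies the Asn position of a consensus tripeptide. A sequon that runs past the end of the peptide would need the parent protein sequence and is not handled here.

```python
from dataclasses import dataclass


@dataclass
class PeptideHit:
    sequence: str            # peptide after PNGase treatment (Asp at tagged sites)
    score: float             # probability-based Mowse score
    threshold: float         # score threshold corresponding to p < 0.05
    matched_ions: int        # number of matched y- or b-ions
    tagged_positions: tuple  # 0-based indices of Asp residues carrying 18O


def in_sequon(sequence, i):
    """True if position i is followed by Xaa (not Pro) and then Ser or Thr."""
    return (i + 2 < len(sequence)
            and sequence[i + 1] != "P"
            and sequence[i + 2] in "ST")


def is_identified(hit, min_ions=3):
    """Apply the score, ion-count, tag, and consensus-sequence criteria."""
    if hit.score <= hit.threshold or hit.matched_ions <= min_ions:
        return False
    # At least one tagged Asp must sit where the sequon Asn was (assumption).
    return any(hit.sequence[i] == "D" and in_sequon(hit.sequence, i)
               for i in hit.tagged_positions)
```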
Characterization of the Identified Glycoproteins-The transmembrane segment and the signal peptide of proteins were predicted by SignalP 3.0 (28) and/or ConPred II (29) bioinformatics tools.

Identification of N-Glycosylated Proteins Expressed in C. elegans
Because the glycosylation reaction takes place within the lumen of the endoplasmic reticulum (ER), N-linked glycoproteins should also have a signal sequence and/or a transmembrane segment as discussed later. We identified these two structural elements in ~25% of the 22,500 genes predicted from the genome sequence of C. elegans (Wormpep), suggesting that there are ~6,000 potential targets for N-linked glycosylation. To catalog N-glycosylated proteins expressed in C. elegans and to study details of protein glycosylation, we used IGOT coupled with MS-based proteomics. To increase the coverage, we used three types of lectin columns with different binding specificity for the oligosaccharide attached to the polypeptide chain; thus, the columns contained immobilized Con A, WGA, and GaL6 (20), which are specific for the non-reducing end of Man, GlcNAc, and Gal, respectively. In addition, the lectin affinity chromatography was performed with tryptic peptide mixtures derived from soluble and insoluble protein fractions of C. elegans crude extract (see "Experimental Procedures"). The glycopeptide mixtures were further purified by hydrophilic interaction chromatography on Sepharose CL-4B, subjected to IGOT (i.e. N-glycanase-mediated ¹⁸O labeling), and analyzed by automated 2D nano-LC-MS/MS shotgun technology to identify ¹⁸O-labeled formerly N-glycosylated peptides. To maximize the number of identifications, the shotgun analysis was repeated three times for each peptide mixture prepared by Con A, WGA, and GaL6 affinity chromatography of the soluble/insoluble fractions. Supplemental Table 1 lists all the candidate glycosylated peptides in C. elegans identified in this study, and all their MS/MS spectra are shown in Supplemental Fig. 1-1 to 1-9.
Supplemental Table 2 lists the C. elegans N-glycosylated proteins and the number of glycosylation sites identified in this study. We identified 1,465 N-glycosylated sites on 829 proteins (Supplemental Table 2). The number of glycosylated sites assigned on each protein ranged from 1 to 24 with an average of 1.5. The glycoproteins we identified were quite diverse in terms of subcellular localization and function, etc., yet many (approximately 50%) were integral membrane proteins such as cell surface receptors, transporters, channels, extracellular matrix proteins, and proteases.

Structural Heterogeneity of Oligosaccharides of the Glycoproteins
Previous studies have shown that most glycans liberated from C. elegans membrane proteins contain neutral sugars and have an oligomannose-type structure and that approximately 80% of N-linked glycans in C. elegans have a nonreducing end mannose that is recognized by Con A (30, 31). These glycans lack sialic acid as the C. elegans genome has no sialyltransferase gene, implying that the glycan structure is relatively simple as compared with that of mammalian cells (32). Thus, our lectin affinity analysis of glycopeptides showed that the largest subset of N-glycoproteins was identified from the Con A-captured peptide mixtures (Fig. 1); however, the glycopeptides collected by each lectin column overlapped significantly (Supplemental Table 3). For example, 24 glycopeptides assigned for him-4 (F15G9.4) were identified using the Con A column, whereas some of these glycopeptides were also recovered from the WGA and GaL6 columns, suggesting that the him-4 product has a highly heterogeneous glycan structure. Of the 1,465 glycosylated sites we determined, 138 sites on 105 proteins were found redundantly in the peptides captured by the three lectin columns, and 317 sites on 228 proteins were found in the peptides captured by two of the lectin columns. Although a subset of those peptides should have hybrid-type glycan structures that would be recognized by multiple lectins, our study implies that most of the worm glycoproteins have complex glycoforms that are typical of eukaryotes. It should be noted that glycan structures may be heterogeneous not only at the protein level but also that each glycosylation site may carry a complex series of N-linked glycans if one particular peptide on a single protein is recovered by multiple lectin columns (e.g. Supplemental Table 3). 
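The overlap counts quoted above amount to asking, for each glycosylation site, how many lectin fractions it was recovered from. A minimal sketch of that bookkeeping is given here; the function name and the toy site identifiers are hypothetical, and the real input would be the site lists of Supplemental Table 3.

```python
from collections import Counter


def overlap_summary(sites_by_lectin):
    """Count sites captured by exactly one, two, or three lectin columns.

    sites_by_lectin maps a lectin name to a set of site identifiers,
    for example (protein ID, Asn position).
    """
    hits = Counter()
    for sites in sites_by_lectin.values():
        hits.update(set(sites))        # each column counts a site once
    return Counter(hits.values())      # {number of columns: number of sites}


if __name__ == "__main__":
    toy = {
        "ConA": {("P1", 10), ("P1", 55), ("P2", 7)},
        "WGA": {("P1", 10), ("P2", 7)},
        "GaL6": {("P1", 10)},
    }
    print(overlap_summary(toy))
```

Applied to the counts reported above, 138 sites in three fractions and 317 sites in two leave 1,465 - 138 - 317 = 1,010 sites that were found in a single lectin fraction.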
Unlike these proteins, however, we also found that many proteins, such as neprilysin (ZK20.6) and integrin α (F54G8.3), have relatively homogeneous glycans because multiple glycopeptides were identified only in the Con A-bound fraction, suggesting that the proteins contain a high mannose-type oligosaccharide chain(s). However, our argument, based on the binding specificity of different types of lectins, should certainly be confirmed by direct structural analysis of the oligosaccharides attached to each site of the polypeptide chain.

Amino Acid Residues Close to the Glycosylated Site
Although we identified ~6,000 potential targets for N-glycosylation, not all those proteins were found to be N-glycosylated, and as a matter of course, not all Asn residues in the consensus sequences were glycosylated. This suggests that local sequence elements may help determine the specificity of oligosaccharyltransferase (OST), an enzyme responsible for attachment of an oligosaccharide to the newly synthesized polypeptide. We reported previously the frequencies of the amino acid residues around the 400 glycosylated sites on Con A-bound soluble proteins in C. elegans (16). In the present study, the analysis was performed for 1,465 unique glycosylated sites on 829 proteins captured on the three types of lectin columns (Table I). Within the consensus tripeptide sequence for N-linked glycosylation, Thr occurs at position 3 more than twice as frequently as Ser (819 Thr versus 359 Ser). Pro does not occur at positions 2 or 4 with only one exception. We also found that Cys occurs at positions −3 to 6 at 1.5-2.5 times greater frequency than that expected from natural abundance of this residue. However, we could not detect other strong amino acid preferences around the glycosylation sites. This suggests that the nematode OST can introduce an oligosaccharide to almost any Asn-Xaa-(Ser/Thr) (Xaa ≠ Pro) sequence if the nascent polypeptide chain has a properly folded tertiary structure or meets some other criteria. The genome sequence implies that C. elegans may have a single OST-translocon (protein-conducting channel) complex in the ER lumen, whereas the yeast Saccharomyces cerevisiae has multiple OST-translocon complexes that might have different specificities for other sequences near consensus glycosylation sites (33-35). However, the factors that guide OST-mediated glycosylation remain unknown.
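The comparison with natural abundance used in this section can be made explicit. The sketch below assumes that the relative rate is the observed frequency at a position divided by the background frequency of the residue; the function name and the numbers in the example call are hypothetical, whereas the Thr and Ser counts are those given in the text.

```python
def relative_rate(occurrences, total_sites, natural_abundance):
    """Observed frequency at one position divided by the background frequency."""
    if total_sites <= 0 or not 0 < natural_abundance <= 1:
        raise ValueError("total_sites must be positive and abundance in (0, 1]")
    return (occurrences / total_sites) / natural_abundance


# Counts at position 3 of the consensus tripeptide, as reported in the text.
THR_AT_3, SER_AT_3 = 819, 359
TOTAL = THR_AT_3 + SER_AT_3          # 1,178 sites
THR_TO_SER = THR_AT_3 / SER_AT_3     # Thr is favored a little over 2-fold

if __name__ == "__main__":
    # Hypothetical residue: 50 occurrences in 1,000 sites, 2% background.
    print(relative_rate(50, 1000, 0.02))
    print(round(THR_TO_SER, 2))
```

A relative rate near 1 means the residue occurs about as often as expected by chance; the 1.5-2.5 enrichment of Cys reported above corresponds to values in that range.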

Topological Analysis of Integral Membrane Glycoproteins
Classification of Integral Membrane Glycoproteins-Although N-linked glycoproteins are found throughout the secretory pathway, including the cell surface, the ER, Golgi, and lysosomes, the initial attachment of N-glycans to polypeptides occurs only in the ER lumen. Thus, proteins destined for N-glycosylation contain structural elements that serve as signals that target proteins to the ER. The best characterized signal is the signal peptide, a short hydrophobic segment at the N terminus of a nascent polypeptide chain that serves as a recognition site for signal recognition particle (SRP). Upon binding the signal peptide, SRP translocates the translating ribosome to the translocon on the ER membrane. The translocon then inserts the C-terminal portion of the polypeptide chain co-translationally into the ER (36). Subsequently the signal peptide is cleaved off from the nascent polypeptide chain by the ER enzyme signal peptidase, thereby exposing the newly emerged N terminus in the lumen of the ER. Thus, the signal peptide plays a major role in the determination of the transmembrane topology of integral membrane proteins. This mechanism generates the Type I transmembrane proteins that have single transmembrane segments (Table II). Another ER targeting signal is the signal anchor, which, like the signal peptide, contains a hydrophobic segment recognized by SRP. The internal signal anchor sequence generates Type II and Type III transmembrane proteins, which differ with regard to the membrane topology of the polypeptide chain, by inserting either the N- or C-terminal transmembrane segment into the ER (37). Unlike the signal peptide, however, the signal anchor sequence is not cleaved by signal peptidase and remains in the mature protein as a transmembrane segment.
To characterize the glycoproteins identified in this study, we searched for ER targeting signals in the amino acid sequences of C. elegans N-linked glycoproteins. We first searched for signal peptide sequences using SignalP 3.0 (neural network and hidden Markov model methods) (28) and ConPred II (module DetecSig) (29). We accepted the results of prediction only when both programs gave identical results; otherwise we considered the results ambiguous (Table II) and did not further consider them. The number and position of transmembrane segments were predicted with ConPred II (29). The results, including several representative known glycoproteins, are summarized in Table II. Of the 829 N-linked glycoproteins identified in this study, 463 had putative signal peptides, whereas 238 did not; the remaining 128 proteins gave ambiguous results. Among the 463 signal peptide-containing proteins, 181 had a single putative transmembrane segment, 58 had multiple transmembrane segments, and 224 had no predicted transmembrane segment. Transmembrane segments were also found in 193 of 238 proteins that had no detectable signal peptide (76 with single transmembrane segments and 117 with multiple transmembrane segments). We also identified 45 "ovalbumin-like" N-linked glycoproteins with no detectable signal peptide or transmembrane sequence. Thus, our predictions suggest that the N-glycosylated proteins identified in this study included 224 secretory, 181 Type I, and 76 Type II/III proteins (Supplemental Table 4).
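The grouping described above depends only on two predictions per protein: whether a signal peptide is present and how many transmembrane segments are found. A minimal sketch of that decision rule follows; the function name and group labels are hypothetical, and the counts are those reported in the text, included here only to check that they add up.

```python
def classify(signal_peptide, n_tm):
    """Group a glycoprotein by predicted signal peptide and TM segment count.

    signal_peptide is True, False, or None (None = the predictors disagreed).
    """
    if signal_peptide is None:
        return "ambiguous"
    if signal_peptide:
        if n_tm == 0:
            return "secretory"
        return "Type I" if n_tm == 1 else "SP + multiple TM"
    if n_tm == 0:
        return "ovalbumin-like"
    return "Type II/III" if n_tm == 1 else "no SP + multiple TM"


# Counts reported in the text, used only as a consistency check.
REPORTED = {
    "secretory": 224, "Type I": 181, "SP + multiple TM": 58,
    "Type II/III": 76, "no SP + multiple TM": 117,
    "ovalbumin-like": 45, "ambiguous": 128,
}

if __name__ == "__main__":
    print(classify(True, 1), classify(False, 1), classify(None, 2))
    print(sum(REPORTED.values()))
```

The seven groups sum to the 829 glycoproteins, and the two single-span groups (181 + 76) give the 257 proteins used in the topological analysis.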
Analysis of Integral Membrane Glycoproteins Containing a Signal Peptide-Because the initial protein glycosylation takes place only in the ER lumen and the glycosylated segments do not cross the membrane bilayer, the structural segment around the glycosylated site must face toward the ER lumen or an equivalent topological space (e.g. the Golgi lumen). Our topological analysis of membrane glycoproteins was based on the positions of experimentally determined glycosylation sites and putative transmembrane segments on the polypeptide chains. To simplify our analysis, we focused on the 257 glycoproteins containing a single transmembrane segment. The translated polypeptide segment that follows the signal peptide is introduced into the ER lumen through the translocon. If the transmembrane segment is translated and recognized by the translocon, it exits laterally from the channel into the membrane lipid, and the C-terminal portion remains in the cytosol (36). Therefore, for single span transmembrane proteins containing a signal peptide, the portion N-terminal to the transmembrane segment resides in the ER lumen. We assigned 181 of these as Type I transmembrane proteins, and 160 of those were actually glycosylated only on the N-terminal side of the transmembrane segment (Fig. 2). Twelve proteins were glycosylated on the C-terminal side of the transmembrane segment, and nine were glycosylated on both sides. Thus, our ability to predict signal peptides and transmembrane segments was quite good (160/181 = 88%) for single span transmembrane proteins. Fig. 2 indicates that the Type I transmembrane proteins generally have an extracellular or luminal N-terminal segment that is much longer than the intracellular C-terminal segment: the average length of N- and C-terminal segments was 577 and 104 residues, respectively. Typical examples having a long N-terminal segment include membrane-anchored enzymes such as phospholipase, carboxypeptidase, and UDP-glucosyltransferase, and extracellular matrix proteins such as cadherin and integrin. Conversely a number of enzymes, including tyrosine kinase, tyrosine phosphatase, and guanylate cyclase, have a long (>100-residue) C-terminal cytosolic segment. Besides those single span transmembrane proteins, we detected many Type I transmembrane proteins with a signal peptide and multiple transmembrane segments (Supplemental Table 3).
Table I footnote: The numbers in parentheses are the relative rate estimated from the actual rate (= number of occurrences/1,178) and the expected rate from the natural abundance of each amino acid.
Table II legend: The number of proteins classified into each group is indicated in parentheses. The presence of a signal peptide (SP) and the number of transmembrane segments (TM) were predicted as described in the text. Y illustrates the position of an oligosaccharide attached to the polypeptide chain. EGFR, epidermal growth factor receptor; ABC, ATP-binding cassette.
Fig. 2 legend (partial): In this figure, the Type I single span transmembrane proteins are tentatively classified into those having a small (<100-residue) or a large (>100-residue) cytosolic segment, indicated by a or b, respectively. In a number of glycoproteins (c), the glycosylated sites are located in the putative C-terminal segment or at both sides of the transmembrane segment probably due to inaccurate prediction of the transmembrane segment.
Analysis of Integral Membrane Glycoproteins Lacking a Signal Peptide-We performed a similar analysis on the single span proteins lacking a signal peptide (Fig. 3). Of 76 proteins, 48 and 26 were assigned as Type II and Type III transmembrane proteins, respectively. Only two proteins were modified on both sides of the transmembrane segment. Fig. 3 shows the positions of the glycosylation sites and the putative transmembrane segment on the polypeptide chains of Type II and Type III proteins. In the Type II proteins, the putative transmembrane segment appeared immediately after a short N-terminal sequence. The N-terminal segments that preceded the transmembrane segment had an average length of 82 residues with 39 of 48 Type II proteins (~80%) containing segments of less than 100 residues. This suggests that the transmembrane segment, or internal signal anchor, may replace the signal peptide function when the nascent polypeptide is targeted to the ER. The Type III transmembrane proteins, however, are clearly different from the Type II proteins in the length of their N-terminal segments. Namely, Type III proteins have an average of 520 residues between the N terminus and transmembrane segment with the length ranging from 36 to 2,086 residues. Unlike the Type II proteins, 22 of 26 (~85%) Type III proteins have long N-terminal segments often far exceeding 100 residues (Fig. 3). For example, UNC-5 (netrin receptor/axon guidance protein) has a ~350-residue N-terminal segment that consists of an extracellular Ig-like domain (Prosite document PDOC50835 in ExPASy, www.expasy.org) and two thrombospondin domains (Prosite PDOC50092). The C01G6.8 gene product (CAM-1) has an N-terminal ~450-residue segment consisting of the Ig-like, Frizzled (Prosite PDOC50038), and Kringle2 (Prosite PDOC50020) domains upstream of the putative transmembrane segment. Our prediction that UNC-5 and CAM-1 have a Type III topology is reasonable because both proteins have typical intracellular domains, ZU5 (Prosite PDOC51145) and the death (Prosite PDOC50017) domains (UNC-5) or a protein kinase domain (CAM-1), downstream of their putative transmembrane segments.
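The topological assignments in this section follow from one constraint: a glycosylated segment must face the ER lumen. The sketch below is one way to express that rule for single span proteins. It is an interpretation of the text rather than the procedure actually used: the function names and coordinates are hypothetical, and Type II is taken to mean sites on the C-terminal side of the transmembrane segment and Type III sites on the N-terminal side, both without a signal peptide.

```python
def site_sides(glyco_sites, tm_start, tm_end):
    """Report which sides of the TM segment carry glycosylation sites."""
    sides = set()
    for pos in glyco_sites:
        if pos < tm_start:
            sides.add("N")
        elif pos > tm_end:
            sides.add("C")
        else:
            sides.add("TM")  # site falls inside the predicted segment
    return sides


def assign_topology(has_signal_peptide, glyco_sites, tm_start, tm_end):
    """Assign a single span glycoprotein to Type I, II, or III.

    The side of the TM segment that carries the sites is taken as luminal.
    Cases that conflict with the prediction are labeled 'inconsistent'.
    """
    if not glyco_sites:
        raise ValueError("at least one glycosylation site is required")
    sides = site_sides(glyco_sites, tm_start, tm_end)
    if sides == {"N"}:
        return "Type I" if has_signal_peptide else "Type III"
    if sides == {"C"}:
        return "inconsistent" if has_signal_peptide else "Type II"
    return "inconsistent"
```

Proteins glycosylated on both sides of the predicted segment, such as the two cases noted above, fall into the "inconsistent" group, which most likely reflects an inaccurate transmembrane prediction.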

Conclusion
We identified a series of N-linked glycoproteins by collecting multiple subsets of N-linked glycopeptides from tryptic digests of soluble and insoluble fractions of crude extracts of C. elegans using three types of lectin columns followed by IGOT-LC-MS/MS analyses of the glycopeptides. The analysis yielded a large dataset of N-glycosylated proteins as well as the sites of glycosylation (Supplemental Table 2). Although the IGOT strategy reveals the status of glycosylation of cellular proteins, it also implies the structure of N-glycans attached to the polypeptide chain (Supplemental Table 3). The oligosaccharide chain is synthesized without a template, and the final oligosaccharide structure is determined by many factors, such as the types of glycosyltransferases and glycosidases in the cell, enzyme/substrate concentrations, and the primary and higher order structures of the substrate protein, etc. The specific structure of the oligosaccharide chain contributes to the structure/function of particular glycoproteins, and there is an expanding number of human diseases with defects in glycoconjugate assembly and processing (11,12). This study suggests that OST, a single multienzyme complex in C. elegans, recognizes the consensus tripeptide glycosylation sequence as well as additional determinants to introduce an oligosaccharide to an Asn residue; however, it is still difficult to predict which consensus sites will undergo glycosylation and the subsequent processing that yields the final oligosaccharide structure. Clearly more extensive analysis is required to understand the glycosylation machinery of the cell.
Based on the structural characteristics of putative single span transmembrane glycoproteins identified in this study, we propose an atypical translocation mechanism for Type III transmembrane proteins in which newly synthesized polypeptide is post-translationally translocated into the ER through the translocon (Fig. 3). The mechanism underlying transmembrane protein topogenesis remains controversial (38) mainly because definitive structural data on integral membrane proteins are limited. In particular, there have been few naturally occurring Type III proteins reported (e.g. synaptotagmin II (38), microsomal cytochrome P450 (39), and yeast Golgi ecto-ATPase Ynd1p (40)), and therefore Type III transmembrane proteins are thought to have unusual topologies. It is generally accepted that the topology of Type II/III proteins is determined by the interaction between the translocon channel and the transmembrane segment of each protein essentially according to the "positive-inside" rule (38,41). It is also believed that Type III transmembrane proteins have a relatively short N-terminal segment preceding the signal anchor or transmembrane sequence. This is due to the fact that the translocation of this segment is a post-translational event, and thus it must retain an unfolded structure to pass through the translocon. Nevertheless, our results indicate that Type III transmembrane proteins are distributed much more widely among eukaryotic cells than previously thought and that they have an apparently longer N-terminal segment compared with Type II proteins. The mechanism of the generation of Type III proteins must therefore involve either a chaperonin-like activity (to maintain the long N-terminal polypeptide segment in an unfolded structure) or the unfolding of the tentatively folded structure prior to translocation. It is also likely that an energy-dependent mechanism is required to insert the nascent protein into the translocon of the ER.
Our analysis for the Type III proteins assigned here indicates that many have an N-terminal structure stabilized by disulfide bridges, such as Ig-like domain, Frizzled domain, Kringle2, FN3 (fibronectin type III), EGF3 (epidermal growth factor), Sushi, LDLRA2 (LDL-receptor class A), LDLRB (class B), and AMOP (adhesion-associated domain in MUC4 and other proteins) domains. We assume that these domains retain a relatively flexible, unfolded structure in the reducing environment of the cytosol that might be advantageous for the post-translational translocation of the polypeptide into the ER. Thus, our findings suggest that the Type III transmembrane protein architecture is widespread in eukaryotic cells. Further studies of the mechanism that determines the topogenesis of integral membrane proteins will assess the validity of these predictions.