Analysis of Human C1q by Combined Bottom-up and Top-down Mass Spectrometry

C1q is a subunit of the C1 complex, a key player in innate immunity that triggers activation of the classical complement pathway. Featuring a unique structural organization and comprising a collagen-like domain with a high level of post-translational modifications, C1q represents a challenging protein assembly for structural biology. We report for the first time a comprehensive proteomics study of C1q combining bottom-up and top-down analyses. C1q was submitted to proteolytic digestion by a combination of collagenase and trypsin for bottom-up analyses. In addition to classical LC-MS/MS analyses, which provided reliable identification of hydroxylated proline and lysine residues, sugar loss-triggered MS3 scans were acquired on an LTQ-Orbitrap (Linear Quadrupole Ion Trap-Orbitrap) instrument to strengthen the localization of glucosylgalactosyl disaccharide moieties on hydroxylysine residues. Top-down analyses performed on the same instrument allowed high accuracy and high resolution mass measurements of the intact full-length C1q polypeptide chains and the iterative fragmentation of the proteins in the MSn mode. This study illustrates the usefulness of combining the two complementary analytical approaches to obtain a detailed characterization of the post-translational modification pattern of the collagen-like domain of C1q and highlights the structural heterogeneity of individual molecules. Most importantly, three lysine residues of the collagen-like domain, namely Lys59 (A chain), Lys61 (B chain), and Lys58 (C chain), were unambiguously shown to be completely unmodified. These lysine residues are located about halfway along the collagen-like fibers. They are thus fully available and in an appropriate position to interact with the C1r and C1s protease partners of C1q and are therefore likely to play an essential role in C1 assembly.

sites between C1q and the tetramer are confined to the collagen region (11). Electron microscopy studies of the crosslinked C1 complex showed that the tetramer is located in the cone formed by the six collagen-like stems between the globular heads and the stalk (12,13). Electron microscopy and neutron scattering studies of isolated C1s-C1r-C1r-C1s indicated an extended conformation of the tetramer in solution (14,15). However, the volume inside the C1q cone is too small to contain the entire elongated tetramer. Based on these considerations, on neutron scattering studies of C1 (16), and on the C1s CUB1-EGF homodimer x-ray structure, a structural model of C1 has been proposed in which the tetramer undergoes a large conformational change into a compact "eight-shaped" structure wound in and out of the C1q stems (17). This model has recently been refined based on the identification of the C1q binding sites of C1r and C1s (6). It has been proposed that both C1r/C1s CUB1-EGF-CUB2 heterodimers are located inside the C1q cone and mediate ionic interactions through acidic residues contributed by the C1r and C1s CUB modules. Such ionic interactions at the C1q/C1s-C1r-C1r-C1s interface are expected to involve lysine residues of the collagen-like stems of C1q, as suggested by the observation that chemical modification of C1q with lysine-specific reagents inhibits C1 assembly and C1q hemolytic activity (4,18).
Being collagen-like, the sequences of the CLR chains contain the repeating Gly-X-Z triplet where X is often a proline and Z is frequently a hydroxylysine or a hydroxyproline. C1q contains 8.3% carbohydrate (7), most of it being present in the CLR as glucosylgalactosyl disaccharide units linked to hydroxylysine residues (19). Because of the steric hindrance arising from the disaccharide moieties, non-glycosylated lysine residues are more plausible candidates for ionic interactions at the C1q/C1s-C1r-C1r-C1s interface. It was previously reported that 82.6% of the hydroxylysines are glycosylated (19), so that few unmodified lysine residues are available for this purpose. The primary structures initially reported in the 1970s did not allow the post-translational modifications (PTMs; hydroxylations and glycosylations) to be identified with full accuracy along the CLR chains (20-25). Furthermore, the full-length C1q protein has not yet been analyzed by nuclear magnetic resonance or x-ray crystallography, so the basic residues of the C1q stems involved in the C1q/tetramer interface remain to be identified. We have previously reported the MALDI-TOF mass spectrometry analysis of the entire C1q protein (26). The aim of the present study was to identify the C1q PTMs by MS and locate them along its chains to obtain further insights into the lysine residues likely involved in the interaction with the C1s-C1r-C1r-C1s tetramer. For this purpose, we performed for the first time a comprehensive proteomics study of C1q by using two complementary approaches. On the one hand, C1q and CLR samples were digested by a combination of collagenase and trypsin, and the resulting peptides were analyzed by capillary LC-MS/MS (nano-LC-MS/MS). On the other hand, the bottom-up approach was complemented by top-down analyses consisting of the direct infusion into the LTQ-Orbitrap instrument of the intact CLR domain and of the full-length A, B, and C chains of C1q.
Spectra were successively acquired in the MS mode in the Orbitrap analyzer to obtain high accuracy and high resolution measurements of the intact protein masses and in the MSn mode to yield iterative fragmentation spectra of the proteins. The combination of bottom-up and top-down data proved most successful in yielding a significant coverage of the three chain sequences and in deciphering the heterogeneity of the C1q covalent modifications both in terms of proline hydroxylation and lysine glycosylation, thus revealing an unexpected level of complexity of the post-translational modifications.
Preparation of C1q and Its CLR-Human C1q was purified as described previously (27). For comparison, a commercial preparation of human C1q was purchased from Calbiochem. The CLR of C1q was prepared as described previously (28,29).
Protein Reduction and Alkylation-A first series of protein samples was prepared as follows. Twenty microliters of commercial and noncommercial C1q as well as 10 µl of CLR (20, 17.8, and 32 µg of protein, respectively) were reduced by incubation for 1 h at 60°C in the presence of 4.5 mM DTT. Samples were then alkylated by addition of 22 mM IAA and incubation for 45 min at room temperature in the dark. In view of the previously reported overalkylation of proteins by IAA (30), a second series of C1q samples was prepared by reduction with 5 mM tris(2-carboxyethyl)phosphine at 37°C for 45 min and alkylation with 10 mM MMTS at room temperature for 15 min.
Protein Digestion by Collagenase and Trypsin-Reduced and alkylated samples were supplemented with collagenase to obtain an enzyme to substrate ratio of 0.5 (w/w) and incubated for 3 h at 37°C. Before performing trypsin digestion of the mixture containing the C1q globular regions and collagenase, 50 mM NH4HCO3, pH 8.0, was added to the samples to reach a final concentration of 20 mM to obtain pH conditions more suitable to the serine protease activity (pH >7.5). Two and 3 µl of trypsin at 0.4 µg/µl were then added to each C1q and CLR sample, respectively, before incubation for 18 h at 37°C.
Reversed Phase LC-MS/MS Analysis on QqTOF Instrument-A first series of LC-MS/MS analyses was performed using a Famos-Switchos-UltiMate chromatographic system (LC Packings/Dionex) coupled to a hybrid quadrupole QqTOF mass spectrometer QSTAR Pulsar i (Applied Biosystems/MDS Sciex) equipped with a nanoelectrospray source Protana XYZ manipulator (Protana, Odense, Denmark). Protein digests (1 pmol) were loaded onto a C18 precolumn (PepMap100 C18, 300-µm inner diameter, 5-mm length, 5-µm particle size, 100-Å porosity; Dionex) for desalting and concentrating at a flow rate of 30 µl/min in solvent A (water/acetonitrile/formic acid, 98:2:0.1, v/v/v). Peptides were then eluted from the precolumn and separated on a capillary column (PepMap C18, 75-µm inner diameter, 150-mm length, 3-µm particle size, 100-Å porosity; Dionex) at 200 nl/min using a gradient as follows: solvent A for 5 min, linear increase to 60% solvent B (water/acetonitrile/formic acid, 10:90:0.1, v/v/v) in 50 min, then ramp to 90% B in 5 min (held 10 min), and return to 100% A in 5 min for a 15-min-long re-equilibration of the columns. The autosampler was kept at 10°C. Peptides eluting from the column were analyzed using the information-dependent data acquisition feature in the Analyst QS software v1.1: species ionized in the nano-ESI source were detected for 1 s in the MS mode, and the three most intense signals associated with either doubly or triply charged species were subsequently selected to be fragmented in the MS/MS mode for 3 s each. MS detection and MS/MS acquisitions were performed over the m/z ranges 400-1400 and 100-2000, respectively. Analyses were carried out with the dynamic exclusion of already fragmented m/z values for 3 min.
Reversed Phase LC-MS/MS Analysis on LTQ-Orbitrap Instrument-A second series of LC-MS/MS analyses was performed using an UltiMate 3000 chromatographic system (Dionex) coupled to a hybrid LTQ-Orbitrap XL mass spectrometer (Thermo Fisher Scientific). Typically, 200-300 fmol of digested C1q or CLR were injected onto a C18 precolumn (Dionex). After desalting for 5 min with buffer A (0.1% formic acid in water), peptides were separated on a capillary column (same reference as the one used on the QqTOF instrument) using a gradient from 100% solvent A to 60% solvent B (water/acetonitrile/formic acid, 10:90:0.1, v/v/v) in 60 min. The column was then further washed with 95% solvent B for 10 min. One series of MS analyses (ITMS2 analyses) consisted of acquiring cycles composed of one MS scan in the Orbitrap analyzer (profile mode; resolution, 15,000; m/z range, 400-2000) followed by three MS/MS scans (CID fragmentation and detection in the linear ion trap analyzer; centroid mode; isolation width, 2 Da) triggered on the three most intense species detected in the preceding MS scan. Singly charged species were excluded from fragmentation; dynamic exclusion of already fragmented ions was applied for 90 s with a repeat count of 1, a repeat duration of 20 s, and an exclusion mass width of ±5 ppm. Automatic gain control allowed accumulation of up to 5 × 10^5 ions for FTMS scans, 10^5 ions for FTMSn scans, and 10^4 ions for ITMSn scans. The maximum injection time was 100 ms for acquiring FTMS and ITMSn scans. Only one microscan was acquired for each scan type, although three were accumulated in FTMS mode in initial experiments. When testing acquisitions consisting of CID fragmentation in the ion trap analyzer and detection of the resulting fragments in the Orbitrap (OTMS2 analyses), the maximum injection time was 200 ms.
In experiments combining MS2 and neutral loss-triggered MS3 scans (MS2/MS3 analyses), MS2 spectra were only acquired in the ion trap on the two precursor ions giving the most intense signals in MS, and MS3 was launched whenever a neutral loss of m/z 108.035, 162.053, 216.070, 324.106, or 486.158 was detected with a ±0.5-Da tolerance among the eight most intense MS2 fragments. MS3 was targeted at the MS2 fragment corresponding to the largest neutral loss. For these MS2/MS3 analyses, precursor selection windows of 2 and 5 Da were used for MS2 and MS3 scans, respectively. In addition, three microscans were accumulated to build an MS3 scan. Acquired raw data were processed by the software Bioworks to create Mascot-compatible .MGF files (no grouping of MS/MS scans was allowed).
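The neutral loss trigger described above can be illustrated with a minimal sketch. It only mimics the selection rule stated in the text (eight most intense MS2 fragments, ±0.5 tolerance, largest matching loss wins); the function name `pick_ms3_target` and the peak-list representation are hypothetical and do not reflect the instrument control software.

```python
# Neutral-loss values (m/z units) and tolerance quoted in the text.
NEUTRAL_LOSSES = (108.035, 162.053, 216.070, 324.106, 486.158)
TOLERANCE = 0.5
TOP_N = 8  # only the eight most intense MS2 fragments are inspected


def pick_ms3_target(precursor_mz, ms2_peaks):
    """Return (fragment_mz, matched_loss) for the MS3 target, or None.

    ms2_peaks is a list of (mz, intensity) tuples (hypothetical format).
    """
    # Keep the TOP_N most intense fragments.
    top = sorted(ms2_peaks, key=lambda p: p[1], reverse=True)[:TOP_N]
    best = None
    for mz, _intensity in top:
        observed_loss = precursor_mz - mz
        for loss in NEUTRAL_LOSSES:
            if abs(observed_loss - loss) <= TOLERANCE:
                # Prefer the fragment showing the largest listed loss.
                if best is None or loss > best[1]:
                    best = (mz, loss)
    return best


# Toy spectrum: two fragments match listed losses; the larger loss is chosen.
peaks = [(737.95, 100.0), (575.90, 80.0), (450.20, 50.0)]
target = pick_ms3_target(900.0, peaks)
```

With this toy input, the fragment at m/z 575.90 is selected because it corresponds to the larger of the two matching losses.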
Protein Identification Using Nano-ESI Direct Infusion on LTQ-Orbitrap-Five microliters of reduced non-commercial C1q and of CLR were desalted on a ZipTip C4 (Millipore), eluted in water/acetonitrile/formic acid (50:50:0.5, v/v/v) to 5 pmol/µl, and loaded into a metallized nanoelectrospray needle (PicoTip emitters, reference number BG12-69-2-CE-20, New Objective). A spray was obtained while working at a capillary temperature of 240°C and adjusting the voltage applied to the nano-ESI tip between 1.4 and 2.4 kV. Automatic gain control parameters were set to 2 × 10^6 for FTMS, 2 × 10^5 for FTMS2, and 10^4 for ITMSn scans. Target resolution was 60,000 for FTMS and FTMS2 analyses. The maximum injection time was set to 500 ms in FTMS and FTMS2 and to 100 ms in ITMSn scans. The precursor selection width for FTMS2 fragmentation was around 3 Da and always adjusted to find a compromise between sensitivity and the clean selection of a single isotopic distribution. The selection window for ITMSn fragmentation (n ≥ 3) was 5 Da. The spectra shown in this study correspond to the accumulation of scans over approximately 1 min, yet good signal to noise ratios could be obtained within less time.
Database Searches-LC-MS/MS data (.MGF files obtained from .WIFF or .RAW data) were searched using the Mascot software (Matrix Science). Acquired data were compared with a homemade database named "mature C1q" consisting of only five sequences extracted from Swiss-Prot database release 54.7: porcine trypsin (accession number P00761), collagenase (accession number Q9X721), and the three mature protein sequences, C1qA, C1qB, and C1qC (accession numbers P02745, P02746, and P02747; the signal peptides 1-22, 1-25, and 1-28, respectively, were removed from the sequences stored in Swiss-Prot to obtain the mature sequences). Searches were first performed while considering that the analyzed peptides resulted from the combined use of collagenase and trypsin (collagenase + trypsin search). The sequential use of collagenase and trypsin to digest C1q was considered to produce peptides resulting from cleavages N-terminally of glycine residues and C-terminally of lysine and arginine residues. Collagenase is indeed known to specifically degrade collagen by cleaving N-terminally of GPX triplets. In a second step, searches were run by considering that trypsin may have produced half-tryptic peptides (collagenase + semitrypsin search). In both searches, nine missed cleavages were allowed. Finally, an error-tolerant search was performed after the two previous search steps to allow in particular identification of the N-terminal region of chain C1qB, which is known to be cyclized into pyrrolidone carboxylic acid. For data obtained on the QqTOF instrument, tolerances on mass measurement of precursors and fragments were set to 50 ppm and 0.4 Da, respectively. For Orbitrap data, tolerances on mass measurements were 5 ppm (FTMS) and 0.8 Da (ITMSn).
Cysteine residues were considered to be fully alkylated either by IAA or MMTS; hydroxylation of lysine and proline residues, oxidation of methionine residues, and glucosylgalactosyl modification on hydroxylysines were included as potential modifications. We modified the definition of the modification "glucosylgalactosyl (Lys)" initially present in Mascot for neutral losses of 162.0528 and 324.1056 mass units to be taken into account for scoring.
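The cleavage rule assumed for the collagenase + trypsin search (cuts N-terminally of Gly and C-terminally of Lys/Arg, with up to nine missed cleavages) can be sketched as a simple in silico digestion. This is an illustrative sketch only, not the Mascot enzyme definition; the function names are hypothetical, and the test sequence is the unmodified form of stretch 69-90 of C1q B quoted later in the text, numbered here from 1.

```python
def cleavage_sites(seq):
    """Indices at which the chain is cut (0-based, cut occurs before index)."""
    sites = set()
    for i, aa in enumerate(seq):
        if aa == "G" and i > 0:
            sites.add(i)            # cut N-terminally of Gly
        if aa in "KR" and i < len(seq) - 1:
            sites.add(i + 1)        # cut C-terminally of Lys/Arg
    return sorted(sites)


def digest(seq, max_missed=9):
    """Return (start, end, peptide) tuples with 1-based inclusive positions."""
    bounds = [0] + cleavage_sites(seq) + [len(seq)]
    peptides = []
    for i in range(len(bounds) - 1):
        # j - i - 1 is the number of missed cleavages inside the peptide.
        for j in range(i + 1, min(i + 2 + max_missed, len(bounds))):
            peptides.append((bounds[i] + 1, bounds[j], seq[bounds[i]:bounds[j]]))
    return peptides


stretch = "GPKGGPGAPGAPGPKGESGDYK"  # residues 69-90 of C1q B, modifications omitted
fully_cleaved = [pep for _, _, pep in digest(stretch, max_missed=0)]
all_peptides = digest(stretch)       # nine missed cleavages allowed
```

Because Gly occurs every third residue in the collagen-like region, complete cleavage yields very short peptides, which is presumably why a large number of missed cleavages had to be allowed in the search.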

Bottom-up Analysis of C1q and Its CLRs
Three types of samples were analyzed to determine the PTMs of C1q: laboratory-purified C1q, commercial C1q, and CLR. Each of them was digested using successively collagenase and trypsin and analyzed by LC-MS/MS. To facilitate reading, hydroxylated proline and lysine residues are noted P* and K*, respectively, and a lysine residue bearing a glucosylgalactosyl (Glc-Gal) moiety is noted K# in sequences.
Overall Sequence Coverage of Chains A, B, and C-The digested samples were first characterized by triplicate LC-MS/MS analyses on QqTOF and LTQ-Orbitrap instruments (MS2 analyses). In the latter case, fragmentation and detection of the fragments were carried out in the linear ion trap of the hybrid instrument (ITMS2 analyses). The peptide sequences identified in the three chains constituting C1q are shown in Fig. 2. Database searches were performed assuming that collagenase would cleave N-terminally of glycine residues within the CLR regions (see "Experimental Procedures"). Indeed, nearly all identified sequences resulted from cleavages N-terminally of glycine residues and/or C-terminally of Lys/Arg residues. It is worth mentioning that, except for glycosylated peptides (see below), the identification of a peptide was routinely validated only if a minimal continuous sequence stretch of five amino acids was identified from y-type or b-type fragment ions. Nonetheless, the frequent occurrence of proline residues at the Z position within GXZ triplets led us to be more tolerant: some peptide identifications were accepted when series of y ions, separated by three residues, corresponded to fragments containing an N-terminal proline (31,32).
Identification of Hydroxylated Sequences-The CLR of C1q contains the repeating triplet Gly-X-Z where Z is frequently a proline or a hydroxyproline residue. In most cases, peptides containing hydroxylated residues (but no glycosylation) fragmented in a manner similar to non-modified species, allowing confident determination of their sequence; yet, as stated above, the frequent occurrence of proline residues was detrimental to the detection of continuous y/b fragment series. As presented in Fig. 2, several proline residues of the three C1q chains were detected in both a non-modified and a hydroxylated form, indicating incomplete modification. For instance, peptide 72 GPMGIPGEPGEEGR 85 of C1q C was confidently identified in a non-modified form as well as in singly and doubly hydroxylated forms with modifications observed on residue Pro77 alone and additionally on Pro80. Taking into account the possible oxidation of methionine residues, this peptide was finally identified as doubly hydroxylated (on both Pro77 and Pro80) and oxidized on Met74. Although oxidation and hydroxylation yield equal mass increments, the detected MS2 fragments ruled out the possible hydroxylation of Pro73. Several peptides containing one or two glycosylated lysine residues and fully or partially hydroxylated proline residues were also identified. Although identification of glycosylated sequences was often tedious (see section below), Pro23 and Pro35 from C1q A were clearly identified as being fully modified, whereas Pro32 was confidently determined to be partially modified due to the detection of several overlapping peptides.
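The point that methionine oxidation and proline hydroxylation are isobaric, yet distinguishable from fragment ions, can be illustrated with a short calculation on peptide 72-85 of C1q C. The sketch below uses standard monoisotopic residue masses; the helper names and the choice of b ions as the discriminating series are assumptions made for illustration, since the text does not state which fragments were decisive.

```python
# Standard monoisotopic residue masses (Da) for the residues of this peptide.
MONO = {"G": 57.02146, "P": 97.05276, "M": 131.04049,
        "I": 113.08406, "E": 129.04259, "R": 156.10111}
WATER = 18.01056
PROTON = 1.00728
OXYGEN = 15.99491  # one oxygen: Met oxidation or Pro hydroxylation

SEQ = "GPMGIPGEPGEEGR"  # residues 72-85 of C1q C


def peptide_mass(seq, mods=None):
    """Neutral monoisotopic mass; mods maps 1-based position -> mass shift."""
    mods = mods or {}
    return sum(MONO[aa] for aa in seq) + sum(mods.values()) + WATER


def b_ions(seq, mods=None):
    """Singly charged b-ion m/z values b1..b(n-1)."""
    mods = mods or {}
    ions, running = [], 0.0
    for pos, aa in enumerate(seq[:-1], start=1):
        running += MONO[aa] + mods.get(pos, 0.0)
        ions.append(running + PROTON)
    return ions


met74_oxidized = {3: OXYGEN}       # Met74 is position 3 of the peptide
pro73_hydroxylated = {2: OXYGEN}   # Pro73 is position 2 of the peptide

# Same precursor mass for both candidates ...
same_precursor = abs(peptide_mass(SEQ, met74_oxidized)
                     - peptide_mass(SEQ, pro73_hydroxylated)) < 1e-9
# ... but b2 (Gly-Pro) differs by one oxygen, whereas b3 is identical.
b2_shift = b_ions(SEQ, pro73_hydroxylated)[1] - b_ions(SEQ, met74_oxidized)[1]
b3_shift = b_ions(SEQ, pro73_hydroxylated)[2] - b_ions(SEQ, met74_oxidized)[2]
```

In this sketch, only fragments that separate position 2 from position 3 carry the information needed to tell the two candidates apart, which is consistent with the need to inspect MS2 fragments rather than the precursor mass alone.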

FIG. 2. Peptides identified by LC-MS/MS analyses of C1q successively digested with collagenase and trypsin using QqTOF and LTQ-Orbitrap instruments.
Green lines, peptides identified in both laboratory-purified and commercial C1q; black and violet lines, peptides only identified in either laboratory-purified or commercial C1q, respectively; blue lines, peptides specifically identified from the "collagenase + semitrypsin" database search; dotted line (in C1q C), sequence GLPGPK#GEP* identified from the "collagenase + trypsin" search was finally rejected in favor of sequence GPK#GEPGIP* (collagenase + semitrypsin) because the latter better matched the experimental spectrum. Blue squares, hydroxylation; red squares, Glc-Gal modification; green squares, deamidation; empty square, pyrrolidone cyclization. When precise localization of one (two) hydroxylation(s) could not be established from MS/MS spectra, an extended striped blue/white rectangle overlapping the possibly modified residues (Pro or Lys) is shown.

Identification of Glycosylated Sequences-Composition analysis has indicated that C1q contains 8.3% carbohydrate (7), most of it being present in the CLR as Glc-Gal disaccharide units linked to 82.6% of the hydroxylysine residues (19). In contrast to peptides containing only hydroxylated residues, glycosylated sequences systematically provided MS2 spectra dominated by sequential neutral losses of saccharide moieties (162.05 mass units), whereas fragmentation along the peptide backbone produced peaks of minor intensity, often precluding robust sequence determination and localization of the glycosylation site. We therefore tested different fragmentation/detection schemes on the LTQ-Orbitrap instrument to obtain complementary and redundant identification of glycosylated sequences and thus increase confidence in the identification of lysine residues bearing a Glc-Gal motif.
The MS2 data acquired on the QqTOF and LTQ-Orbitrap instruments allowed identification of the glycosylated sequences shown in Fig. 2. All sequences identified by the Mascot software are shown provided that all fragment peaks of major intensities matched the theoretical fragments expected for the proposed sequence. We checked the relevance of hydroxylation positioning and reintroduced uncertainty when the detected fragments were insufficient to definitely localize this modification on a specific proline or lysine residue (see for example peptide 39-59 of C1q A). Some peptides bearing one glycosylation and containing one lysine residue could be readily identified; supplemental Fig. S1 shows an MS2 spectrum obtained on the QqTOF instrument and unambiguously identifying peptide 39 TGIQGLK#GDQGEP 51 from C1q A. Determination was more challenging when a glycosylation site(s) occurred within sequences containing more lysine residues than the number of disaccharide moieties. This is exemplified in sequences 73-92 from C1q A, 69-90 from C1q B, and 42-58 from C1q C, which all contain three lysine residues and were identified as being doubly modified by Glc-Gal motifs. The Mascot search software then often provided identification of sequences corresponding to the different possible combinations of free and glycosylated lysines. For example, peptide 69-90 from C1q B was proposed to be modified either on Lys71 and Lys83 or on Lys83 and Lys90. Interestingly, glycosylated lysine residues were sometimes detected as still bearing one sugar moiety (the lysine was modified by 178.0477 mass units) during fragmentation in the linear ion trap of the LTQ-Orbitrap instrument; this behavior was never observed using the QqTOF instrument with which initially glycosylated lysine residues were systematically detected in a simply hydroxylated form in MS2 spectra.
Two spectra showing lysine residues carrying a single sugar unit during MS/MS fragmentation are provided in supplemental Fig. S2, A and B; such fragmentation patterns helped us confirm the glycosylated nature of some lysine residues. More generally, to discriminate which lysine residues were most probably modified by Glc-Gal, we determined the best consensus resulting from the different proposed glycosylated sequences. Taking again the doubly glycosylated sequence 69-90 from C1q B as an example, the identification of peptide 69 GPK#GGPGAP*GAP* 80 unambiguously pointed to Lys71 as being glycosylated (Fig. 2); additionally, several MS2 spectra identifying sequence 69 GPK#GGPGAP*GAP*GP(KGESGDYK) 90 # (the glycosylation is located in the partial sequence in parentheses) allowed detection of the y2 ion at m/z 310.176, indicating that the C-terminal lysine was free. We therefore settled on the consensus sequence 69 GPK#GGPGAP*GAP*GPK#GESGDYK 90. This reasoning, which aims to deduce the most probable modified residues in the three C1q chains from the largest number of converging identified sequences, was applied systematically. The conclusions derived from MS2 analyses are collected in Fig. 3. In addition, identification of glycosylated sequences was further confirmed by performing LC-MS/MS analyses of digested CLR and detecting both the precursor and the fragment ions in the Orbitrap cell (OTMS2 analyses) (35). Visual inspection of the fragmentation data allowed us to select MS2 spectra from glycosylated sequences by pointing to multiple losses of 162.0528 mass units. Then, by simply combining the knowledge of the accurate mass of the precursor and of the number of sugar losses, we could match certain experimental spectra to theoretical sequences of C1q obtained by in silico digestion of the three chains (considering cleavages N-terminally of Gly and C-terminally of Lys/Arg residues). The glycosylated sequences thus identified are listed in supplemental Table S1.
This highlighted the fact that the knowledge of the precursor mass at 5-ppm accuracy and of its glycosylated nature was often sufficient to unequivocally identify sequences from C1q; this observation supports the identification of glycosylated sequences obtained from the previous ITMS 2 data acquired on the LTQ-Orbitrap instrument.
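The diagnostic value of the y2 ion at m/z 310.176 for peptide 69-90 of C1q B can be checked with a short calculation. The sketch below uses standard monoisotopic masses; the function names are hypothetical, and the mass shift used for a glycosylated lysine (one hydroxylation plus two hexoses, i.e. 15.9949 + 324.1056 mass units) is taken from the increments quoted in the text.

```python
# Standard monoisotopic residue masses (Da) for the C-terminal Tyr-Lys.
MONO = {"Y": 163.06333, "K": 128.09496}
WATER = 18.01056
PROTON = 1.00728
HYDROXYLATION = 15.99491
HEXOSE = 162.05282
GLC_GAL_HYL = HYDROXYLATION + 2 * HEXOSE  # shift for a glycosylated lysine (K#)


def y_ion_mz(residues, mod_delta=0.0, charge=1):
    """m/z of a y ion built from the given C-terminal residues."""
    neutral = sum(MONO[aa] for aa in residues) + WATER + mod_delta
    return (neutral + charge * PROTON) / charge


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


y2_free = y_ion_mz("YK")                 # C-terminal lysine unmodified
y2_glyco = y_ion_mz("YK", GLC_GAL_HYL)   # C-terminal lysine bearing Glc-Gal
error = ppm_error(310.176, y2_free)      # reported value vs. free-lysine y2
```

The reported m/z 310.176 agrees with the y2 ion of an unmodified Tyr-Lys C terminus, whereas a glycosylated Lys90 would shift this ion by about 340.10 mass units.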
Identification of glycosylated sequences was hampered by the preferential loss of the labile disaccharide moieties upon CID fragmentation. A similar behavior is commonly observed with peptides phosphorylated on Ser/Thr residues, for which improved characterization has been widely described using an additional fragmentation step (MS3) triggered on the MS2 fragment corresponding to the loss of phosphoric acid (36). To more confidently identify glycosylated peptides from the C1q chains, we therefore analyzed digested C1q and CLR samples while specifying the acquisition of MS3 spectra on the MS2 fragments corresponding to the heaviest detected sugar loss from 2+ and 3+ precursor species. MS3 spectra were then interpreted by considering lysines as possibly hydroxylated. The glycosylated sequences identified from these ITMS3 analyses are represented in detail in Fig. 4 and merged in Fig. 3. Interestingly, overlapping peptides containing Lys78 and Lys81 in C1q A mostly exhibited a loss of two disaccharide moieties between MS2 and MS3 but occasionally lost only one such modification. This confirmed that these two lysines were mostly doubly glycosylated but also occurred as a glycosylated/hydroxylated pair as observed previously in MS2 analyses (Fig. 2).

Outcome of Bottom-up Analyses-The analysis of digested C1q and CLR samples provided significant sequence coverage of the three C1q chains. The majority of proline residues considered to be hydroxyproline in previous reports (Fig. 3, bold) were confirmed to be fully modified (22). Nonetheless, our study revealed a more subtle modification pattern in which some prolines (e.g. Pro31, Pro51, and Pro57 from C1q A and Pro20, Pro53, and Pro56 from C1q B), recorded as either modified or free, were in fact shown to be partially hydroxylated. In addition, C1q A, B, and C were determined to bear (minimally) four, five, and three glycosylations, respectively, on 5-hydroxylysine residues. In particular, three previously unknown glycosylation sites were identified on Lys50 of C1q B and Lys29 and Lys44 of C1q C. Besides, peptides containing Lys78 and Lys81 of C1q A were observed as either doubly or singly glycosylated; similarly, Lys83 from C1q B was mostly observed in a glycosylated form but was also found in a hydroxylated form in one peptide. Nevertheless, several sequence stretches with potential PTMs remained uncovered by this bottom-up approach, such as in the N-terminal end and the globular region of C1q A. Furthermore, these data revealed an additional, unanticipated level of post-translational heterogeneity. To increase the sequence coverage and address the question of the PTM heterogeneity at the whole protein level, we next complemented the bottom-up analyses by acquiring top-down data.

Top-down Analysis of C1q and Its Collagen-like Regions
LTQ-Orbitrap instruments have already been demonstrated to allow analysis of intact proteins up to 50 kDa, to determine their molecular mass with ppm accuracy, and to yield sequence information due to iterative MSn level analyses (37). C1q was therefore subjected to top-down characterization at the C1q A, B, and C chain levels (<26 kDa); similarly, the CLR polypeptides were analyzed by iterative MSn experiments (up to MS4). The latter sample was obtained from C1q using pepsin, a nonspecific enzyme that preferentially cleaves on the C-terminal side of aromatic residues (Phe, Trp, and Tyr) as well as Leu, Ala, Glu, and Gln (proteolysis at pH > 2), generating several possible sequences. The challenge in this study was to handle the simultaneous analysis, without prior separation by LC, of a mixture consisting of different protein chains exhibiting variable levels of hydroxylation and glycosylation. This complexity is illustrated by the mass spectra of reduced CLR and C1q acquired by direct nano-ESI infusion of the proteins on the LTQ-Orbitrap analyzer. Ionic species with charge states ranging from 7+ to 17+ and from 18+ to 28+ were detected during MS analysis of CLR (Fig. 5A) and reduced C1q (Fig. 5B), respectively. The corresponding deconvoluted spectra are provided in Fig. 6, A and B. The sequences that could be attributed to the signals detected during top-down analysis of C1q and CLR are listed in Table I. More details on the MSn data that allowed establishing these matches are provided in the next section.

Fully blue squares, residue always detected as hydroxylated; blue/white squares, residue detected either as unmodified or hydroxylated, indicating partial modification; red squares, lysine always detected in a glycosylated form; red/blue squares, lysine detected either as glycosylated or hydroxylated. ? indicates that ambiguity remains as to whether Lys10 or Lys11 of C1q A is glycosylated.

Four signals, designated A-1 to A-4 in Fig. 6A, were detected for the A chain in the CLR sample. A-1 and A-2 were attributed to sequences 1-97 and 1-95 from C1q A, respectively, decorated with eight hydroxylations and five glycosylations. The lower intensity signals A-3 and A-4 corresponded to the same sequence stretches yet bearing only four glycosylations (−340.10 mass units). These top-down data indicated a heterogeneous level of glycosylation for the A species. Because the residue pair (Lys78, Lys81) was detected by bottom-up analysis as being either doubly glycosylated or hydroxylated/glycosylated, one of these residues must be glycosylated in chains A-1 and A-2 and only hydroxylated in A-3 and A-4. Similarly, two signals, B-1 and B-2, could be attributed to sequence 1-97 from C1q B. Whereas B-1 bears 12 hydroxylations and five glycosylations, the minor fraction B-2 only carries four saccharide moieties but 13 hydroxylations (−324.1 mass units). In B-2, Lys83 could lack glycosylation considering that this residue was found by bottom-up analysis to be either glycosylated or hydroxylated. Finally, four signals were assigned to different sequence stretches of C1q C: polypeptides 1-90, 1-92, 1-93, and 1-94 could be detected; each was decorated with three glycosylations and 14 hydroxylations. The most intense species, C-1, i.e. probably the most abundant CLR C1q C polypeptide, corresponded to the cleavage C-terminally of a Phe residue in agreement with the cleavage specificity of pepsin. Each peak pattern detected for the CLR polypeptides additionally consisted of a series of ionic species by increments of 16 mass units, obviously indicating a variable level of hydroxylation (15.9949 mass units). The ranges of hydroxylation motifs present on each C1q chain are indicated in Table I.

Fig. 2; additionally, orange lines, identifications from the CLR sample. Circled in red, lysines determined to be glycosylated from these MS2/MS3 analyses; circled in black, unmodified lysines.

m/z 817.35 ion is provided in Fig. 7. Detailed inspection of MS2 fragments obtained on CLR C1q A and of the matched theoretical sequences allowed us to determine the position of some hydroxylations and glycosylations; this information is represented in Fig. 3. Based on these MS2 data, the calculated monoisotopic mass of 11,410.4432 mass units (species A-1) could be attributed definitely to the CLR sequence 1-97 of C1q A modified with five glycosylations and eight hydroxylations.
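The mass differences quoted for the minor top-down species (−340.10 mass units between A-1 and A-3, −324.1 mass units between B-1 and B-2) can be reproduced with simple arithmetic. The sketch below assumes a counting convention inferred from those two numbers, namely that one "glycosylation" stands for a lysine carrying both the hydroxyl group and the Glc-Gal disaccharide (15.9949 + 324.1056 mass units), while "hydroxylations" are counted separately; the function name is hypothetical.

```python
HYDROXYLATION = 15.99491            # one oxygen
HEXOSE = 162.05282                  # one Glc or Gal unit
GLC_GAL = 2 * HEXOSE                # disaccharide, 324.1056
GLYCOSYLATION = HYDROXYLATION + GLC_GAL  # assumed convention: Hyl + Glc-Gal


def modification_mass(n_hydroxylations, n_glycosylations):
    """Total mass added by the PTMs under the assumed counting convention."""
    return n_hydroxylations * HYDROXYLATION + n_glycosylations * GLYCOSYLATION


# A chain: eight hydroxylations, five (A-1) versus four (A-3) glycosylations.
delta_a = modification_mass(8, 4) - modification_mass(8, 5)

# B chain: 12 hydroxylations + five glycosylations (B-1)
# versus 13 hydroxylations + four glycosylations (B-2).
delta_b = modification_mass(13, 4) - modification_mass(12, 5)
```

Under this convention, the B-2 shift corresponds to the loss of the disaccharide alone, with the lysine remaining hydroxylated, which matches the interpretation given for Lys83.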

Iterative Fragmentation of CLR and C1q Chains
A Chain Characterization-Species
FIG. 5. Mass spectra of reduced CLR and C1q acquired by direct nano-ESI infusion of the proteins on an LTQ-Orbitrap instrument.
A, MS spectrum obtained on reduced CLR analyzed at a protein concentration of 5 pmol/l. B, MS spectrum obtained on reduced C1q analyzed at a protein concentration of 5 pmol/l.

FIG. 6. Deconvoluted MS spectra obtained by top-down MS analysis of reduced CLR (A) and reduced C1q (B).

To validate the N-linked glycosylation at Asn124, we acquired an MS/MS spectrum on the ionic cluster centered on m/z 1061.63 (26+), corresponding to the C1q A chain (supplemental Fig. S3). The detected fragments of higher charge states (y173 ion) and highest intensities at m/z 1189.7425 (18+) and m/z 1259.9043 (17+) merged into the same monoisotopic mass of 21,386.24 mass units in the deconvoluted spectrum (Fig. 8). This value matched the C-terminal region of C1q A starting at residue Pro51 and bearing two O-glycosylations, four hydroxylations, and a monosialylated fucosylated biantennary N-glycan with a mass of 2059.74 mass units (theoretical mass, 21,387.28 mass units). In addition, the intense ion signals detected at m/z 1173.6829 (18+) and m/z 1242.6631 (17+), corresponding to a monoisotopic mass of 21,095.16 mass units, were assigned to the same C-terminal sequence, but the terminal sialic acid group was lost (Δm, −291.073 mass units). The detection of sialylated glycan chains is in agreement with a previous analysis indicating that 80% of Asn-linked glycans of C1q contain sialic acid (38). Whereas a minor N-glycan population with two sialic acid residues has been reported (38), we found that these glycans were substituted with a single sialic acid residue. The signal detected at m/z 883.4193 (7+) (b50 7+ ion) corresponded to the N-terminal sequence 1-50 (6173.879 mass units), observed previously during fragmentation of CLR C1q A and bearing three glycosylations and four hydroxylations, and was complementary to the C-terminal fragment at 21,387.24 mass units. As a whole, five glycosylations and eight hydroxylations, as well as one sialylated fucosylated biantennary N-glycan of 2059.74 mass units, were thus confirmed to be present on C1q A, accounting for a mass of 27,561.29 mass units. The minor fraction possessing four Lys# probably contains one of the two lysines, Lys78 and Lys81, in a hydroxylated form.
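As an illustration of how two charge states are merged into one species, the following sketch converts an (m/z, charge) pair into a neutral mass and checks that the 18+ and 17+ signals of the desialylated fragment agree. It is a minimal example, not the deconvolution software used in the study. It assumes protonated positive ions, and the helper name `neutral_mass` is hypothetical. Because the quoted m/z values may correspond to the most intense isotopic peak rather than the monoisotopic peak, the computed mass is not expected to equal the monoisotopic mass reported in the text.

```python
PROTON = 1.007276466  # mass of a proton, in mass units


def neutral_mass(mz, charge):
    """Neutral mass of a species observed as an [M + zH]z+ ion (hypothetical helper)."""
    return charge * (mz - PROTON)


# Desialylated C-terminal fragment of C1q A, two charge states from the text.
observations = [(1173.6829, 18), (1242.6631, 17)]
masses = [neutral_mass(mz, z) for mz, z in observations]
spread = max(masses) - min(masses)

for (mz, z), mass in zip(observations, masses):
    print(f"m/z {mz} ({z}+) -> {mass:.3f}")
print(f"spread between charge states: {spread:.3f}")
```

A small spread between charge states is what justifies assigning both signals to the same polypeptide.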
B Chain Characterization-The MS2 spectrum acquired on the ionic species around m/z 1102.41 (10+) is provided in Fig. 9, A and B. As reported above in the bottom-up analysis, preferential fragmentation was observed at the N terminus of proline residues. The sequence 1 %QLSCTG(PP)*AIP*GIP*GIP*GTPGPD 23 could be confidently determined from the MS2 spectrum, unambiguously identifying CLR C1q B. We could thus precisely match the B-1 polypeptide at mass 11,008.10 mass units, with 5.1-ppm error, to the sequence stretch 1-97 of C1q B (-----KATQKIAF) bearing five glycosylations and 12 hydroxylations and with a loss of NH3 at the protein N terminus due to conversion of the glutamine residue into pyrrolidone carboxylic acid (−17.0265 mass units).
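To give a sense of scale for the 5.1-ppm error and the −17.0265 shift quoted above, the sketch below converts a ppm tolerance into mass units and computes the monoisotopic mass of NH3. It is a hedged illustration only; the function names are hypothetical and the atomic masses are standard reference values, not figures from the paper.

```python
def ppm_error(measured, theoretical):
    """Relative mass error in parts per million (hypothetical helper)."""
    return (measured - theoretical) / theoretical * 1e6


def ppm_to_mass_units(mass, ppm):
    """Absolute mass deviation corresponding to a ppm error at a given mass."""
    return mass * ppm / 1e6


# Absolute deviation represented by 5.1 ppm at the mass of the B-1 species.
deviation = ppm_to_mass_units(11008.10, 5.1)

# Loss of NH3 on cyclization of the N-terminal glutamine (monoisotopic).
N, H = 14.0030740, 1.00782503
nh3 = N + 3 * H

print(f"5.1 ppm at 11,008.10 = {deviation:.3f} mass units")
print(f"NH3 = {nh3:.4f} mass units")
```

At roughly 11 kDa, an error of a few ppm corresponds to a few hundredths of a mass unit, which is far smaller than any of the modification masses being assigned.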
The ionic species around m/z 1104.02 (10+) (deconvoluted monoisotopic mass of 11,024.10 mass units) was also fragmented and led to a fragmentation pattern similar to that obtained for the ion at m/z 1102.41 (10+) (supplemental Fig. S4). As deduced from the mass difference between both 10+ species, they only differed by one hydroxylation, which could be placed within the sequence stretch 20-23. Indeed, a common y74 fragment was detected at 983.67 (9+), whereas fragment y78 was either seen at 1024.245 (9+) or at 1026.135 (9+), thus definitely localizing the additional hydroxylation within the sequence 20 PGPD 23. Based on these results, we concluded that CLR C1q B molecules bear 11-14 hydroxylations, indicating a variable level of hydroxylation, as observed for CLR C1q A.
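The reasoning that two species "only differed by one hydroxylation" amounts to dividing a mass difference by 15.9949 and checking the remainder. A minimal sketch of that check is given below, using the two deconvoluted B chain masses from the text. The function name and the 0.05 mass unit tolerance are assumptions chosen for illustration.

```python
HYDROXYLATION = 15.9949  # monoisotopic mass of one added oxygen


def hydroxylation_difference(mass_a, mass_b, tol=0.05):
    """Number of hydroxylations separating two masses, or None if the
    difference is not a whole multiple within `tol` (hypothetical helper)."""
    delta = mass_b - mass_a
    n = round(delta / HYDROXYLATION)
    residual = delta - n * HYDROXYLATION
    if abs(residual) > tol:
        return None
    return n


# Deconvoluted monoisotopic masses of the two fragmented CLR C1q B species.
n_extra = hydroxylation_difference(11008.10, 11024.10)
print(n_extra)
```

The same check rejects differences that are better explained by another modification, such as a 324.1 mass unit disaccharide.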
In addition to 8+ and 9+ fragments, the two fragmented polypeptides at m/z 1102.41 (10+) and m/z 1104.02 (10+) produced common 1+ and 2+ fragments at m/z 535.3027 (2+), m/z 854.3668 (b9+ ion), m/z 967.4524 (b10+ ion), m/z 1250.6036 (b13+ ion), and m/z 1533.75 (b16+ ion). These MS2 fragments were then selected to be fragmented in MS3. Table II lists the peptide sequences determined from MS3 scans, and the manually interpreted spectra are provided as supplemental data. Fragmentation of b ions validated the hydroxylation of Pro8, Pro11, and Pro14. Residue Pro7 appeared as possibly hydroxylated from the MS2 fragmentation spectrum obtained on m/z 1250.6036 (1+); however, this modification would be unusual given its location at position 2 within a GXZ triplet. The doubly charged ion at m/z 535.3027 corresponded to the y9 2+ ion, validating the C-terminal region of the CLR C1q B B-1 species as follows: 89 YKATQKIAF 97.
C Chain Characterization-Due to the occurrence of Phe/Gln/Tyr amino acids between residues 86 and 94 of the C1q C chain, several polypeptides could be expected to be generated from cleavage of this chain by pepsin. MS/MS fragmentation performed on the ions around m/z 1314.23 (8+) (monoisotopic mass, 10,516.843 mass units) and m/z 1258.71 (8+) (monoisotopic mass, 10,055.616 mass units) allowed identifying the same sequence, 1 NTGCYGIP*GM 10, based on 7+ fragments (supplemental Figs. S5 and S6, respectively), thus unambiguously attributing this species to a CLR C1q C polypeptide.
The MS/MS spectra acquired on species around m/z 1314.23 (8+), m/z 1316.38 (8+), m/z 1256.71 (8+), and m/z 1258.71 (8+) allowed us to detect peptides at charge states between 1+ and 3+, which were selected for MS3 fragmentation followed by MS4 when possible (supplemental data). Table II Fig. S7) unambiguously revealed the heterogeneous hydroxylation of biomolecules corresponding to that mass. Indeed, this ion fragmented into two y18 2+ ions at m/z 1058.51 (corresponding to 77 P*GEP*GEEGRYKQKFQSVF 94) and m/z 1050.01 (corresponding to 77 (PGEP)*GEEGRYKQKFQSVF 94) as well as two y24 2+ ions containing three (71 P*GPMGIP*GEP*GEEGRYKQKFQSVF 94) or only two hydroxylated proline residues. The detection of these two pairs of y ions, with one or two hydroxylations in the sequence 77 PGEP 80, indicated the presence of counterbalancing proline residues being either free or hydroxylated within the 1-70 stretch of the fragmented CLR C1q C chain.
Outcome of Top-down Analysis-The top-down analysis of the collagen-like regions of C1q indicated two sites of pepsin cleavage in C1q A (after Phe95 and Ala97), one site in C1q B (after Phe97), and four sites in C1q C (after Phe90, Ser92, Val93, and Phe94). Top-down analyses of CLR and intact reduced C1q allowed identification of the PTMs decorating each chain (summarized in Fig. 3), some of which (e.g. the glycosylation of Lys10/Lys11 in C1q A) had not been detected by bottom-up analyses. The complexity of the hydroxylation patterns was further illustrated: in particular, fragmentation of CLR C1q C at m/z 1316.38 (supplemental Fig. S7) showed that a biomolecule at a given mass can actually exhibit different combinations of hydroxylation sites. As a whole, the signals detected on intact CLR chains indicated that C1q A contains five Lys# and 6-9 hydroxylations (for biomolecules at detectable levels), C1q B bears five Lys# and 11-14 hydroxylations, and C1q C contains three Lys# and 12-15 hydroxylations.

DISCUSSION

We report here for the first time the proteomics analysis of the human complement protein C1q, which is composed of 18 chains from three different polypeptides and exhibits a unique structural organization comprising a collagen-like domain with a high level of post-translational modifications: hydroxylations on Lys/Pro residues, Glc-Gal disaccharides on Lys residues, and a branched N-linked sugar moiety on the globular moiety of the A chains.
Clearly, identifying the lysine residues modified by Glc-Gal motifs was not trivial. It is worth mentioning that, unlike for the common N-glycosylation, no known enzyme allows removal of these disaccharides and that attempts to chemically eliminate these modifications led to disruption of the polypeptide chains (data not shown). The fact that the loss of sugar units (162.05 mass units) was favored during MS/MS, together with the inefficient fragmentation of the peptide backbone, rendered sequence determination by the Mascot software difficult. Nonetheless, we could check from OTMS2 analyses that the high mass accuracy (<5 ppm) of the LTQ-Orbitrap provided confident identification of glycosylated sequences from chains A, B, and C. Additionally, we performed LC-MS/MS analyses with MS3 scans triggered on the heaviest neutral loss of sugar moieties (multiples of 324.1 mass units). This method appeared to be efficient in producing pairs of MS2/MS3 spectra whose precursors differed by the total number of glycosylation motifs present on the initial peptide (usually 1# and 2#). The obtained MS3 scans allowed the amino acid sequence to be read with much more confidence than the corresponding MS2 scans. Such a method programming MS3 scans triggered on the heaviest sugar loss would be of general interest when studying proteins that contain a collagen-like domain exhibiting glycosylated lysine residues. This method may be more generally applicable to the study of labile oligosaccharide-type modifications. We combined bottom-up and top-down analyses of the C1q and CLR samples to obtain complementary information on the PTMs decorating the three C1q chains. A truly exhaustive characterization of the variable modification level of the individual chains would have required separation of the intact proteins by LC to systematically acquire MSn information from each species by off-line infusion (39).
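The selection rule described above (trigger MS3 on the heaviest neutral loss that is a multiple of 324.1 mass units) can be sketched as a simple peak search. This is an illustrative reconstruction under stated assumptions, not the acquisition method file or instrument logic used in the study: the function name, the tolerance, the maximum number of sugar units, and the example peak list are all hypothetical, and the fragment is assumed to keep the precursor charge after the neutral loss.

```python
GLC_GAL = 324.1056  # monoisotopic mass lost per Glc-Gal disaccharide


def heaviest_sugar_loss(precursor_mz, charge, ms2_peaks, max_units=3, tol=0.02):
    """Return (n_units, peak_mz) for the largest Glc-Gal neutral loss seen
    in an MS2 peak list of (m/z, intensity) pairs, or None if none is found."""
    # Search from the largest loss downward so the heaviest loss wins.
    for n_units in range(max_units, 0, -1):
        expected_mz = precursor_mz - n_units * GLC_GAL / charge
        matches = [p for p in ms2_peaks if abs(p[0] - expected_mz) <= tol]
        if matches:
            most_intense = max(matches, key=lambda p: p[1])
            return n_units, most_intense[0]
    return None


# Hypothetical doubly charged glycopeptide carrying two Glc-Gal motifs.
example_peaks = [(737.947, 50.0), (575.895, 80.0), (650.300, 10.0)]
print(heaviest_sugar_loss(900.0, 2, example_peaks))
```

Fragmenting the product of the heaviest loss means the MS3 precursor has shed all of its disaccharides, which is consistent with the observation that MS3 scans were easier to read as sequence than the corresponding MS2 scans.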
Nevertheless, our combined bottom-up and top-down data were sufficient to identify Pro/Lys residues that are fully hydroxylated (or glycosylated) in all C1q molecules (given their systematic detection in a modified form in LC-MS/MS analyses) and others that remain unmodified (or solely hydroxylated) in a fraction of the biomolecules. A variable level of hydroxylation at specific Pro/Lys residues was described previously for other collagen-containing proteins (40-43). In our case, top-down analyses revealed that proteins with the same sequence can bear between N and N + 4 hydroxylation motifs (with N being 6, 11, and 12 for the C1q A, B, and C chains, respectively). They also showed that proteins with the same sequence and mass can exhibit different distributions of hydroxylation sites, thus highlighting a further level of PTM pattern complexity.
The determination of the primary structure of C1q, initially carried out by proteolytic digestion and Edman sequencing, yielded the sequences of the C1q A and B chains and that of the 94 N-terminal residues of the C1q C chain (22) and resulted in incomplete identification of the PTMs in the CLR moieties of the chains. The proteomics study reported here allowed us to cover the sequences of the whole C1q A, B, and C chains and to verify the cDNA-derived sequences, confirming that the few discrepancies noticed with the initial protein-derived sequence do not arise from polymorphism. In addition, we made the most of the analytical capabilities of the LTQ-Orbitrap instrument (acquisition of MS2 and sugar loss-based MS3 scans in bottom-up analyses and of iterative MSn fragmentations on intact proteins in top-down analyses) to confidently and comprehensively identify glycosylated residues. The large majority of the identified Glc-Gal-bearing hydroxylysines appear to be fully modified (Lys10 or Lys11, Lys26, and Lys45 in C1q A; Lys32, Lys35, Lys50, and Lys71 in C1q B; and Lys29, Lys44, and Lys47 in C1q C), yet a few were also detected as being either glycosylated or hydroxylated (Lys78 or Lys81 in C1q A and Lys83 in C1q B). Finally, only three lysine residues within the CLR have been systematically and unambiguously identified in a non-modified form, i.e. residues Lys59 in C1q A, Lys61 in C1q B, and Lys58 in C1q C. This identification is consistent with the early analyses reported by Reid (22) and represents highly meaningful information with respect to the assembly of the C1 complex.
Given the important function of the C1 complex in the immune system and its role in the triggering of complement-mediated inflammation, the understanding of its assembly is key information that, in addition, can be extended to the collectins mannan-binding lectin (MBL) and ficolins, two other important classes of pattern recognition molecules. These collectins, which trigger complement through the lectin pathway, share with C1q the ability to associate in a homologous manner with their partner proteases, the MBL-associated serine proteases, through their collagen domain (44). Point mutations of recombinant MBL and ficolins have revealed the essential role of a single unmodified lysine residue in their collagen domains for the association with the MBL-associated serine proteases (45,46). Unlike MBL and the ficolins, which are assembled from a single polypeptide chain and thus have been produced recombinantly, C1q could not so far be studied by site-directed mutagenesis to identify the CLR residues involved in the assembly of the C1 complex. Nevertheless, point mutants of C1r and C1s have been produced, providing evidence for the essential role of Asp and Glu residues within the C1r/C1s CUB modules in the interaction of the C1r/C1s tetramer with C1q, likely through ionic bonds. Given the homologous assembly of the collectins MBL/ficolins and C1q, we postulate that the free residues C1q A-Lys59, C1q B-Lys61, and C1q C-Lys58 are fully accessible and available for interaction with acidic residues contributed by the C1r and C1s CUB modules. In support of this proposal, the location of these three unmodified lysine residues about halfway along the CLR is in agreement with the recently proposed refined model of C1 assembly in which the C1r/C1s tetramer is positioned inside the cone defined by the C1q stems (6).
According to this model, both C1r/C1s CUB1-EGF-CUB2 heterodimers provide six binding sites distributed radially to make contacts with each of the six C1q collagen-like stems. As they are located halfway along the CLR, these unmodified lysine residues are thus in an appropriate position for making individual contacts with the CUB modules and mediating effective interaction with the C1r/C1s tetramer (Fig. 10).
The MS characterization reported here is therefore fully consistent with the proposed model for the C1q-C1r/C1s tetramer assembly (6), accounting for the architecture and function of the C1 complex. Recent data reveal that the role of C1 extends beyond pathogen recognition to include implications in autoimmune diseases, ischemia-reperfusion injury, organ graft rejection, and neurodegeneration (1,47). It is therefore generally considered that the early inhibition of the complement cascade at the C1 level would provide a therapeutic benefit (48,49). In this context, the experimental data reported here on the interface between C1q and its C1r/C1s protease partners provide useful clues for the design of inhibitory molecules aimed at targeting the C1 complex.