Analysis of Automatically Generated Peptide Mass Fingerprints of Cellular Proteins and Antigens from Helicobacter pylori 26695 Separated by Two-dimensional Electrophoresis*

Helicobacter pylori is a causative agent of severe diseases of the gastric tract ranging from chronic gastritis to gastric cancer. Cellular proteins of H. pylori were separated by high resolution two-dimensional gel electrophoresis. A dataset of 384 spots was automatically picked, digested, spotted, and analyzed by matrix-assisted laser desorption ionization mass spectrometry peptide mass fingerprint in triple replicates. This procedure resulted in 960 evaluable mass spectra. Using a new version of our data analysis software MS-Screener we improved identification and tested reliability of automatically generated data by comparing with manually produced data. Antigenic proteins from H. pylori are candidates for vaccines and diagnostic tests. Previous immunoproteomics studies of our group revealed antigen candidates, and 24 of them were now closely analyzed using the MS-Screener software. Only in three spots minor components were found that may have influenced their antigenicities. These findings affirm the value of immunoproteomics as a hypothesis-free approach. Additionally, the protein species distribution of the known antigen GroEL was investigated, dimers of the protein alkyl hydroperoxide reductase were found, and the fragmentation of γ-glutamyltranspeptidase was demonstrated.

peaks is sufficient to identify minor components of spots, which could contribute to a higher coverage of the proteome by the 2-DE/MALDI-MS approach. In a first report we demonstrated the application of a software program named MS-Screener together with cluster analysis starting with an H. pylori dataset of 480 PMFs obtained by manual spot picking, digestion, and peak detection (16).
Here we present data from a procedure with automated spot picking, digestion, peak detection, and database search. For further evaluation we applied a new version of MS-Screener comprising elimination of contaminants, detection of neighbor spot contamination, cluster analysis, and identification of minor components of a spot. The antigenic proteins detected in spots identified in former investigations (10,11) were analyzed in detail to assure that the antigenicity is caused by the formerly identified protein and not by a minor component. As special cases a dimerization of alkyl hydroperoxide reductase (HP1563), the degradation pattern of GroEL, and the fragmentation of γ-glutamyltranspeptidase were elucidated.

EXPERIMENTAL PROCEDURES
H. pylori Cell Culture and Lysis-Bacteria were grown on agar plates containing 10 µg/ml vancomycin. After 3 days single clones were resuspended in 1 ml of brain heart infusion medium containing 10% fetal calf serum, and 20 µl of this suspension were grown for 2 days on vancomycin-containing agar plates under microaerobic conditions (5% O2, 10% CO2, 85% N2) at 37°C. Cells were transferred into 50 ml of cold phosphate-buffered saline containing Complete protease inhibitors (Roche Applied Science). After centrifugation at 3,000 × g and 4°C for 10 min, one wash step in 10 ml of phosphate-buffered saline containing protease inhibitor followed. The pellet of bacteria was diluted with a half-volume of distilled water and lysed by addition of urea, CHAPS, Servalyte, pI 2-4 (Serva, Heidelberg, Germany), and dithiothreitol to obtain final concentrations of 9 M, 1.4%, 2%, and 70 mM, respectively. The suspension was shaken for 30 min at room temperature. Insoluble components were separated by centrifugation at 100,000 × g for 30 min, and supernatants were stored at −70°C.
Two-dimensional Electrophoresis-H. pylori lysate proteins were separated using a 23 × 30-cm high resolution gel system with a resolution power of up to 5,000 spots (1,17). For the first dimension an ampholyte mix of pI 2-11 was used, no alkylation was performed, and the second dimension ranged from 5-130 kDa. For preparative gels 250 µg of protein were loaded, and the gels were stained with Coomassie Brilliant Blue (CBB) G-250 for 5 days (18).
In-gel Digest-In a large section of the 2-DE gel (20-80 kDa, whole pI range) 384 different spots with staining intensities ranging from very weak to very high were excised in triple replicates. For this purpose a spot cutter (Proteome Works™, Bio-Rad) with a picker head of 2-mm diameter was used. Cut spots were transferred into 96-well microtiter plates. The tryptic digest with subsequent spotting on a matrix-assisted laser desorption ionization target was carried out automatically with the Ettan™ spot-handling work station (Amersham Biosciences) using the following protocol. The gel pieces were washed twice with 100 µl of a solution of 50% CH3CN and 50% 50 mM NH4HCO3 for 30 min and washed once with 100 µl of 75% CH3CN for 10 min. After drying at 37°C for 17 min 10 µl of trypsin solution containing 20 ng/µl trypsin (Promega, Madison, WI) was added and incubated at 37°C for 120 min. For extraction, gel pieces were covered with 60 µl of 0.1% trifluoroacetic acid in 50% CH3CN and incubated for 30 min at 40°C. The peptide-containing supernatant was transferred into a new microtiter plate, and the extraction was repeated with 40 µl of the same solution. The supernatants were dried at 40°C for 220 min. The dry residue was dissolved in 3 µl of 0.5% trifluoroacetic acid in 50% CH3CN, and 0.4 µl of this solution was directly spotted onto the matrix-assisted laser desorption ionization target. Then 0.4 µl of a saturated α-cyano-4-hydroxycinnamic acid solution in 70% CH3CN was added and mixed with the sample by aspirating the mixture five times. The samples were allowed to dry on the target for 10-15 min before measurement in MALDI-TOF.
MALDI-TOF Mass Spectrometry-The MALDI-TOF measurement was carried out on the 4700 Proteomics Analyzer (Applied Biosystems, Foster City, CA). This instrument is designed for high throughput measurement, being automatically able to measure the samples, calibrate the spectra, and process the data using the 4700 Explorer™ software. The spectra were recorded in a mass range from 900 to 3,700 Da with a focus mass of 2,000 Da. For one main spectrum 20 subspectra with 100 shots/subspectrum were accumulated using a random search pattern. If the autolytic fragment of trypsin with the monoisotopic (M + H)+ m/z at 2,211.104 reached a signal-to-noise ratio (S/N) of at least 10, an internal calibration was automatically performed as one-point calibration using this peak. If the automatic mode failed, manual calibration was applied. After calibration peak lists were created by using the "peak-to-mascot" script of the 4700 Explorer™ software. Settings were a mass range from 900 to 3,500 Da, a peak density of 50 peaks/200 Da, a minimal area of 0, and a maximum of 200 peaks/spot. Three different peak lists were created for an S/N ratio of 5, 7, and 10, respectively. For confirmation of selected peaks MALDI-TOF/TOF spectra were recorded manually.
Database Searches-Identification of spots was done via batch mode using the Mascot protein identification system (Matrix Science Ltd., London, UK) in-house applying the recent H. pylori 26695 protein database downloaded from The Institute for Genomic Research (TIGR, www.tigr.org/). Optimal search parameters were 30 ppm peptide mass tolerance, fixed oxidation of methionine, and 1 missed trypsin cleavage. The criterion for reliable identification was a significant Mascot score >45 (p < 0.05) (19).
Data Analysis with MS-Screener-To realize an iterative data analysis (16) for large datasets we have developed a new MS-Screener version using the Java 2 standard edition 1.4.1 software development kit (J2SE1.4.1 SDK, java.sun.com/). This Java tool consists of 126 different program classes and was integrated in a user-friendly graphical user interface (GUI). To integrate a plot view the JFreeChart class library, version 0.9.8, was applied (www.jfree.org/jfreechart/index.html). The software runs under LINUX, Solaris, and Microsoft Windows and comprises a setup function for all operating systems. MS-Screener has the ability to import different ASCII file types like .pkm (GRAMS), .pkt, .txt (Data Explorer), and .dta (SEQUEST). Contaminant searches, calculation of half-decimal places, elimination of contaminants, screening of common masses, their rankings, and the generation of matrices to realize hierarchical agglomerative cluster analyses using R (www.r-project.org) can now easily be calculated in one work set. To find common contaminants in the complete dataset a mass tolerance (interval width) of 30 ppm and a threshold of 5% were applied. Masses that exceeded this threshold were eliminated from the peak lists. To calculate the half-decimal place rule an absolute standard deviation of 0.12 Da was applied, and outlier masses were marked and extracted into a separate table. Another function of the MS-Screener allowed us to generate binary or non-binary interval matrices, which include all intensity values of peak intervals as zero/one or real intensity counts, respectively. In the present study we used an interval width of 30 ppm, and about 1,600 intervals were calculated based on 384 spectra for one gel. Using these matrices hierarchical agglomerative cluster analyses were performed using the statistical programming environment R (www.r-project.org).
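The interval matrix and the contaminant screening described above can be illustrated with a minimal Python sketch. It is not the MS-Screener implementation (a Java tool); the function names, the fixed logarithmic grid, and the grid origin at m/z 900 are assumptions made for illustration. Only the 30 ppm interval width and the 5% threshold are taken from the text.

```python
import math

def interval_index(mz, ppm=30.0, mz_min=900.0):
    """Map an m/z value to a bin on a grid whose bins are `ppm` wide (relative width).

    Assumption: a fixed logarithmic grid starting at mz_min; the source does not
    say how MS-Screener places its intervals.
    """
    return int(math.floor(math.log(mz / mz_min) / math.log1p(ppm * 1e-6)))

def binary_interval_matrix(spectra, ppm=30.0):
    """Return (sorted occupied bins, one 0/1 row per spectrum)."""
    binned = [{interval_index(mz, ppm) for mz in peaks} for peaks in spectra]
    bins = sorted(set().union(*binned))
    rows = [[1 if b in occupied else 0 for b in bins] for occupied in binned]
    return bins, rows

def contaminant_bins(spectra, ppm=30.0, threshold=0.05):
    """Bins occupied in at least `threshold` (fraction) of all spectra."""
    bins, rows = binary_interval_matrix(spectra, ppm)
    n_spectra = len(rows)
    return [b for j, b in enumerate(bins)
            if sum(row[j] for row in rows) >= threshold * n_spectra]
```

With 960 spectra and a threshold of 0.05, a bin is flagged once it is occupied in at least 48 spectra. A fixed grid can split two masses that lie within 30 ppm of each other across adjacent bins, so a production tool would probably need to handle bin edges more carefully than this sketch does.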

RESULTS
Automation of the Identification Process-H. pylori lysate proteins were separated by high resolution 2-DE. A number of spots were selected from a large section of the gel ranging from 20-80 kDa and over the whole pI range (4.5-10). Staining intensities went from very low to very high. Triple replicates of 384 spots were automatically cut, digested, and measured by MALDI-MS. Automatic processing yielded 88% of spectra calibrated, and an additional 3% of spectra were calibrated manually. This procedure resulted in a dataset of 960 evaluable spectra; one such spectrum is shown in Fig. 1.
An effective identification of a large number of spectra can be done quickly using an automated system. However, search results of spectra with divergent quality are strongly influenced by the parameters applied. Therefore we tested different parameters to find out whether an optimum for our dataset exists. For this purpose nearly 6,000 database searches were performed with peak lists created with S/N for peak detection of 5, 7, and 10. Additionally, different fixed and variable modifications as well as peptide mass tolerances were used. The following parameters produced the highest identification rates: a peak detection with S/N of 7, fixed methionine oxidations, peptide mass tolerance of 30 ppm, and a maximum of one missed trypsin cleavage. Using this set of parameters 547 spectra (equaling 57%) were automatically identified. Due to the triple replicate dataset 75% of different spots were identified at least once.
The automated recording and one-point calibration (using the autolytic trypsin peptide m/z 2,211.104) of mass spectra yielded peak masses that allowed database searches with an optimal mass tolerance of 30 ppm. The error distribution showed a slight slope to negative errors for smaller masses. This could have been avoided by applying a two-point calibration. However, many spectra did not contain a second sufficiently intense autolytic peak of trypsin (e.g. m/z 1,045.6). The described error distribution remained stable over all measured spectra. An interesting finding was that possible doubly charged peaks occurred in rare cases of PMFs. Such peaks were conspicuous with regard to the half-decimal place rule, and isotopic patterns showed peak distances of m/z = 0.5. For example in spot 433d (HP0410) the peptide containing amino acids 31-49 (QQHNNTGESVELHFHYPIK) was found with high intensity as a single-charged peptide with m/z 2,278.1 and as a doubly charged peptide with m/z 1,139.5. This peptide contains three histidines as potential additional proton acceptors.
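The relation between the singly and doubly charged signals of the same peptide can be checked with a short calculation. The sketch below is illustrative only; the function names are hypothetical, and the proton mass and the 13C-12C mass difference are standard values that are not given in the text.

```python
PROTON_MASS = 1.007276    # Da, standard value (not stated in the source)
ISOTOPE_DELTA = 1.003355  # Da, 13C - 12C mass difference (standard value)

def mz_for_charge(mz_singly, charge):
    """Convert the m/z of an (M + H)+ ion to the m/z of the (M + zH)z+ ion."""
    neutral_mass = mz_singly - PROTON_MASS
    return (neutral_mass + charge * PROTON_MASS) / charge

def isotope_spacing(charge):
    """Approximate m/z distance between neighboring isotope peaks."""
    return ISOTOPE_DELTA / charge

if __name__ == "__main__":
    # Peptide 31-49 of HP0410, observed at m/z 2,278.1 as a singly charged ion.
    print(round(mz_for_charge(2278.1, 2), 2))
    print(round(isotope_spacing(2), 2))
```

For the peptide at m/z 2,278.1 this gives a doubly charged signal near m/z 1,139.55 and an isotope spacing of about 0.5, in line with the observed peak at m/z 1,139.5.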
To evaluate the reliability of the automated identifications we searched for contradictions in identifications of replicate spots. Here only nine spots (3%) were found to be differently identified with significant scores within the three replicates. When looking closer into these contradictions three spots were found to be imprecisely picked in a densely "populated" area. Two spots contained minor components caused by smearing from neighboring spots that were erroneously identified as major components. Furthermore, three spots contained mixtures of proteins where either one of them was identified as the major component. Only one spectrum was erroneously assigned to a protein.
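The search for contradictions among replicates can be sketched as follows. This is a minimal illustration and not the procedure actually used; the record layout (spot, replicate gel, locus, score) and the function name are assumptions, whereas the significance cutoff of a Mascot score >45 is taken from the text.

```python
def conflicting_spots(identifications, min_score=45.0):
    """Return spots whose significant identifications disagree between replicates.

    identifications: iterable of (spot_id, replicate, locus, mascot_score).
    Only hits with a score above min_score are considered.
    """
    significant = {}
    for spot_id, _replicate, locus, score in identifications:
        if score > min_score:
            significant.setdefault(spot_id, set()).add(locus)
    return {spot: sorted(loci) for spot, loci in significant.items() if len(loci) > 1}

if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    hits = [
        ("s1", "A", "HP0010", 120.0),
        ("s1", "B", "HP0010", 98.0),
        ("s2", "A", "HP1563", 80.0),
        ("s2", "B", "HP0875", 60.0),
    ]
    print(conflicting_spots(hits))
```

Spots flagged in this way still require inspection, because a conflict may reflect imprecise picking or a protein mixture rather than a wrong database assignment.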
Another way to evaluate the automated procedure was to compare identification data with manually produced data using a Voyager Elite (Applied Biosystems) (20). Sequence coverages and Mascot scores of 28 arbitrarily chosen spots with medium to low staining intensities were compared (Table I). For this purpose automatically acquired (S/N 7, contaminants removed) and manually measured peak lists (manual peak detection, contaminants also removed (16)) were compared using similar database searches with the Mascot protein identification system. Search parameters were similar apart from the peptide mass tolerances (automatic, 30 ppm; manual, 100 ppm) and possible methionine oxidations (automatic, fixed; manual, variable). For medium staining intensities both methods obtained comparable results. For the more interesting weakly stained spots, however, most spots showed more matched peaks, higher sequence coverages, and higher Mascot scores using the manual procedure.

in %), and Mascot scores of spots with medium and low CBB G-250 staining intensities from automatic (best result of the three replicates) and manual measurements
Database searches were automatically performed with the Mascot protein identification system against The Institute for Genomic Research (TIGR) H. pylori 26695 database using contaminant-removed data (for manual data, see Ref. 16). Parameters for automatic and manual data were 30 and 100 ppm, respectively, and fixed and variable methionine oxidations, respectively. TopSpot IDs refer to manually identified spots in our database (www.mpiib-berlin.mpg.de/2D-PAGE/). The number of peaks is given as matched to the protein/total number of peaks in the spectrum (except for the removed contaminants). Searches in which a mix of two proteins was found are marked with *; in these cases the total number of peaks was reduced by the number of peaks matching to the second protein in the mix. Note that sequence coverages of both datasets can be improved by searching with more possible modifications.

Improvement of Identification, the Search for Spot Contaminants, and Additional Spot Components-To improve identification and detect minor spot components in large datasets we developed the iterative data analysis software MS-Screener (16). The new version integrates tools for contaminant search and removal, calculation and plotting of the decimal places of masses, and calculation of interval matrices to perform clusterings in R (Fig. 2). Here a threshold of 5% (mass occurred in ≥48 of 960 evaluable spectra) was used to define contaminant peaks in the dataset (Fig. 3 and Table II). Sixty-one masses were found to be contaminants; 12 of these were trypsin autolytic peaks, and 12 were matrix cluster peaks (α-cyano-4-hydroxycinnamic acid, Na+, and K+ clusters were outliers of the half-decimal place rule; cluster masses were calculated as described by Keller and Li (21)). Four peaks were unknown outliers of the half-decimal place rule, three peaks belong to the most intense peaks of GroEL, seven were erroneously labeled isotope peaks because of low peak intensities, and the remaining peaks were unknown. It is important to notice that no keratin peaks were found. After removing the 5% most frequently occurring masses in the dataset the identification of spots was improved by 3% to 78% of all spots to be identified at least in one gel (using optimal parameters). A list of all identified spots is found in the supplemental table.
As expected after removal of contaminants spot identification was improved for most spots except for very intensely stained ones; three masses of GroEL were misleadingly defined as contaminants (Table II). Even though this most widely spread protein was identified in 15 different spots only (3.9% of spots) these three peptide masses occurred in more than 6% of the spectra.

Table II are marked with an arrow. All outlier masses are listed underneath and can be removed from the dataset or exported in ASCII format.

Protein Species in the 2-DE Gels-Expression products of many ORFs appear modified as different protein species in the form of several spots within one 2-DE gel. Examples of this phenomenon are the following identified proteins: translation elongation factor EF-Tu (HP1205, four spots), catalase (HP0875, five spots), alkyl hydroperoxide reductase (HP1563, eight spots), urease α subunit (HP0073, nine spots), and chaperone and heat shock protein GroEL (HP0010, 15 spots). On average we found 1.6 different protein species/ORF in our dataset. With 15 different spots GroEL occurred most frequently (Fig. 4, left). Interestingly, three groups of GroEL spots occurred in the 2-DE gels: a main spot group (five spots), one group with lower MW and more acidic pI (two spots), and one group with lower MW and more basic pI (six spots). Evidence was found that the second group is N-terminally truncated because one peptide found in the main spot (amino acids 13-20) was not found in the spectra of this spot group. The third group we assume to be C-terminally truncated GroEL protein species because seven peptides (comprising amino acids 425-522) were not found in the spectra of these spots. In-silico calculated and apparent (according to gel position) MWs and pIs were in good agreement (data not shown).
Even more than the 15 identified spots contained peptides from GroEL (Table III). When we searched all peak lists for the most intense peak from the GroEL main spot (m/z 1,595.9) 33 different spots were found. Using the more rigid criterion to contain at least three of the five most intensive peaks from GroEL (m/z 1,595.9, 947.5, 1,867.8, 1,488.7, and 1,401.6) 24 spots still were found. Three of these five peaks also belong to the contaminants list (1,595.9, 1,867.8, and 1,488.7), i.e. they were found in 7, 6, and 6% of the spectra, respectively. The other two were found in 4% of the spectra and therefore not considered as contaminants. GroEL peptides were distributed widely in the gels; however, the distribution differed from gel to gel. In gels A, B, and D we found 21, 12, and 13 spots that contained three of the five GroEL peptides, respectively (Table III).
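The screening criterion of at least three of the five most intense GroEL peaks can be sketched as a simple peak list filter. The code below is not the MS-Screener routine; the function names and the absolute matching tolerance of 0.1 Da are assumptions (the text does not state the tolerance used for this screening), and the spot labels in the usage example are hypothetical.

```python
# Five most intense peaks of the GroEL main spot, as listed in the text.
GROEL_TOP5 = (1595.9, 947.5, 1867.8, 1488.7, 1401.6)

def count_marker_hits(peaks, markers=GROEL_TOP5, tol_da=0.1):
    """Count marker masses that have at least one peak within tol_da (assumed tolerance)."""
    return sum(1 for m in markers if any(abs(p - m) <= tol_da for p in peaks))

def spots_with_markers(peak_lists, min_hits=3, markers=GROEL_TOP5, tol_da=0.1):
    """Return the sorted spot labels whose peak lists contain >= min_hits marker masses."""
    return sorted(spot for spot, peaks in peak_lists.items()
                  if count_marker_hits(peaks, markers, tol_da) >= min_hits)

if __name__ == "__main__":
    # Hypothetical peak lists, for illustration only.
    peak_lists = {
        "x1": [947.52, 1401.58, 1595.93, 2211.10],
        "x2": [1595.88, 1867.83],
    }
    print(spots_with_markers(peak_lists))
```

Setting min_hits to 1 and restricting the markers to m/z 1,595.9 corresponds to the less strict search for the single most intense peak.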
Spot 312 was identified to be alkyl hydroperoxide reductase (HP1563); however, the apparent molecular weight of the spot position in the gel was about double the weight of the protein. We therefore assumed this spot to contain dimers of the protein. By comparing the PMFs (Fig. 5) of this spot with those of the protein's main spot (413), differences in sequence coverage were seen. Both cysteine residues of the protein (amino acids 49 and 169) were not covered in spot 312; however, both were found in the main spot. First, amino acids 44-58/59 modified with propionamide (1,816.8), one missed cleavage as well as propionamide (1,973.0), and an additional methyl ester (1,987.0) were found in the PMF. Methyl ester formation is characteristic for our CBB G-250 staining in methanol and occurs frequently. Second, amino acids 155-174 with oxidized methionine and propionamide modification (2,383.1) were found, too. The sequences of peaks 1,973.0 and 2,383.1 were confirmed by MALDI-TOF/TOF. Because of significantly lower peak intensities in the PMF of spot 312 (staining intensity was also much lower), three of the peaks from spot 413 might not be detectable in spot 312; the peak m/z = 1,973.0, however, was more intense than the peak m/z = 1,649.8, which was found to be intense in both spectra. From this it follows that the cysteine 49-containing peptide was not seen in spot 312 and may therefore be involved in a disulfide bond formation to link the dimers. Because the peak of the second cysteine-containing peptide may be lost in the noise in spot 312, it cannot be distinguished whether homo- or heterodimers of the protein are formed.
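The mass shifts behind these peak assignments can be reproduced with a short calculation. The sketch below is illustrative; the helper name is hypothetical, and the monoisotopic mass increments are standard values that are not given in the text.

```python
# Standard monoisotopic mass increments in Da (not stated in the source).
PROPIONAMIDE = 71.03711  # C3H5NO added to cysteine by acrylamide
METHYL_ESTER = 14.01565  # CH2 added by methyl esterification
OXIDATION = 15.99491     # O added to methionine

def shifted_mz(mz, *deltas):
    """m/z of a singly charged peptide after adding the given mass increments."""
    return mz + sum(deltas)

if __name__ == "__main__":
    # Peptide 44-59 with propionamide is reported at m/z 1,973.0;
    # adding a methyl ester should give the reported m/z 1,987.0.
    print(round(shifted_mz(1973.0, METHYL_ESTER), 1))
```

The same helper can be used in reverse, with negative increments, to estimate the m/z expected for the unmodified peptide.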
A pair of spots with different molecular masses and the same identification (γ-glutamyltranspeptidase, Ggt, HP1118) were spots 347 and 494. By comparing the sequences covered by the PMFs (Fig. 6) these spots appeared to be two fragments of the protein whose sequences were mutually exclusive. In-silico MW and pI calculation of the protein fragments with assumed cleavage at amino acid 370 resulted in similar values compared with the spot positions. Spot 347 is positioned at pI/MW coordinates 9.0/40.0, and amino acids 1-370 were calculated to have 9.5/39.8. For spot 494 we found 6.7/20.0 according to position, and 6.3/21.0 was calculated for amino acids 371-567. A spot containing the whole protein (theoretical mass of 61.2 kDa) was not found in the gels. Therefore, we assume that the entire γ-glutamyltranspeptidase content of the cell is processed into two subunits.

Matrix peaks were assigned according to calculated α-cyano-4-hydroxycinnamic acid clusters (21) and trypsin peaks as described previously (16,23,27). HDPR outliers (italic) were masses that did not follow the half-decimal place rule. The three peaks assigned to GroEL belong to the most intense peaks of the spectrum of the GroEL main spot. Isotope peaks were erroneously labeled by the automatic algorithm of peak-to-Mascot when peak intensities were very low. Note that no keratin peaks were found. HDPR, half-decimal place rule.

Exploration of H. pylori Antigens-Immunoproteomics is a method where 2-DE blots are incubated with antibodies, e.g. with human sera. Spots that are recognized can be identified using MALDI-MS. However, spots may contain minor components from other proteins or protein species that could have been recognized by highly specific antibodies instead of the main component. With the iterative procedure using MS-Screener and hierarchical clusterings we tested 24 H. pylori 26695 antigens known from previous studies to be differently recognized by patients suffering from diseases caused by H. pylori (gastritis, duodenal ulcer, and gastric carcinoma) or antigens known to be protective against H. pylori challenge in mice (Table IV) (10,11,22). Six antigens did not contain any reproducible peaks that were not assigned to the identified protein. With respect to the sensitivity of our method these spots can be assumed to be free of minor components. Nine spots contained peaks that could not be assigned to another protein; they may originate from modified peptides, unspecific cleavages, or unknown minor components. In the remaining nine spots peaks supposedly originating from a different protein were found (six of which were from a neighbor spot <1 cm apart). Apart from HP1533 and HP0380 all these minor components are known to be antigenic and were therefore further investigated as to whether they may have influenced the antigenicities of these spots.
In immunoblots incubated with sera of H. pylori-infected patients we explored whether the spots with possible antigenic minor components were recognized concurrently with the main spots of these components. Evidence was found that antigen recognition of three spots might have been influenced by the minor component (spots 154, 278, and 279). For the other four spots no evidence for such an influence was found.
The antigenic protein GroEL (HP0010) was identified in 15 different spots in our dataset (see above and Fig. 4). Interestingly all of these spots were conjointly recognized by human sera from H. pylori-infected individuals. Searching the immunoblots for recognition of the 24 spots that contained three of the five most intense GroEL peaks (Table III) we found that all apart from three (spots 313, 314, and 372) were recognized conjointly by antibodies in human sera.
Completion of the Proteome of H. pylori 26695-Another aspect of this study was the continuation of the proteome exploration of H. pylori. Here we identified 298 spots (78% of the spots measured), which represent 183 different ORFs. Twenty-four of these ORFs have not been identified before as compared with the dataset of our group to be published (Table V). Among these, four spot identifications conflict with the manual results presumably because of spot-picking tolerances in densely spotted areas or because spots contain protein mixtures.

DISCUSSION
For this study an automatically generated dataset was used to compare identification results with our manual procedure, to exhaustively investigate protein distributions in 2-DE gels, and to affirm the immunoproteomics approach used to identify antigen candidates. Here we investigated a large dataset covering about two-thirds of visible spots (384 spots) of our CBB G-250-stained H. pylori 26695 2-DE gel. These spots were picked in triple replicates, digested, and measured automatically, and they resulted in a dataset of 960 evaluable spectra.
To take the most advantage of such a dataset it was shown to be helpful to optimize the peak detection and identification parameters. Not too many possible modifications should be used because Mascot scores will fall with increasing amounts of possible peptide masses. Even more important, however, is to take advantage of the recording of replicate datasets, which will further improve the rate of identification considerably. Performing searches in triple replicates, we were able to achieve an identification rate of 75% for spots that were finally identified in at least one gel.

Marked with x are spots in the corresponding gel(s) that contain at least three out of the five most intense GroEL peaks from the main spot (most intense peaks in descending order: 1,595.9, 947.5, 1,867.8, 1,488.7, and 1,401.6). Shaded in gray are spots that were unambiguously identified to contain GroEL as main component. The o means that this spot was identified to contain a protein different from GroEL (spot 154 was in gel A a mix of two proteins); others were not identified. The search was performed using MS-Screener with peak lists including contaminants.

A good approach to assess the reliability of automatic identifications is the search for contradictions in identifications of replicate spots. Such differences may be caused by spot-picking tolerances, incidental differences in the automatic procedure, or unsuited database search parameters. We found only 3% of spots to be inconsistently identified in the three replicate datasets. Many of these spots were positioned in densely spotted areas, and their inconsistent identification may therefore be more a problem of picking or of protein mixtures in spots than of erroneous database search results. Spots lying side by side and containing different proteins will merge into one another even when the protein concentration in the merging zone is below the detection limit of the staining. Small variances in spot picking can in such cases coincidentally cause different identifications for one spot. The same is true for spots that contain mixtures of proteins with similar concentrations. Only one of 298 identifications was incorrect, which shows that identification was highly reliable. Additionally, the use of the exclusive identification criterion of a significant Mascot score of 45 (for use of The Institute for Genomic Research (TIGR) H. pylori 26695 database, p < 0.05) appeared to be trustworthy. It was not necessary to consider sequence coverages or number of matched peptides.
The fact that only 3% of spots were inconsistently identified showed also the high reproducibility of spot patterns in our 2-DE gels.
An important aspect of this study was the comparison of automatic and manual procedures of data generation and identification. We chose 28 exemplary spots, which were identified automatically as well as manually (Table I). By comparing these results it became evident that differences for medium-stained spots were negligible, whereas differences between identification of faint spots were noticeable. After removal of contaminant peaks (discussed below) manually generated spectra of faint spots contained on average more peaks, and also more peaks were matched to the given protein. The same holds true for sequence coverages and Mascot scores. These results were probably caused by the fact that the manual procedure could be adapted for individual spots with low protein contents. It is quite evident that automatic procedures may not be adjusted to all the spots in a gel where protein contents differ by several orders of magnitude. Consequently, manual measurements and data analyses are still powerful means to investigate faint spots.

TABLE IV Antigenic spots tested for minor spot components
The spots listed are known antigens of H. pylori 26695 (10,11,22). Loci marked with # were identified to be HP0027 by Haas et al. (10) (spots lie in a very dense region). Unknown peaks are reproducible peaks (at least in two of three replicates) that were neither assigned to the main component of the spot nor to a protein close in the dendrogram cluster. None means that all reproducible peaks were assigned to the identified protein in this spot. Those marked with * are minor components that might have influenced the antigenicity of this spot. TIGR, The Institute for Genomic Research; hypoth., hypothetical.
Spot no. TopSpot ID TIGR locus Short name Minor components

We have developed the software MS-Screener, which not only is able to improve identification but also can be used for exhaustive data analysis. The new user-friendly graphical user interface allows the import of data in the form of ASCII files, calculates and removes contaminants, calculates and plots half-decimal places, screens and ranks for certain masses in spectra, and enables the generation of intervalized peak intensity matrices for further statistical analyses (Fig. 2). This tool was successfully utilized to improve identification, to analyze protein distributions, and to find minor components especially in H. pylori antigens as discussed below. The removal of contaminants resulted in an improvement of the identification rate by 3% to 78% of spots that were identified at least in one of the replicate gels. Sixty-one masses were found in ≥5% of the 960 spectra and were therefore defined to be contaminants, i.e. peaks that were not specific for a certain spot (Table II and Fig. 3). Contaminant masses were assigned to be matrix clusters, trypsin autolytic products, or peptides from the most frequently found protein GroEL. Seven masses were erroneously labeled isotope peaks. For peaks with low intensity the peak-labeling algorithm of peak-to-Mascot picked the more intense second isotopic peak instead of the monoisotopic. In these cases the monoisotopic masses were found in other spectra and appear also in the contaminants list. Although the source of the other contaminant masses is unknown, no keratin peaks were found. In our previous study (16), 69 contaminant masses were found in the comparable mass range of 900-3,500 Da in 480 manually acquired and analyzed PMFs, and 47 of these were assigned to keratins. In another recent study of 118 spectra (23), 71 contaminants in the range of 900-3,500 Da were found in ≥5% of spectra, and 53 of these were keratins.
These results show that although a comparable number of contaminants were found the use of fully automated spot picking, digesting, and spotting can be highly efficient to avoid contaminations with keratin.
The fact that one spot does not contain one protein but rather one protein can be distributed in several spots in the form of different protein species is well illustrated by the heat shock protein GroEL. This most widely distributed protein in our dataset was identified in 15 different spots (Fig. 4, left). Within these spots evidence was found that two were N-terminally truncated and that six were C-terminally truncated. The reasons for the exact spot positions within these groups (modifications, differences in lengths of truncations, or conformational differences) were not determined. A further nine spots were found to contain three of the five most intense peptide masses of GroEL (Table III) and may therefore most likely contain low amounts of GroEL. In six spots GroEL was a minor component because they were identified to contain a different protein, and in three spots (not identified) this protein may be a minor or major component. These findings raise the question as to whether minor components represent co-migrating proteins, e.g. by protein-protein interactions during electrophoresis or in vivo, or represent just contaminations. It is important to note that the GroEL peptide distribution was not fully reproduced within the three replicates. This might be caused by differences among the gel runs, or it might be a consequence of the low GroEL content within these spots so that in some cases these peptides might have fallen below the detection limit. Another possibility is that the criterion to find at least three out of the five most intense GroEL peaks was not stringent enough and that not all of these spots truly contain this protein. According to the identification results, on average 1.6 spots/ORF were found in our dataset.
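The screening criterion described above (at least three of the five most intense peptide masses of a protein present in a spectrum) can be sketched as follows. This is an illustrative sketch, not the MS-Screener code: the function names and the mass tolerance are assumptions, and the marker masses must be supplied by the user (for GroEL, the values listed in Table III).

```python
def count_marker_hits(peaks, marker_masses, tol=0.2):
    """Count marker masses that have at least one peak within tol (Da, assumed)."""
    return sum(any(abs(p - m) <= tol for p in peaks) for m in marker_masses)

def screen_spots(spectra_by_spot, marker_masses, min_hits=3, tol=0.2):
    """Return IDs of spots whose spectrum contains at least min_hits marker masses.

    spectra_by_spot: dict mapping a spot ID to its list of peak masses (Da).
    marker_masses: the most intense peptide masses of the protein of interest.
    """
    return [spot for spot, peaks in spectra_by_spot.items()
            if count_marker_hits(peaks, marker_masses, tol) >= min_hits]
```

Raising `min_hits` or lowering `tol` makes the screen more stringent, at the cost of missing spots in which some marker peptides fall below the detection limit.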
The protein alkyl hydroperoxide reductase (HP1563) was found in eight different spots. According to the position in the gel one spot had an apparent molecular weight that was double the weight of the main spot of the protein. Evidence was found that this spot contained dimers of the protein because cysteine-containing peptides were not found in the dimer spot (Fig. 5). This finding raised the question of whether these dimers exist in vivo or were artifacts of the two-dimensional gel electrophoresis. Artificial dimerization during the run of the second dimension can be ruled out because there was no smearing to be seen on the gels. As dimerization has little effect on the pI it could have taken place during the first dimension when the active concentration of the reducing agent dithiothreitol decreased. Alternatively, dimers may have been formed in vivo, and the concentration of dithiothreitol in the sample buffer was not sufficient to reduce all disulfide bonds because only a small part of the protein content of the main spot, according to the staining intensities, was found to be dimerized. The fact that no dimers were found from other proteins supports the idea that dimerization could have taken place in vivo. This interpretation is also supported by the fact that other members of the peroxiredoxin family form homodimers or even decamers (24).
The protein γ-glutamyltranspeptidase (HP1118) was identified in two distinct spots, which were positioned far apart. The PMF-covered amino acid sequences of these spots were exclusive; their combined apparent masses added up to the theoretical mass calculated from the ORF so that we concluded that two fragments occurred (Fig. 6). Although both spots were only weakly antigenic in our immunoblots (11) the protein is known to be a virulence- and apoptosis-inducing factor of H. pylori that occurs in the form of two fragments (25,26). Additionally, it was hypothesized that the protein is membrane-associated (25), and here the first 36 amino acids were not covered in the PMFs so that a cleavage of a signal sequence might have occurred.
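The reasoning behind the two-fragment conclusion (mutually exclusive sequence coverage and apparent masses that sum to the theoretical mass) can be expressed as a simple consistency check. The sketch below is illustrative only: the function names and the relative mass tolerance are assumptions, and the residue ranges and masses have to be taken from the PMF results and the gel.

```python
def ranges_overlap(ranges_a, ranges_b):
    """True if any inclusive residue range in ranges_a overlaps one in ranges_b."""
    return any(a_start <= b_end and b_start <= a_end
               for a_start, a_end in ranges_a
               for b_start, b_end in ranges_b)

def consistent_with_two_fragments(ranges_a, ranges_b, mass_a, mass_b,
                                  theoretical_mass, rel_tol=0.1):
    """Check exclusive PMF coverage and additive apparent masses.

    ranges_a, ranges_b: lists of (start, end) residue ranges covered in each spot.
    mass_a, mass_b: apparent masses of the two spots (same unit as theoretical_mass).
    rel_tol: assumed relative tolerance for gel-derived apparent masses.
    """
    exclusive = not ranges_overlap(ranges_a, ranges_b)
    additive = abs((mass_a + mass_b) - theoretical_mass) <= rel_tol * theoretical_mass
    return exclusive and additive
```

Because apparent masses read from a gel are imprecise, the tolerance should be chosen generously; the check supports but does not prove fragmentation.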
Not only can a given protein be found in several spots, but a spot can also contain several proteins in the form of protein mixtures (similar amounts of protein), as minor components, or in the form of neighbor spot contaminants. In immunoproteomics antibody recognition of proteins separated on 2-DE blots is detected. Because highly specific antibodies may recognize very small amounts of protein it cannot be ruled out that minor components of spots might be detected instead of the major component. Therefore, one has to be careful with the identification of such antigens. Here we closely investigated 24 known antigenic spots as to whether they contained minor components using MS-Screener and hierarchical clustering (Table IV). Nine spots possibly contain other components, six were supposedly free of such components, and a further nine contained unknown peaks. Of the nine spots first mentioned, seven contained peptide masses from known antigens. However, only three spots showed concurrent recognition of the spot and the main spot of its minor component in our immunoblots (11), so that in these cases the minor component might have had an influence on the antigenicity. Two of these spots contain major components that were also found in other "clean" antigenic spots. Consequently, only one protein (spot 278, protease HP1012) remains that could have erroneously been assigned to be antigenic.
As mentioned above, the antigenic protein GroEL was identified in 15 different spots, and three of the five most intense peaks were found in a further nine spots. Twenty-one of these spots were recognized conjointly in the immunoblots (see Fig. 4 for the 15 GroEL-identified spots). For these spots no evidence for differential antigenicities of different protein species was found.
In our recent study (11) we identified five different groups of patients by hierarchical clustering of immunoblot data. One criterion for the definition of two groups was the recognition of a spot cohort (spots 225, 226, 231, 232, 233, and 234), which was now identified to contain species of GroEL that are supposedly C-terminally truncated. Because GroEL is a highly conserved antigen and all of its known protein species were conjointly recognized by the sera of the patients, the biological relevance of these two patient groups remains unclear. Spot 154, which was a candidate for differential immunogenicity of different protein species of AtpA in the study mentioned above, was here identified to contain a mix of GroEL and AtpA. Several spots in this region (spots 154-157) lie side-by-side and contain either one of these proteins or mixtures of both so that in this case the identification that depends on spot assignment between immunoblots and 2-DE gels remains uncertain. A differentiation between GroEL and AtpA could be obtained by incubation of recombinant proteins with patient sera.
A dataset of 960 PMFs was used to compare automatic and manual data acquisition and to investigate protein distributions in 2-DE gels. Large datasets can quickly be generated and identified with automatic procedures. For this purpose it is highly recommended to investigate replicate datasets to raise the rate of identification and improve reliability. Additionally, optimization of peak detection and database search parameters as well as calculation and removal of contaminants were shown to be advantageous. Manual measurements remain relevant, especially for faint spots where procedures can be adapted individually. In addition we confirmed that immunoproteomics is a powerful hypothesis-free approach to find antigen candidates given that spot identification is performed cautiously.