Pooled ORF Expression Technology (POET)

We have developed a pooled ORF expression technology, POET, that uses recombinational cloning and proteomic methods (two-dimensional gel electrophoresis and mass spectrometry) to identify ORFs that when expressed are likely to yield high levels of soluble, purified protein. Because the method works on pools of ORFs, the procedures needed to subclone, express, purify, and assay protein expression for hundreds of clones are greatly simplified. Small scale expression and purification of 12 positive clones identified by POET from a pool of 688 Caenorhabditis elegans ORFs expressed in Escherichia coli yielded on average 6 times as much protein as 12 negative clones. Larger scale expression and purification of six of the positive clones yielded 47–374 mg of purified protein/liter. Using POET, pools of ORFs can be constructed, and the pools of the resulting proteins can be analyzed and manipulated to rapidly acquire information about the attributes of hundreds of proteins simultaneously.

Projects aiming to convert the thousands of genes made accessible by genomic sequences into their corresponding proteins have met with limited success (1) despite the expenditure of significant resources (2,3). Expression of recombinant proteins in Escherichia coli, the primary host organism for high throughput applications, has been especially unsuccessful for metazoan proteins. For example, one effort directed at producing Caenorhabditis elegans proteins successfully purified only about 2% of those attempted (4).
In the standard approach to high throughput protein expression and purification used in these programs, genes are cloned individually into an expression vector, introduced into an expression host, and expressed in separate cultures. Each culture is subsequently tested for expression and solubility of the corresponding protein, at which point new cultures of positive clones are grown and induced so that protein purification can be attempted. Even with intensive use of robotics, the logistics and costs of this strategy are considerable when thousands of genes are put into such a pipeline.
Here we describe a method called pooled ORF expression technology (POET) that avoids many of the logistical issues associated with high throughput protein expression and purification. POET combines recombinational cloning and collections of sequenced ORFs with proteomic methods (two-dimensional gel electrophoresis (2DGE) and MS) to predict which ORFs in a pool will yield soluble, purified protein. We applied POET to a pool of 688 C. elegans ORFs. A high percentage of ORFs identified in this experiment yielded expressed, soluble, purified proteins in agreement with POET predictions.

EXPERIMENTAL PROCEDURES
ORFs-The C. elegans ORFeome version 1.1 has been described previously (5). The predicted initiating methionine of each ORF was changed to leucine (ATG → TTG), and the stop codon of each ORF was omitted. The DNA concentrations of the 752 Gateway entry clones used in this experiment were determined by PicoGreen (Molecular Probes) fluorescence and used to calculate the molar concentration of each plasmid (based on the size of each ORF and the size of the pDONR201 backbone), which ranged from 0 to 8.73 nM. All wells containing <0.15 nM plasmid concentration were omitted, leaving 688 ORFs. Plasmid DNAs were pooled in bins of 2-fold concentration range, starting with the most concentrated plasmid and going down 2-fold, 4-fold, etc. for a total of six subpools. These subpools were combined volumetrically, 1 volume of the most concentrated subpool, 2 volumes of the next most concentrated subpool, etc. The final pool was ethanol-precipitated and dissolved in Tris-EDTA to a final concentration of 2.5 ng/μl. The result of these manipulations was a single pool in which no plasmid varied in molar concentration from any other plasmid by more than 2-fold.
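The binning and volumetric recombination described above can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure as executed: the function names are hypothetical, and it assumes that each plasmid's contribution to the final pool scales with the relative volume of its subpool (the text does not say how differing subpool sizes were handled).

```python
import math

CUTOFF_NM = 0.15  # wells below this molar concentration were omitted (from the text)

def assign_subpools(conc_nm, cutoff=CUTOFF_NM):
    """Assign each plasmid to a 2-fold concentration bin below the maximum.

    conc_nm maps clone name -> molar concentration in nM.
    Returns clone name -> (subpool index k, relative volume 2**k), where
    subpool 0 holds the most concentrated plasmids.
    """
    kept = {name: c for name, c in conc_nm.items() if c >= cutoff}
    c_max = max(kept.values())
    plan = {}
    for name, c in kept.items():
        k = math.floor(math.log2(c_max / c))  # 0 for (c_max/2, c_max], 1 for the next bin, ...
        plan[name] = (k, 2 ** k)              # 1 volume, 2 volumes, 4 volumes, ...
    return plan

def normalized_amounts(conc_nm, plan):
    """Relative molar amount of each plasmid after volumetric combination.

    Assumption (not stated in the text): amount = concentration x relative volume.
    """
    return {name: conc_nm[name] * vol for name, (_, vol) in plan.items()}

# Hypothetical concentrations spanning the reported 0.15-8.73 nM range.
example = {"orf_a": 8.73, "orf_b": 4.0, "orf_c": 1.0, "orf_d": 0.15, "orf_e": 0.05}
example_plan = assign_subpools(example)
example_amounts = normalized_amounts(example, example_plan)
```

With the reported extremes (8.73 nM and 0.15 nM) the lowest retained plasmid falls in bin index 5, which is consistent with the six subpools described, and the normalized amounts stay within a 2-fold window.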
Protein Expression-The 688 pooled ORFs (as Gateway attL entry clones (6)) were subcloned into pDest527 (T7 promoter, amino-terminal His6 fusion) with LR Clonase (Invitrogen) according to the manufacturer's instructions except that the reaction was allowed to proceed for 5 h at 30°C. Each ORF expressed in this vector contained the sequence MRSGSHHHHHHRSDITSLYKKAG added to its amino end and YPAFLYKVVISLAR added to its carboxyl end due to the lack of the native stop codon. Reaction products were transformed into DH5α cells (Invitrogen), and 1% of the SOC expression mixture was plated on ampicillin (145 colonies). The remaining 99% of the expression mixture was added to 50 ml of CircleGrow (QBiogene) containing ampicillin, and after overnight growth at 37°C plasmid DNA was purified (Brinkmann Fast Plasmid). About 100 ng of pooled expression plasmids were electroporated into E. coli Rosetta (DE3) strain (Novagen), which compensates for eukaryotic codons that are rare in E. coli. Colonies from an aliquot of this reaction indicated that about 1.4 million transformants resulted. The 1 ml of SOC expression mixture was diluted into 50 ml of CircleGrow (containing 100 μg/ml ampicillin) and grown overnight at 37°C. The overnight culture was diluted 1:100 into 1 liter of CircleGrow (containing 100 μg/ml ampicillin), grown at 37°C to an A600 of 0.5, and cooled to 16°C, and protein expression was induced by adding isopropyl 1-thio-β-D-galactopyranoside to a final concentration of 0.5 mM. After 16 h at 16°C cells were harvested and frozen at −80°C.
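The colony counts above imply a certain coverage of the 688-ORF pool at each transformation step. The following sketch works through that arithmetic. The coverage figures follow directly from the reported numbers; the Poisson estimate of the chance that a given ORF is absent is an added assumption (equal representation and independent transformation events) and is not a claim made in the text.

```python
import math

def library_coverage(colonies_counted, fraction_plated, n_orfs):
    """Estimate total transformants, fold coverage, and a Poisson miss probability.

    colonies_counted: colonies observed on the plated aliquot
    fraction_plated:  fraction of the transformation that was plated
    n_orfs:           number of ORFs in the pool
    """
    total = colonies_counted / fraction_plated
    coverage = total / n_orfs
    # Assumption: each ORF is equally represented, so the number of
    # transformants per ORF is approximately Poisson(coverage).
    p_missing = math.exp(-coverage)
    return total, coverage, p_missing

# DH5-alpha step: 145 colonies from 1% of the mixture, 688 ORFs in the pool.
dh5_total, dh5_coverage, dh5_p_missing = library_coverage(145, 0.01, 688)

# Rosetta (DE3) step: about 1.4 million transformants in total.
rosetta_coverage = 1.4e6 / 688
```

Under these assumptions the DH5α transformation corresponds to roughly 14,500 transformants, or about 21 per ORF, and the Rosetta (DE3) transformation to about 2,000 per ORF.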
Protein Purification-All steps were performed at 4°C unless otherwise noted. E. coli cell pastes were resuspended with 2 volumes of extraction buffer/g of wet weight for a final concentration of 20 mM sodium phosphate buffer, pH 7.5, 100 mM NaCl, 5 mM MgCl2, 5% glycerol, 45 mM imidazole, and Complete protease inhibitor-EDTA (Roche Applied Science) at one tablet/50 ml of extract. Extracts were treated with lysozyme (0.5 mg/ml) for 30 min and with Benzonase (Novagen, 10 units/ml) for an additional 20 min. Samples were sonicated to lyse the cells (verified by microscopic examination), adjusted to 500 mM NaCl with solid NaCl, centrifuged at 111,000 × g for 30 min, filtered (0.45 μm, polyethersulfone membrane) and applied at 0.6 ml/min to 1-ml HisTrap columns (Amersham Biosciences) equilibrated with extraction buffer in 500 mM NaCl and 45 mM imidazole (binding buffer). The columns were washed with binding buffer until the levels of protein flowing through the column reached base line, and bound proteins were eluted with binding buffer + 500 mM imidazole, collected in 1-ml fractions, and analyzed by SDS-PAGE. The pools created from the IMAC fractions were precipitated by adding 25% (v/v) TCA to 6% (v/v) final concentration, vortexed, incubated on ice for 5 min, and centrifuged at 16,100 × g for 10 min. The supernatant was removed, and the pellet was incubated with ice-cold acetone for 5 min on ice and then centrifuged at 16,100 × g for 5 min. The supernatant was discarded, and the pellet was dried for 2 min at 70°C, dissolved in room temperature solubilization buffer (8 M urea, 4% CHAPS, 50 mM Tris, pH 8.5) to a concentration of 20 mg/ml (Bio-Rad protein assay), and stored in 50-μl aliquots at −80°C.
Two-dimensional Gel Electrophoresis-Two-dimensional PAGE of 200–1000 μg of pooled protein was performed according to the procedure of O'Farrell (7) with more recent modifications (8,9). In the first dimension, isoelectric focusing was accomplished by IPG using the Amersham Biosciences IPGphor isoelectric focusing system. Affinity-purified samples were dissolved in 450 μl of rehydration buffer (8 M urea, 2% CHAPS, 7 mg DTT, and a trace of bromphenol blue). The rehydration buffer-protein mixture was placed in a 24-cm ceramic strip holder, and a 24-cm IPG strip was slid, gel side down, into the strip holder. Mineral oil was placed on top of the gel to minimize evaporation and covered with the strip holder plastic cover. The ceramic strip holder was placed in the IPGphor unit for isoelectric focusing and set at 30 V for 12 h, 500 V for 1 h, 1000 V for 1 h, and 8000 V for 8 h.
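Focusing programs are often compared by their total volt-hours. As a rough illustration, the program settings above can be summed as follows. This assumes each step is held at its set voltage for the full duration (no ramping and no current limiting), which the text does not specify, so the result is an upper-bound estimate from the settings rather than a measured value.

```python
# (set voltage in V, duration in h) for each step of the focusing program in the text
FOCUSING_PROGRAM = [(30, 12), (500, 1), (1000, 1), (8000, 8)]

def total_volt_hours(program):
    """Sum voltage x time over all steps, assuming step-and-hold operation."""
    return sum(volts * hours for volts, hours in program)

def total_hours(program):
    """Total run time of the program in hours."""
    return sum(hours for _, hours in program)

volt_hours = total_volt_hours(FOCUSING_PROGRAM)
run_time = total_hours(FOCUSING_PROGRAM)
```

Under the step-and-hold assumption the settings correspond to 65,860 Vh over 22 h, of which the 8000 V step contributes 64,000 Vh.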
Prior to the second dimension step of separation by molecular weight, the IPG strip was equilibrated with an SDS buffer system. The equilibration solution contained 50 mM Tris-HCl, pH 8.8, 6 M urea, 30% glycerol, 2% SDS, and a trace of bromphenol blue. Prior to use, 100 mg of DTT was added in 10 ml of equilibration buffer. The IPG strips were placed in individual tubes containing the buffer. The tubes were then placed on a rocker and equilibrated for 12 min. A second equilibration was performed with 250 mg of iodoacetamide solution (instead of DTT) and incubated for another 12 min. The equilibrated IPG strip was then inserted into a cassette containing a precast Ettan DALT II 12.5% polyacrylamide gel, and contact was made with the gel. Enough melted agarose was added to cover the IPG strip. The 2DGE chamber was filled with anode buffer (0.5 M diethanolamine, 0.5 M acetic acid). Cathode buffer (0.1% SDS, 0.192 M glycine, 0.025 M Tris) was added to the top chamber. The running conditions were set in the power supply (phase 1, 5 watts/gel/15 min; phase 2, 150 watts/gel), and electrophoresis continued until the bromphenol blue dye front reached the bottom of the gel (~4–5 h). Once the dye front reached the end of the gel, the cassettes were removed, and the gels were placed in Coomassie Brilliant Blue staining solution (25% isopropanol, 10% acetic acid, 0.05% R250 Brilliant Blue) overnight. Gels were then placed in destain solution (30% methanol, 10% acetic acid). All staining/destaining procedures were carried out in glass trays placed on a slowly oscillating rocker table.
Spot Picking, Digestion, and Analysis-The spots on the 2D PAGE gel were numbered 1-170, and small pieces were retrieved from the center of each spot. Coomassie Blue-stained protein gel spots were digested with trypsin as described previously (8). Samples were desalted with C18 Zip Tips (Millipore, Bedford, MA) according to the manufacturer's protocols prior to MS analysis. Chromatographic separations of desalted tryptic peptides were conducted using a 75-μm-inner diameter × 360-μm-outer diameter × 10-cm-long fused silica capillary column (Polymicro Technologies Inc., Phoenix, AZ) with one end flame-pulled to a fine tip (~5–7-μm orifice). The column was slurry-packed in-house with 3-μm, 300-Å pore size C18 stationary phase (Vydac, Hercules, CA). Nanoflow reversed-phase LC was performed using an Agilent 1100 nanoflow LC system (Agilent Technologies, Palo Alto, CA) coupled on line to a linear ion trap (LIT) mass spectrometer (LTQ, ThermoElectron, San Jose, CA). Reversed-phase separations were conducted after injecting 5 μl of sample for each analysis. The columns were connected via a stainless steel union to an Agilent 1100 nanoflow LC system (Agilent Technologies), which was used to deliver solvents A (0.1% HCOOH in water) and B (0.1% HCOOH in CH3CN). After sample injection, a 20-min wash with 98% mobile phase A was applied, and peptides were eluted using a linear gradient of 2% mobile phase B to 42% mobile phase B over 40 min with a constant flow rate of 200 nl/min. The column was washed for 15 min with 98% mobile phase B and re-equilibrated with 98% mobile phase A prior to subsequent sample loading.
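The LC program described above can be written as a piecewise function of time, which makes the segment boundaries explicit. This is a sketch under stated assumptions: the change from 42% to 98% B is treated as an instantaneous step, and the re-equilibration (whose duration is not given in the text) is represented simply as a return to 2% B. The function name is hypothetical.

```python
def percent_b(t_min):
    """Percent mobile phase B at t_min minutes after sample injection.

    Segments taken from the text:
      0-20 min   wash at 98% A (2% B)
      20-60 min  linear gradient from 2% B to 42% B
      60-75 min  column wash at 98% B
      > 75 min   re-equilibration at 98% A (duration not specified)
    """
    if t_min < 0:
        raise ValueError("time must be non-negative")
    if t_min < 20:
        return 2.0
    if t_min <= 60:
        return 2.0 + (t_min - 20.0) * (42.0 - 2.0) / 40.0
    if t_min <= 75:
        return 98.0  # assumption: step change, no ramp from 42% to 98% B
    return 2.0

# Sample the program every 5 minutes across one run.
profile = [(t, percent_b(t)) for t in range(0, 80, 5)]
```

The gradient slope works out to 1% B per minute, so the midpoint of the elution window (40 min after injection) corresponds to 22% B.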
The nanoflow reversed-phase LC column was coupled on line to a LIT mass spectrometer using the manufacturer's nanoelectrospray source with an applied electrospray potential of 1.5 kV and capillary temperature of 160°C. The LIT mass spectrometer was operated in a data-dependent mode in which each full MS scan was followed by five MS/MS scans of the five most abundant peptide molecular ions detected in that MS scan, dynamically selected and fragmented using a CID energy of 35%. The CID spectra were analyzed using SEQUEST operating on a Beowulf 18-node parallel virtual machine cluster computer (ThermoElectron) using a combined non-redundant C. elegans and E. coli proteome database (www.expasy.org). Only peptides with conventional tryptic termini (allowing for up to two internal missed cleavages) possessing Δ correlation scores (ΔCn) >0.08 and charge state-dependent cross-correlation (Xcorr) criteria as follows were considered as legitimate identifications: >1.9 for +1 charged peptides, >2.2 for +2 charged peptides, and >3.1 for +3 charged peptides.
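The acceptance criteria above amount to a simple filter on each peptide match. The sketch below encodes the stated thresholds; the function and argument names are hypothetical and do not correspond to SEQUEST's own output fields, and rejecting charge states other than +1 to +3 is an assumption, because the text gives no thresholds for them.

```python
# Charge state-dependent Xcorr thresholds from the text (strictly greater than)
XCORR_MIN = {1: 1.9, 2: 2.2, 3: 3.1}
DELTA_CN_MIN = 0.08
MAX_MISSED_CLEAVAGES = 2

def is_legitimate(charge, xcorr, delta_cn, fully_tryptic, missed_cleavages):
    """Return True if a peptide match satisfies the stated identification criteria."""
    threshold = XCORR_MIN.get(charge)
    if threshold is None:
        # Assumption: charge states without a stated threshold are rejected.
        return False
    return (
        fully_tryptic
        and missed_cleavages <= MAX_MISSED_CLEAVAGES
        and delta_cn > DELTA_CN_MIN
        and xcorr > threshold
    )

# Hypothetical matches: (charge, Xcorr, delta Cn, fully tryptic, missed cleavages)
candidates = [
    (2, 2.5, 0.10, True, 0),   # passes all criteria
    (3, 3.0, 0.20, True, 1),   # Xcorr too low for a +3 peptide
    (1, 2.4, 0.05, True, 0),   # delta Cn too low
]
accepted = [c for c in candidates if is_legitimate(*c)]
```

Because the thresholds are strict inequalities, a +2 peptide with Xcorr exactly 2.2 would not pass.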
Testing of Predicted Positive ORFs-Based on visual inspection of the 2D gel and mass spectrometer identifications of the 165 total spots, 12 individual ORFs were retrieved from the ORF plates. These were subcloned into pDest527 and expressed in E. coli Rosetta in 700-μl cultures in a 24-well dish to an A600 of 0.5, then transferred to 17 × 100-mm polypropylene tubes (Falcon 2059), cooled to 16°C, induced with isopropyl 1-thio-β-D-galactopyranoside (0.5 mM), and expressed overnight at 16°C. To determine the fraction of soluble and insoluble protein, cells were lysed with detergent (ReadyPreps, Epicenter), and soluble and insoluble fractions were applied to SDS-PAGE. Recombinant His6 fusion proteins were purified from the soluble fractions with Swell Gel beads (Pierce) and spin columns. Six ORFs chosen from this small scale experiment were grown at 1-liter scale and purified using the same procedure as the pool of 688 (above). Concentrations of the proteins in the small and large scale preparations were determined using the Bio-Rad protein assay.

Description of POET-The POET scheme is shown in Fig. 1. Hundreds of ORFs are pooled, and the pooled ORFs are subcloned en masse into a protein expression vector supplying an affinity purification tag. The resulting pool of expression plasmids is introduced into an appropriate host and expressed in a single culture of host cells, and the tagged expressed proteins are purified away from host proteins. (For a culture of volume V containing n ORFs, the protein from any particular ORF is derived from V/n cells, whereas host proteins are derived from V cells. Thus from mass considerations host proteins are much more abundant than proteins from individual ORFs.) The mixture of ORF proteins is then separated by 2DGE, and individual proteins are identified by MS. Proteins in intensely staining spots are predicted to be expressed as abundant, soluble proteins that can be easily purified. These predictions are confirmed by retrieving and expressing individual ORFs from the original ORF collection.
Subcloning of Pooled ORFs-A pool of 688 ORFs in the form of Gateway entry clones (6) was created from the C. elegans ORFeome (5). Wide variations in ORF DNA concentration were corrected by first creating subpools of clones having similar concentrations and then combining the subpools volumetrically (overall molar variation ≤2-fold). The pooled ORFs were subcloned via a Gateway LR reaction into an E. coli expression vector (pDest527) that added a hexahistidine (His6) tag to the amino terminus of the protein expressed from each ORF. The LR reaction was transformed into a non-expression E. coli strain (DH5α), and purified DNA from this transformation was then transformed into E. coli strain Rosetta (DE3) for subsequent protein expression.
Protein Expression, Purification, and Identification-Proteins were expressed from the pooled ORFs at 16°C (1-liter culture, equivalent to 1.45 ml for each ORF in the pool), the E. coli cells were lysed by sonication, and soluble proteins were purified by IMAC. The purified protein pool was precipitated with acid, dissolved in urea/CHAPS buffer, and resolved by 2DGE (Fig. 2). A large number of spots on the gel were identified by MS to understand more fully parameters important to the POET process. The most intense spots on the gel were identified as E. coli proteins DnaK, GroEL, SlyD, and OmpF (Fig. 2). Because the isoelectric focusing range on the gel was pH 4–7, about 200 of the 688 C. elegans proteins were predicted to appear on the 2D gel based on their calculated pI values. A total of 50 C. elegans proteins and 37 E. coli proteins were identified by MS of 165 spots selected from the gel for analysis (see the supplemental table).
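The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below reproduces the per-ORF culture volume and derives two ratios that are not stated in the text but follow from its numbers: the share of the pool expected within the pH 4–7 window and the share of those expected proteins that were identified. The function names are hypothetical, and "about 200" is used as an exact value for illustration only.

```python
def per_orf_culture_ml(culture_ml, n_orfs):
    """Culture volume attributable to each ORF, assuming equal representation."""
    return culture_ml / n_orfs

def fraction(part, whole):
    """Simple ratio helper for the tallies below."""
    return part / whole

N_ORFS = 688
EXPECTED_IN_PI_WINDOW = 200   # "about 200" proteins with calculated pI of 4-7
IDENTIFIED_C_ELEGANS = 50
IDENTIFIED_E_COLI = 37

volume_per_orf = per_orf_culture_ml(1000, N_ORFS)               # 1-liter culture
share_in_window = fraction(EXPECTED_IN_PI_WINDOW, N_ORFS)
share_identified = fraction(IDENTIFIED_C_ELEGANS, EXPECTED_IN_PI_WINDOW)
```

On these numbers each ORF corresponds to about 1.45 ml of culture, roughly 29% of the pool was expected to focus within the gel's pI range, and about a quarter of those expected proteins were identified.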
Small Scale Verification of POET Results-Twelve C. elegans proteins were chosen from the 2D gel in a blinded fashion using only the number of spots and their intensity and purity (i.e. spots containing more than one protein were given less weight) as selection parameters (Table I). Twelve negative clones were chosen from ORFs that were not identified in any of the 2D gel spots that were examined. The 24 ORFs corresponding to these proteins were cloned individually into the same E. coli expression vector (amino His6 fusions) described above and expressed in 700-μl cultures at 16°C, and the resulting proteins were affinity-purified with spin columns. Total and soluble proteins from each culture (Fig. 3, a and b) and comparison of the purified proteins (Fig. 3c and Table I) validate the predictions of the POET method. Proteins from the positive clones were much more likely to be soluble (Fig. 3, a versus b) and give abundant purified protein (Fig. 3c) than the negative clones. An average of 6 times as much protein was recovered from the positive clones as from the negative clones (Table I). Most of the proteins migrated in accord with their predicted molecular weights. Proteins in lanes 2 ("protein with tau-like repeats, isoform a") and 6 ("hypothetical protein F09G2.9") had extensive predicted hydrophobic regions that may account for their abnormally low electrophoretic mobilities in both the 2D (Fig. 2) and one-dimensional (Fig. 3) gels. We speculate that the "negative" protein in lane 13 was not identified on the 2D gel because its solubility was low at its isoelectric point, and it failed to leave the isoelectric focusing strip.
Large Scale Verification-To verify that the small scale experiments could predict successful larger scale behavior, six of the positive ORFs were expressed in E. coli individually in 1-liter cultures. Soluble proteins were released from cells by sonication and ultracentrifugation and purified on preparative  Table I for protein identities.

DISCUSSION
POET is a procedure for finding which ORFs in a collection of hundreds of ORFs can be most efficiently converted, by cloning, expression, and purification, into their corresponding recombinant proteins. By first combining n ORFs into a single pool, tasks that are difficult to accomplish hundreds of times (transformation, plating, colony picking, culture, induction, lysis, assays of solubility, and purification) are reduced in number n-fold. The problem that then arises, of course, is how to identify which proteins in the purified pool are the most abundant. MS can identify proteins with spectacular sensitivity, but it is also dramatically non-quantitative (due to the unpredictable ionization of peptides of different amino acid sequence). Thus MS alone, although it can identify all the proteins in the pool, cannot distinguish between the most abundant and least abundant proteins in that pool. (Isotopic labeling methods such as ICAT (10) are not useful, because they give relative quantitation of the same protein in two different samples, whereas POET requires relative quantitation of many different proteins in one sample.) We chose 2DGE to determine abundance of the expressed, purified proteins in our POET pool. 2D gels have limitations of size and pI, and running them is not trivial. But they can resolve thousands of proteins, and often individual spots contain a single protein (see the supplemental table). For the purposes of the present study we assumed that the size and intensity of stained spots were a reasonable indicator of the abundance of each protein. Combinations of liquid chromatography columns could be used to generate dozens of fractions of the protein pool, but few if any fractions would contain single proteins, and quantitation would thus be unsatisfactory.
We wished to identify a large number of the spots on the 2D gel so that we could understand the parameters of the POET experiment more completely. Many of the highly abundant spots that were identified were E. coli stress response proteins that would not require reidentification in the analysis of subsequent ORF pools, allowing the focus to be only on new spots from C. elegans clones. Because many of the manual steps at the end of POET can be automated (spot identification, picking, digestion, MALDI plate preparation, and MALDI-TOF/TOF peptide identification), we estimate the net efficiency gain for POET to be 10–100-fold when compared with automated or manual one-by-one methods, respectively.
Although the POET scheme is straightforward, we recognize there are underlying assumptions that can affect its results. 1) All the ORFs in a pool should retain their representation during subcloning and subsequent transfer into expression hosts. Recombinational cloning appears to be essential to minimize size bias and maximize efficiency. Loss of some clones due to toxic effects of ORF expression will tend to increase the representation of remaining clones on the 2D gels. 2) POET assumes that the intensities of spots on 2D gels reflect the amounts of each protein in the purified pool prior to electrophoresis. However, not all proteins remain soluble during isoelectric focusing, and some proteins migrate as multiple spots. 3) Soluble proteins may interact as they are released from cells. The assumption is that as the pool is purified the effect of any one recombinant protein on the behavior of any other member of the pool is small. 4) More than one ORF may be expressed in a particular host cell. POET assumes that coexpression of any two ORFs in the same cell will be distributed more or less randomly among all the ORFs in the pool, and the influence of any one ORF on the behavior of any other ORF is small.
We foresee numerous applications of POET. 1) The large data sets that can be produced by POET experiments could provide researchers with a priori knowledge of the most appropriate context in which to express and purify any candidate protein of interest. 2) Because the solubility of overexpressed proteins is often low, one could take the insoluble fraction of proteins from a POET experiment, divide the insoluble proteins into aliquots, subject each aliquot to a different refolding regimen, and identify which procedure works best for any protein in the pool, thus obtaining hundreds of results for each refolding protocol. 3) Small differences in amino acid sequence can cause vastly different behaviors of proteins during overexpression and purification. Using POET, proteins from mouse, rat, and other model organisms can be attempted if purification of the homologous human proteins fails. 4) Because membrane proteins are difficult to extract and purify, ORF pools comprising the extracellular and/or intracellular domains of hundreds of membrane proteins could be constructed, expressed, purified, and analyzed in POET experiments. 5) The optimal number of ORFs in a pool can be adjusted in conjunction with protein expression, purification, and separation parameters. Clearly, the larger the number of ORFs in each pool, the fewer overall experiments are required. However, as the number of ORFs increases the average intensity of each ORF spot on the 2D gel decreases, and the intensities of ORF spots decrease compared with spots from expression host proteins. Analysis of relatively large pools of ORFs should be improved by comparison of multiple pools that contain unrelated ORFs because host proteins in each pool are relatively constant and can be ignored.