Identification of Quantitative Trait Loci Underlying Proteome Variation in Human Lymphoblastoid Cells*

Population-based variability in protein expression patterns, especially in humans, is often observed but poorly understood. Moreover, very little is known about how interindividual genetic variation contributes to protein expression patterns. To begin to address this, we describe elements of technical and biological variations contributing to expression of 544 proteins in a population of 24 individual human lymphoblastoid cell lines that have been extensively genotyped as part of the International HapMap Project. We determined that expression levels of 10% of the proteins were tightly correlated to cell doubling rates. Using the publicly available genotypes for these lymphoblastoid cell lines, we applied a genetic association approach to identify quantitative trait loci associated with protein expression variation. Results identified 24 protein forms corresponding to 15 proteins for which genetic elements were responsible for >50% of the expression variation. The genetic variation associated with protein expression levels were located in cis with the gene coding for the transcript of the protein for 19 of these protein forms. Four of the genetic elements identified were coding non-synonymous single nucleotide polymorphisms that resulted in migration pattern changes in the two-dimensional gel. This is the first description of large scale proteomics analysis demonstrating the direct relationship between genome and proteome variations in human cells.

Recent research has shown that gene expression variation in both humans and model organisms behaves as a complex genetic trait (1)(2)(3)(4)(5). The genetic basis of gene expression can be treated in a manner similar to that of other complex quantitative phenotypes, such as body weight and blood pressure. The identification of quantitative trait loci associating with individual mRNA expression levels (expression quantitative trait loci (eQTL) 1 ) has been reported in yeast and mice using recombinant inbred strains (6 -11). It has also been described in humans using lymphocytes, Epstein-Barr virus-transformed lymphoblastoid cell lines (LCLs), and tissue biopsies (10,(12)(13)(14)(15)(16)(17)(18)(19)(20)(21). Several recent studies using eQTL analysis have also been used to narrow the list of candidate genes for Type I diabetes and coronary artery disease susceptibility in obese mouse strains (7,12,13). These initial studies have established that more than 80% of mRNA expression phenotypes are heritable and that, on average, 30% of the variation is due to the genetic sources. As a result, the mapping of eQTL responsible for gene expression variation in disease suggests that eQTL association studies may be an additional powerful method in uncovering the causal genetic determinants of disease and disease susceptibility (5,7,11,12,18,20,(22)(23)(24)(25)(26).
Most disease phenotypes have both causal and reactive relationships to several hundred protein changes across multiple tissue types and environmental variables (26 -28). Consequently, the genetic components that are involved in complex disease could number in the hundreds. However, disease phenotype variations must relate to variation of individual proteins. One hypothesis is that each protein trait should be less complexly associated with the genome than is a clinical disease phenotype. Therefore, determining associations between genetic polymorphism and individual protein traits affords a method of building the complex relationship between genome and disease from the bottom up. Furthermore, an understanding of this relationship has additional practical implications in disease and drug response protein biomarker research in humans where identification of associating poly-morphisms could be utilized as co-variables in the discovery process.
The regulation of any given protein level is influenced by transcript abundance and post-translational modifications that can increase or decrease protein turnover in response to physiological conditions. This additional regulation at the protein level means that, frequently, transcript levels do not correlate well with corresponding protein expression levels (29). Furthermore, many disease phenotypes are directly related to the degree and type of protein post-translational modifications, such as alterations in protein folding and ubiquitin modifications (30 -32). Therefore, a deeper understanding of the molecular relationship between genotype and disease can be ascertained by studying the relationship between the genome and proteome.
Efforts to identify genetic elements associated with protein expression variation in yeast and mouse brain tissue have been described (33)(34)(35). Using linkage analysis in yeast, Foss et al. (34) compared results for both transcript and proteinassociated QTL in identical samples resulting from an intercross of two divergent yeast strains. This study established that proteomic variation is just as heritable as transcript variation; however, loci associated with transcript variation was more readily mapped compared with those associated with protein variation. This implies that either more technical or stochastic noise is present in a proteomic evaluation or that proteomic regulation is more complexly related to the genome than is transcript variation. In support of the latter assertion, more of the linkages that were identified for protein regulation were located outside of the region of the genome that codes for the transcript (i.e. in trans). In other words, alleles not associated with the transcript coding region were more likely to appear as the primary regulators of protein expression in this system. A similar observation was made by Klose et al. (35) in describing the identification of linkages associated with shifts in migration patterns on two-dimensional (2D) gels of mouse brain tissue.
In humans, although there is a significant body of literature describing efforts to identify transcript QTL, efforts to identify proteome QTL have lagged behind. In one study, Melzer et al. (33) describe experiments to associate single nucleotide polymorphisms (SNPs) with levels of 42 serum proteins detected in clinical assays in a population of 1200 human subjects. This effort resulted in the identification of protein QTL associating with eight serum proteins. In contrast to the work described for yeast and mouse linkage analysis by Klose et al. (35) and Foss et al. (34), the QTL identified in this study were mostly located in the region surrounding the transcript-coding allele (i.e. in cis). Collectively, these previous studies highlight the additional complexities of identifying genetic elements associated with proteins as compared with genetic elements associated with transcripts. Therefore, success for such studies in the future will be highly dependent on identifying the most appropriate methods for quantitatively evaluating the proteome.
Large scale quantitative proteomics is a challenging methodological process. Shotgun proteomics approaches in which proteins are digested in mixtures and separated by liquid chromatography upstream of mass spectrometric detection is increasingly used for large scale unbiased proteomics. In this approach, relative protein quantitation can be carried out using either differential labeling with isotopic reagents or label-free methods, such as peak height determination or spectral counting (36 -41). When using proteomics to uncover the relationship between genomic polymorphisms and proteomic variation, a more appealing approach would be to target intact proteins where differences in isoforms and post-translational modifications can be reliably monitored in contrast to the shotgun approach. Currently, the most reliable method of unbiased quantitative analysis of intact proteins is 2D DIGE. This approach takes advantage of the quantitative properties of fluorescent protein labeling dyes coupled with large format 2D DIGE (42,43).
In this study, we describe technical, biological, and genetic sources of proteomic variation in a small cohort of human LCLs derived from the Centre d'Etude du Polymorphisme (CEPH) cohort of Caucasians of European descent living in Utah (CEU) using 2D DIGE. The purpose of this work is to identify genetic elements responsible for variation in protein expression in human cells. As mentioned above, 2D DIGE is unique among proteomics methods in allowing the ability to quantitatively monitor differential modifications of intact proteins independently. Therefore, this approach should allow for observation of regulatory polymorphisms influencing protein expression levels in addition to other polymorphisms that could influence the pI and molecular weight of any given protein.
To determine how much variation in protein expression levels detected by 2D DIGE analysis can be explained by genetic differences between the cell lines, we performed a protein expression quantitative trait locus (peQTL) analysis of 544 basal protein expression phenotypes using publicly available genotype data and both cis-only and genome-wide association (GWA) approaches. As a result, we identified a total of 24 peQTL associated with individual protein phenotypes. These results affirm that variability in cellular protein expression in human LCLs is influenced by polymorphisms in the genome and provide the first broad scale proteomics analysis to complement previous eQTL efforts using this model system. Furthermore, our results demonstrate that 2D DIGE has both the sensitivity and specificity to be used in understanding sources of biological variation in the proteome.

EXPERIMENTAL PROCEDURES
2D DIGE Materials-Sodium chloride, Trizma (Tris base), bromphenol blue, chloroform, N,N-dimethylformamide, phosphatase inhibitor, sodium carbonate, sodium bicarbonate, ammonium bicarbonate, ammonium monobasic phosphate, L-lysine, and ␣-cyano-4-hydroxycinnamic acid were obtained from Sigma-Aldrich. Urea, thiourea, SDS, DTT, Tris, and iodoacetamide were obtained from GE Healthcare. Protein labeling dyes (CyDye DIGE fluors) with an N-hydroxysuccin-imide ester reactive group for reaction with the amino group of lysine of proteins were obtained from GE Healthcare. CHAPS was obtained from USB Corp. Complete Mini protease inhibitor mixture tablets were obtained from Roche Applied Science. Methanol, acetonitrile, and HPLC grade water were obtained from Burdick & Jackson. Acetic acid was obtained from Mallinckrodt Baker. CEPH-CEU LCLs were acquired from the Coriell Institute for Medical Research.
Sample Preparation and Protein Labeling-Upon thawing, cells were lysed with 180 l of TNE buffer (50 mM Tris-HCl, pH 7.6, 150 mM NaCl, 2 mM EDTA, pH 8.0, 1% Nonidet P-40) with protease and phosphatase inhibitors added. Soluble protein was extracted by centrifugation at 14,000 rpm for 10 min at 4°C. Protein concentration was determined using the BCA Protein Assay kit (Pierce). Protein disulfide bonds were reduced in 2 mM DTT. For each replicate, 25 g of each sample was precipitated using 4:1 (v/v) methanol:chloroform. The samples were washed once with methanol and resuspended in L-buffer (7 M urea, 2 M thiourea, 4% CHAPS, 30 mM Tris, pH 8.5). An internal standard was created by pooling 8.5 g of each sample. The remainder of each sample was then labeled with 3 pmol of either Cy3 or Cy5 dye/ g of protein. The internal pooled standard was labeled using 3 pmol/g Cy2. Reactions were stopped by adding 10 mM lysine in equal volume to the volume of dye. For each gel, a Cy3-and Cy5-labeled protein sample was mixed with an equal volume of Cy2-labeled standard. Samples were then reduced and denatured in R-buffer (7 M urea, 2 M thiourea, 4% CHAPS, 13 mM DTT) at 10:1 (v/v) with ampholytes (IPG Buffer) (GE Healthcare). The samples were then applied to a 24-cm pH 4 -7 Immobiline DryStrip (GE Healthcare).
2D DIGE and Image Analysis-First dimension isoelectric focusing was performed using an Ettan IPGphor3 (GE Healthcare) for a total focusing time of 64.5 kV-h at 20°C and 75 A/strip maximum. After focusing was complete, each strip was equilibrated with 10 ml of equilibration buffer (6 M urea, 75 mM Tris, pH 8.8, 29.3% glycerol, 2% SDS, 0.002% bromphenol blue) with 10 mg/ml DTT added for 15 min. The strips were then equilibrated with 10 ml of equilibration buffer with 25 mg/ml iodoacetamide for 15 min. The strips were then loaded onto 12% homogenous SDS-polyacrylamide gels (NextGen Sciences). Electrophoresis was performed on an Ettan DALTtwelve System Separation Unit (Amersham Biosciences) at 0.5 watt/gel until the bromphenol blue dye front had just run off the bottom of the gel. Once complete, the gels were scanned at 488, 532, and 633 nm using a Typhoon Trio Variable Mode Imager (GE Healthcare) and analyzed using the DeCyder 2D v6.5 software (GE Healthcare).
Protein Identifications-A pool of 10.5 g of protein from each sample was used to generate a preparative 2D gel for spot excision and protein identification. Visualization of proteins was performed by poststaining using a 1:200 dilution of Deep Purple Total Protein Stain (GE Healthcare) for 1 h. The image was then imported into the DIGE analysis using the DeCyder software and matched to the prior analytical gels. Gel plugs corresponding to 492 proteins were isolated using an Ettan Spot Picker (GE Healthcare). Dehydrated gel spots were digested using Trypsin Gold (Promega) by overnight incubation. The digests were then desalted and concentrated using C 18 P10 ZipTips (Millipore). One-third of each sample was loaded onto the ZipTip by pipetting up and down five times into a fresh 96-well plate. The sample was then washed four times with 10 l of 0.1% TFA and eluted with 2 l of 50% acetonitrile, 0.1% trifluoroacetic acid. A total of 1 l of eluate was pipetted onto a clean MALDI plate and covered with 1 l of ␣-cyano-4-hydroxycinnamic acid MALDI matrix. Mass spectra of each spot were acquired using an Applied Biosystems 4700 Proteomics Analyzer MALDI-TOF-TOF instrument. Data were acquired in reflectron positive mode for the mass range 900 -3500 Da. Shots per spectrum for both MS and MS/MS totaled 2000. Data-dependent MS/MS analysis was performed on the top 10 peaks from each MS spectrum. Peaks excluded from analysis correlated with trypsin autolysis peaks at 844.0, 1047.5, and 2257.5 Da. Peak lists were generated using the 4000 Series Explorer v3.0. The data were searched using Mascot v1.9.0 (Matrix Sciences) against the human NCBI protein RefSeq database build 36.3 consisting of 37,742 sequences. Search parameters included trypsin enzyme specificity with two permitted miscleavages. Variable modifications included carbamidomethyl-Cys and oxidized Met. Mass tolerances for both precursor and fragment ions were 1 Da. Identification criteria included a Mascot score Ͼ65 (selected based on a corrected p value Ͻ0.01), at least 5 orders of magnitude difference in the expect score between non-isoformic first and second hits, at least two unique MS/MS peptides scoring at least 20 (p Ͻ 0.01), and the protein having a molecular weight similar to experimentally derived values. Where the second hit was an isoform of the first and had identical scores, the first is reported. Where the second hit was an isoform with a lesser score, the higher scoring isoform is reported. Where a peptide matched to multiple proteins in the database, the peptide was assigned to the higher scoring protein. All peptide mass matches for an accepted protein are reported in supplemental Table 2 regardless of whether the matching peptide from the MS/MS search achieved the score cutoff.
Association Analysis-Protein expression levels from 544 spots derived from the average between two technical replicates were used as protein expression phenotypes for genetic association analysis. Genotype information from 3.8 million SNPs was downloaded from the International HapMap Project web site. Because of our small sample size, we removed SNPs with minor allele frequency Ͻ0.1 and genotype missingness Ͼ0.1 prior to association. This filtering resulted in a set of 1.7 million SNPs that were used in the final association analysis. For the cis-only analysis, SNPs were limited to the 200-kb region surrounding the transcript-coding allele. For each protein, expression levels were log-transformed and treated as the response variable. Based on the allele composition of the genotypes, each SNP (coded as 0, 1, and 2) constituted the single dependent variable, and a linear regression was performed using the genetic toolkit PLINK (44). Statistical significance of the strength of the relationship between the response variable and the dependent (genotype) variable was measured using the Wald test. Multiple testing corrections were made using the Benjamini and Hochberg false discovery rate (BH-FDR) procedure (45). Results were visualized and annotated using WGAViewer (46). Power calculations were performed using the genetic power calculator (47). We used the following settings: (a) locusspecific heritability ranging from 0.03 to 0.8, (b) sibling correlation ϭ 0.5, (c) linkage disequilibrium (LD) (DЈ) ϭ 1, (d) QTL increaser allele frequency ϭ 0.3, (e) marker allele frequency ϭ 0.3, (f) sample size ϭ 16, sibship size ϭ singletons (both parents genotyped), and (g) type I error rate (␣) ϭ 0.0001 and 0.0000001. Higher sibling correlation, a measure of total heritability of the traits, increases the power. Hence, we considered a sibling correlation of 0.5 as a reasonable setting: not too high, not too low, and typical for many traits. Locus-specific heritability values were considered from 0.03 (very small) to 0.8 (large). Because we considered 1.7 million SNPs, there was a high chance that the markers and the pQTL were near to each other; therefore, we set LD (DЈ) to 1. LD is the non-random association of alleles at two or more loci. DЈ is a standardized measure of the degree of LD with a maximal value of 1. Supplemental Table 5 summarizes the power for cis and genome-wide analyses. We considered an ␣ of 1.00eϪ07 for genome-wide analysis because it involved 1.7 million SNPs per trait. The cis-only analysis involved far fewer SNPs per trait; therefore, power calculations considered an ␣ of 0.0001 for cis-only analysis.

RESULTS
Lymphoblast Proteome-An overview of the experimental approach is shown in Fig. 1. A 2D DIGE analysis was performed to determine base-line proteome expression levels in LCLs derived from 24 individual cell lines derived from 16 unrelated individuals and their eight children. Gel spots were numerically identified and matched between gels using the internal standard. Gel images were manually examined to determine the quality of the matching. Incorrectly matched spots were eliminated or corrected if possible. Spots were then eliminated from analysis if they appeared to be artifacts from streaking or were affected by streaking of another spot. Spots were also eliminated if not detected in more than one gel. As a result of these quality control measures, a total of 544 protein spots were subsequently analyzed (Fig. 2).
A separate preparative gel of the pool of all 24 samples was run and poststained with Deep Purple. On this gel, 492 of the 544 proteins appeared with high spot resolution and abundance. Gel plugs for these 492 proteins were isolated and subjected to an in-gel trypsin digestion followed by MALDI-MS/MS analysis. Identification of proteins was determined by combined peptide mass fingerprinting and MS/MS fragmentation pattern using Mascot. Using a Mascot cutoff score of 65 (false positive rate Ͻ1%), proteins within 311 spots were identified. These 311 spots corresponded to 243 different proteins. Predicted and actual molecular weights and pI values for proteins were compared for verification. All protein identifications are presented in supplemental Table 1. Protein and individual peptide IDs are presented in supplemental Table 2.
Proteome Transcription Correlation Analysis-The correlation of basal expression levels of the proteins with corresponding transcripts was analyzed using the publicly available raw transcript data as reported by Morley et al. (20) and acquired from the Gene Expression Omnibus accession number GSE2552. Pearson's correlation coefficient was calculated to compare protein expression values for the 16 individual parents with the gene expression values for the same 16 individuals. Each expression value was derived from the average of two replicates. Of the 311 identified spots, 251 could be directly compared with transcript values from the published microarray analysis. Of these, 41 showed a significant and highly positive correlation with transcript data (r Ͼ 0.5 and p Ͻ 0.05), and only two showed a significant negative correlation (r Ͻ Ϫ0.5 and p Ͻ 0.05). The remaining proteins did not correlate significantly with transcript levels (supplemental Table 3).
Sources of Variation-To address a concern that lower intensity spots would exhibit greater technical variability and therefore reduce the power to determine sources of biological variation among lower abundance proteins, a linear regression was performed to determine the role of spot volume in population variance. Averaged volume of the same gel spot detected across the population of 24 individuals was used as the predictor variable. Population variance for each spot calculated from the standardized and normalized log expression abundance values across the 24 individuals was used as the dependent variable. As shown in Fig. 3A, following normalization and standardization, spot volume accounts for only 1% of the experimental differences (r 2 ϭ 0.01).
Another variable that could contribute to protein expression variation is the cell population doubling time (PDT). As reported by Choy et al. (48), many transcript phenotypes in these same lymphoblast cell lines were linked to the growth rate and ATP levels of the individual cell lines. Such an observation confounds the ability to find true genetic effects. To determine the PDT, cell lines were plated at 2 5 cells/ml and were counted daily until cells reached 10 6 cells/ml. PDTs ranged from 0.21 to 0.82 doublings/day across the 24 cell lines. A regression analysis was performed using expression values for each protein as the response variable and the PDT of each cell line as the predictor variable. The expression levels of 53 proteins were significantly related to PDT (p Ͻ 0.05, r 2 Ͼ 0.5). The 53 proteins represent ϳ10% of the total number of proteins analyzed, slightly more than that reported for transcripts by Choy et al. (48). Protein expression was evenly split between positive PDT associations and negative associations (supplemental Table 4). Two examples of PDTassociated protein expression patterns are shown in Fig. 3, B and C. Among these 53 spots, we were able to identify 26 proteins. Functional annotation of these proteins revealed enrichment for nucleotide-interacting proteins and proteins with ATPase activity (p ϭ 0.003; not shown).
Among the proteins associating with cis-acting SNPs was a six-spot cluster in the 2D DIGE gel identified as L-plastin (LCP1) (spots 675, 677, 687, 688, 691, and 693). Several polymorphisms in the region of the LCP1 gene were observed to be significantly associated with expression of each of these six spots (for an example of spot 691, see Fig. 4A). One of these (rs4941543) is a non-synonymous single nucleotide polymorphism (NS-SNP) that causes a Lys to Glu alteration at position 533 in the LCP1 protein. To verify genetic association with this SNP, we determined r 2 for the 16 parental cell lines in addition to the eight children. Relative concordance was observed between the results of the parents alone and parents plus children (Table I, last two columns). In Fig. 4B, the LCP1 protein abundance from all 24 individuals in the population for all six spots clearly shows differential allele-specific expression patterns based on the presence of the Lys-coding allele. This substitution is predicted to shift the pI of LCP1 protein from 5.3 to 5.2. Such a shift is evident on the 2D DIGE gels and segregates with Lys-coding versus Glu-coding genotypes selected from the population (Fig. 4C).
Another of the proteins associating with a cis-acting QTL was Copine-1 (CPNE1). Expression levels of the CPNE1 protein (spot 780) associated with a cis-acting peQTL on chromosome 20 (rs3746410, p value ϭ 4.94eϪ09). This region continues for about 600 kb upstream through intron 1 of the CPNE1 transcript coding region (Fig. 4A) and coincides with the transcript eQTL identified for the CPNE1 transcript when microarray analysis of these same cell lines was performed by two independent groups (15,20). To determine the exact overlap between the peQTL and the eQTL, we obtained the raw transcript expression data for CPNE1 available from the GeneVar database (15,17) and repeated the cis association analysis using data from only the 16 parental lines used in our peQTL analysis. Fig. 5A shows the genomic region that is significantly associated with both protein (bottom panel) and transcript expression (top panel) of CPNE1 and indicates that the regions are identical. Correlation of both the protein expression and transcript expression with a representative SNP (rs3746410) is shown in Fig. 5, B and C.
Two spots (495 and 502) identified as hematopoietic cellspecific lyn substrate 1 (HCLS1) proteins are associated with SNPs located in cis to the HCLS1 gene. These spots were part of a four-spot cluster that included spots 517 and 490  Table 3. ( Fig. 6A). Spot 517 was also identified as HCLS1, and spot 490 remains technically unidentified. When expression values derived from spot 490 were used as a phenotype trait in association with cis-acting SNPs in the HCLS1 gene, the same set of SNPs was highly significant (e.g. rs1128163, p value ϭ 9.39eϪ09), suggesting that the protein represented by spot 490 is likely to be HCLS1. Additional SNPs in the 200-kb region surrounding the HCLS1 gene were significantly associated with expression (Fig. 6B) and ranged from the 5Ј-upstream region to the 3Ј-UTR. The correlation of SNP rs1128163 to protein expression level across the full 24 individuals in our population for spots 490, 495, and 502 (Fig. 6, C-E) supported the model. The exception was spot 517 (Fig.  6F). We also analyzed raw transcript data for HCLS1 transcript from analyses with these same 16 parental cell lines from the GeneVar database and did not find SNPs within the HCLS1 locus that were associated with transcript variation (rs1128163, p value ϭ 0.28). GWA Tests for peQTL-A full GWA analysis was then performed for each of the 544 analyzed spots regardless of whether the corresponding protein was identified. Each individual protein trait (i.e. spot) in the 16 parental cell lines was tested by linear regression against a total of 1.7 million SNPs available from the International HapMap Project. We applied the BH-FDR correction for multiple testing, setting a cutoff of 0.01. Because of the small sample size and potential for false positive results due to random association, we also considered only peQTL that were identified with three or more SNPs within a 100-kb region. To locate the QTL associated with each trait, the Ϫlog(p value) for association of each SNP with each protein trait was plotted according to the position of the SNP in each chromosome. As with the cis-only analysis, to verify the model, a representative SNP identified through association analysis with the 16 parents was plotted against protein expression from all 24 individuals, and a new Pearson's correlation coefficient was calculated. This resulted in the identification of eight proteins with associating QTL meeting GWA significance (Table II).
Several of the cis-acting QTL for LCP1, CPNE1, and HCLS1 that were identified through the cis-targeted analysis also met statistical significance in the GWA study. The standardized abundance of three spots, 688, 691, and 693, that were identified as the LCP1 protein showed a statistically significant association with rs4941543 (as described above) in addition to a second trans-acting QTL. This second strongly associating QTL region for all three of these spots was an intragenic region of chromosome 3 (p values Ͻ10eϪ08). However, when protein abundance from all 24 individuals in the population was plotted against genotype for the most significant SNPs in the chromosome 3 QTL region, the correlation was less apparent than when only the 16 parents were used (supplemental Fig. 1). The cis-acting QTL identified for spots 780 (CPNE1) and 490 (HCLS1 inferred) were also significant in the GWA analysis with no additional trans-acting QTL identified.
Two putative trans-acting peQTL were identified only within the GWA analysis (Table II). The expression level of vinculin (spot 175) was significantly associated with a trans-acting QTL located within the exocyst complex component 4 (EXOC4) gene and directly upstream of LRGUK, a leucine-rich repeats and guanylate kinase-containing gene on chromosome 7 (rs12539993, p ϭ 3.75eϪ10). The EXOC4 gene has been previously identified in a GWA analysis of arterial stiffness phenotypes within the Framingham Heart Study (51). However, the correlation between the genotype of a representative SNP (rs12539993) and vinculin protein expression with children included was weaker (parents r 2 ϭ 0.94 versus all r 2 ϭ 0.40; Table II).
The expression level of LCP1 protein in spot 675 was associated more strongly with a trans-acting allele on chromosome 14 than it was for the cis-acting allele identified in our cis-only results (compare the top spot 675 results in Tables I and II). At the chromosome 14 location, five SNPs located around or within the NCP2 gene were significantly associated (p ϭ 1.7eϪ08) with expression levels of spot 675. However, as was the case with vinculin, the correlation between genotype and expression values for spot 675 with children included was weaker than when considering the parents alone (parents r 2 ϭ 0.88 versus all r 2 ϭ 0.37; Table II and supplemental Fig. 2).
Spots 1415, 729, and 741 were associated with QTL on chromosomes 10, 11, and 2, respectively, but the proteins in these spots remain unidentified (Table II). The QTL associated with the expression level of spot 1415 lies entirely within the coding region of the MPP7 gene encoding a membrane-associated guanylate kinase, and the QTL associated with the expression level of spot 729 lies entirely within the gene coding for GUCY1A2, a guanylate cyclase protein. The region on chromosome 2 that is associated with expression levels of spot 741 is intergenic with no nearby gene features. Correlations between a representative genotype and protein expression levels among these unknowns are shown in supplemental Fig. 3. Consistent with what has been reported for base-line transcript expression in these cells, a proportion (10%) of the proteins exhibited strong correlation of expression with cell line PDT (Fig. 2) (48). Finally, using genetic association methods and the publicly available genotype data, we identified 24 protein forms for which Ͼ50% of the expression variation could be explained by genetic differences between cell lines.
We localized associating peQTL and, in some cases, the putative causative SNPs that contributed to variation in allelic protein expression levels.
Using LCLs, others have identified transcript QTL associating with drug response phenotypes (48,52,53). Those studies have also shown that a number of technical and biological variables, if not controlled for, could confound any attempts to identify a complex genetic contribution to pheno-  types, and this was most evident for growth characteristics of the individual cells lines. We addressed this issue by measuring the PDT of each of the 24 cell lines used in our study and correlated protein expression phenotypes with the results. Our analysis indicated that the proteins most strongly associated with PDT belonged to families of ATP-binding proteins and other nucleic acid-binding proteins, consistent with variation in the control cell cycles and metabolic rates. Our analysis also indicated that 10% of all proteins analyzed correlated with PDT. This is higher than was reported for transcripts (48), suggesting a stronger relationship of the proteome to metabolism and cell cycle dynamics than what is reflected at the transcript level. However, the total number of proteins strongly correlating to PDT (e.g. r 2 Ͼ 0.5) did not limit our ability to detect peQTL. We then performed an analysis of genetic determinants associated with individual protein expression levels. Because of our small sample size, we selected alleles with a minor allele frequency Ͼ10%. This resulted in a large number of SNPs being excluded from the analysis and ensured that those SNPs considered would be reasonably represented in our small population. We also applied a very stringent BH-FDR cutoff of 0.01 and included verification of the association using expression and genotype values for the eight children of the 16 parents used in the analysis. These criteria were applied in an effort to avoid false positives from random associations. In light of these stringent acceptance criteria, our results are not meant to be an exhaustive study of genetic elements driving proteome variation but should be taken as a proof of concept that proteomic variation is influenced by genetics, and the details of this influence can be uncovered using our approach.
We performed the genetic association analysis using two approaches: 1) a cis-only approach that used SNPs within a 200-kb region surrounding the genes that coded for each protein to identify cis-acting SNPs that influenced downstream protein expression and 2) a second genome-wide approach that used high density SNPs across the genome to identify SNPs throughout the genome that influenced protein expression. Of the 311 proteins and protein forms that were analyzed, 19 protein forms corresponding to 11 proteins showed significant association with a cis-acting peQTL when only cis-acting SNPs were considered (Table I). Using a genome-wide approach in which 1.7 million SNPs were regressed against expression of all proteins, we identified 10 protein forms corresponding to seven proteins showing association at the genome-wide significance level. For three of these protein forms, the cis-acting QTL identified in the cisonly analysis was the most predominant QTL when the whole genome was considered. Two trans-acting peQTL were identified in the initial GWA screen; however, genotype-phenotype correlation dropped when the eight additional data points from the children were added to the original 16 parental data points (Table II). The relationship of QTL to encoding gene for another three protein forms could not be determined because the proteins within these spots have not been successfully identified.
The selection of this population of cell lines was based on the availability of high density, genome-wide, validated SNP genotypes for all the cell lines through the International Hap-Map Project (54). Previously reported studies using these same cell lines have shown transcript variation to be genetically heritable, and many transcript eQTL have been identified using similar methods (15,20). Because of this, we expected some degree of overlap between previously reported eQTL results and ours. Indeed, we detected several SNPs that were significantly associated with CPNE1 protein levels that were identical to those previously identified as associating with CPNE1 mRNA levels (15,20). The peQTL for CPNE1 expression covered the region mainly 5Ј of the CPNE1 gene in both previous studies and in our results. This indicates that cisacting transcription regulation variation contributes most prominently to population-based variation in both CPNE1 transcript and protein expression levels and further validates our methodological approach.
Another goal of this work was to determine the suitability of quantitative proteomics methods for QTL discovery. Many quantitative proteomics approaches have been described that could be applicable in a broad scale peQTL effort. Some of the more popular quantitative proteomics approaches are based on digested proteins separated in one or two dimensions where quantitation is facilitated either by stable isotope labeling of the proteins/peptides, the use of retention time measurements coupled with peak area (labelfree), or spectral counting. The only peQTL approach described prior to this work that was unbiased toward any given protein was a single dimensional shotgun approach using label-free quantitation methods (34). This effort yielded the identification of 635 proteins and 154 QTL linkages in yeast (34,55). A more targeted approach to peQTL analysis could also prove fruitful because such an approach could provide information about specific proteins as opposed to proteins detectable by proteomics methods. In this regard, a major peQTL effort in humans wherein 42 serum proteins were targeted for analysis mainly used established ELISAs (33). Likewise, an MS-based single reaction monitoring approach could yield similar results. No single quantitative proteomics approach yields the ability to analyze all proteins simultaneously, and this is an overall limitation of all proteomics approaches when compared with transcriptomics in which microarrays allow a more controlled and comprehensive result.
Whereas the 2D DIGE approach is somewhat limited in proteome coverage compared with a 2D shotgun approach, the capacity to analyze expression of intact proteins has obvious advantages in an effort to define genetic components that lead to proteome variation. Multiple spots corresponding to a single protein often occur in 2D gels. These are a result of NS-SNPs or post-translational modifications, such as phosphorylation and glycosylation, which can affect protein pI and/or molecular weight. Therefore, quantitative intact proteomics afforded by the 2D DIGE method allows insight not only into overall protein expression variation but also variation in post-translational modifications such as phosphorylation and glycosylation that cause spot shifts on 2D gels. Whereas the depth of proteome coverage when using a 2D shotgun approach may be superior to 2D gel coverage, there are limitations in the capacity of shotgun approaches to detect post-translational modifications as separate protein entities. In this regard, shotgun methods cannot account for the full complexity of the proteome. Therefore, 2D DIGE and shotgun methods should be considered complementary methods in peQTL efforts.
We identified several representative examples in which NS-SNPs resulted in pI shifts evident in the 2D gel. For example, the positions of multiple spots associated with LCP1 protein within the 2D gel appear to be a result of a QTL located on chromosome 13. The QTL on chromosome 13 is in cis with the LCP1 gene, and the causative SNP is likely rs9491543, which codes for either a Glu (major allele) or a Lys (minor allele) amino acid at position 533 in LCP1 protein. Depending on which is present, the amino acid substitution changes the pI of the protein, which would be expected to cause a shift in protein migration on the 2D gel. Indeed, such a shift in protein migration is evident in images of the LCP1 spots grouped by genotype (Fig. 3B). LCP1 was identified in a total of six spots and is known to be multiply phosphorylated near its N terminus (Fig. 3C), although it is not currently known which spots correlate specifically to phosphorylated forms. Two other examples where likely NS-SNPs resulted in pI shifts include the CDNP2 (spots 1001 and 1020) and GLRX1 (spots 1505 and 1509) proteins (supplemental Fig. 4).
Another critical advantage of utilizing 2D DIGE is the capacity to quantitatively monitor protein modifications as distinct phenotypes. Several of the peQTL that were not associated with NS-SNPs included the four-spot train identified as HCLS1. Among these, expression levels of spots 490, 495, and 502 showed significant association with SNPs in the 200-kb region of the HCLS1 gene on chromosome 3. Association of these polymorphisms with spot 490, which is inferred to be HCLS1, also met genome-wide significance in our GWA analysis. Neither the fourth spot in this HCLS1 cluster, spot 517, nor the HCLS1 transcript significantly associated with polymorphisms in the HCLS1 region. Interestingly, a closer examination of GWA results from spot 502 revealed a region of suggestive trans-acting association located on chromosome 1 (p ϭ 5.17eϪ07, FDR ϭ 0.08) (supplemental Fig. 5). This peQTL had a lower p value with spot 502 than the best cis-acting peQTL (p ϭ 4.62eϪ05, FDR ϭ 0.33). The chromosome 1 peQTL is within the TNNI3K gene, which codes for a tyrosine kinase, although the HCLS1 protein is not known to be a direct target of TNNI3K-mediated phosphorylation. Col-lectively, these results imply that regulation of expression of these multiple forms of HCLS1 is complex, potentially due to differential regulation of post-translational modifications. This observation could lead to a model in which some of the HCLS1 forms are driven mostly by genetic factors within the HCLS1 gene, whereas other proteins modify the overall regulation. The fact that the HCLS1 mRNA was not associated with cis-acting polymorphisms suggests that the protein and mRNA regulation is uncoupled (15). Moreover, these results represent a good example of the increased depth of information gained by utilizing protein expression traits, which may be among the first indicators of the genetics underlying more complex molecular regulation. With increased sample sizes, the additional genetic or environmental factors underlying these more complex regulatory processes may become more apparent.
We also observed some inconsistencies with other reported peQTL discovery efforts, including the proportion of cis versus trans QTL identified as well as the proportion of associating SNPs falling in coding versus non-coding regions. Previously reported efforts to map genetic determinants of protein expression on a large scale have been performed for 665 mouse brain proteins from 200 animals using 2D gels (35), 42 human serum proteins from 1200 subjects using mostly ELISAs (33), and 221 peptides from 98 yeast segregants using single dimension LC-MS/MS (34). In the study of the mouse brain proteome, Klose et al. (35) evaluated shifts in brain protein spot patterns on 2D gels resulting from outcrossing two divergent mouse species and identified 688 spots that showed a shift in electrophoretic mobility. Among these, however, only 41 spots were shown to correlate with genetic heterogeneity in backcross data. Most of these did not map to the location of the protein-coding gene, suggesting that trans-acting alleles accounted for the majority of the electrophoretic mobility variation. In the study of yeast proteomic variation, Foss et al. (34) monitored 221 peptides in replicates of 98 segregants of a cross between two laboratory strains of yeast using single dimension, high mass accuracy LC-MS/MS. These results were compared with transcriptome data derived from the same samples. When compared with each other, loci were detected for 56% of transcripts but only 38% of peptides, and there was little overlap between them. Linkage for peptide abundance was more likely to be trans-acting than what was observed for transcripts. Conversely, in the study by Melzer et al. (33) that focused on serum levels of 42 proteins, eight cis-acting loci were identified, and only one trans-acting locus was identified. Similar to Melzer et al. (33), in our results, we detected mostly cis-acting genetic elements that could be verified as associating with protein expression variation.
Another inconsistency between our peQTL discovery efforts versus previous eQTL or peQTL discovery efforts is the proportion of coding versus non-coding SNPs discovered. Previously reported eQTL studies revealed that the majority of genome variation associated with transcript expression lies within the regulatory regions of genes rather than coding NS-SNPs (14,15,19). This result is consistent with the observation that coding NS-SNPs account for very few disease associations and suggests that regulation of expression is an important factor in disease development (56). Similarly, according to Melzer et al. (33), of the nine identified SNPs associating with individual serum protein levels, none were found in the coding sequence of the relevant nearest gene (33). Our peQTL discovery efforts, however, resulted in an even split between the association of coding NS-SNPs and non-coding SNPs with protein expression variation. The 2D DIGE approach is highly sensitive to NS-SNPs that cause electrophoretic shifts due to protein pI changes. Thus, the choice of using 2D DIGE, combined with the higher density in SNP genotypes in our study, provided a greater power to detect associating NS-SNPs than did other proteomics methods. To determine how well we were able to detect NS-SNPs in our data, we determined how many identified spots potentially contain NS-SNPs and compared this with the number we actually identified through QTL analysis. Using the dbSNP database (NCBI, build 130) as a resource and based on whether the NS-SNP affected a charged or polar amino acid, we predicted that there are 81 proteins in our data that contained a total of 115 NS-SNPs genotyped in the HapMap project that could produce a shift in protein migration and therefore be detectable by 2D DIGE. However, of these NS-SNPs, only 19 were present and variable in our population at a minor allele frequency Ͼ10%. We identified four proteins that were associated with NS-SNPs through peQTL analysis. Therefore, the false negative rate was estimated at (19 Ϫ 4)/19 ϭ 78%. Considering all possible NS-SNPs, even ones that produce only neutrally charged amino acid shifts, we expected 43 among our 311 identified proteins leading to a false negative rate of 93%. Therefore, although our analysis showed some bias toward NS-SNPs, we were not capable of comprehensively identifying all NS-SNPs in the population.
Collectively, the differences in our results compared with those previously reported may be a reflection of the smaller sample size in our study. In this regard, the smaller sample size (e.g. 16 here versus 60 -270 in previous studies) effectively reduced the power to detect more complex regulation due to trans-acting factors and increased bias toward identification of NS-SNPs where the allelic contribution to phenotype is expected to be high.
In summary, because of the generation of high density human SNP maps and genome-wide sequencing efforts in the past decade, many genetic determinants of complex phenotypes and diseases have been identified (50,54,(57)(58)(59). Although there has been much attention paid to the role of genetics in transcript expression variation, very few large scale efforts have attempted the same for protein expression variation. The most successful approach for peQTL discovery is one that allows for both a high degree of technical reproducibility and the opportunity to capture proteomic variation due to alternative splicing and post-translational modifications. Our results demonstrate that 2D DIGE has both the sensitivity and specificity to be used for understanding the genetic underpinnings of proteome variation.