Insights into impact of DNA copy number alteration and methylation on the proteogenomic landscape of human ovarian cancer via a multi-omics integrative analysis

In this work, we propose iProFun, an integrative analysis tool to screen for Proteogenomic Functional traits perturbed by DNA copy number alterations (CNA) and DNA methylations. The goal is to characterize functional consequences of DNA copy number and methylation alterations in tumors and to facilitate screening for cancer drivers contributing to tumor initiation and progression. Specifically, we consider three functional molecular quantitative traits: mRNA expression levels, global protein abundances, and phosphoprotein abundances. We aim to identify those genes whose CNAs and/or DNA methylations have cis-associations with either some or all three types of molecular traits. In comparison with analyzing each molecular trait separately, the joint modeling of multi-omics data enjoys several benefits: iProFun experienced enhanced power for detecting significant cis-associations shared across different omics data types; and it also achieved better accuracy in inferring cis-associations unique to certain type(s) of molecular trait(s). For example, unique associations of CNA/methylations to global/phospho protein abundances may imply post-translational regulations. We applied iProFun to ovarian high-grade serous carcinoma tumor data from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium, and identified CNAs and methylations of 500 and 122 genes, respectively, affecting the cis-functional molecular quantitative traits of the corresponding genes. We observed substantial power gain via the joint analysis of iProFun. For example, iProFun identified 130 genes whose CNAs were associated with phosphoprotein abundances by leveraging mRNA expression levels and global protein abundances. By comparison, analyses based on phosphoprotein data alone identified none. A group of these 130 genes clustered in a small region on Chromosome 14q, harboring the known oncogene, AKT1. In addition, iProFun identified one gene, CANX, whose DNA methylation has a cis-association with its global protein abundances but not its mRNA expression levels These and other genes identified by iProFun could serve as potential drug targets for ovarian cancer.


In Brief statement
To elucidate the proteogenomic functional consequences of DNA copy number and methylation alterations, an integrative analysis tool iProFun was proposed. Compared to conventional approaches, iProFun achieves enhanced power and accuracy. When applied to TCGA-CPTAC ovarian cancer data, iProFun identified a collection of genes whose CNAs and/or DNA methylations influence either some or all three types of molecular traits (mRNA level, global protein, and phosphoprotein abundances). Results from this analysis may help nominate or prioritize candidate drug targets for ovarian patients.

Graphical Abstract
In this work, we propose iProFun, an integrative analysis tool to screen for proteogenomic functional traits perturbed by DNA copy number alterations (CNAs) and DNA methylations. The goal is to characterize functional consequences of DNA copy number and methylation alterations in tumors and to facilitate screening for cancer drivers contributing to tumor initiation and progression. Specifically, we consider three functional molecular quantitative traits: mRNA expression levels, global protein abundances, and phosphoprotein abundances. We aim to identify those genes whose CNAs and/or DNA methylations have cis-associations with either some or all three types of molecular traits. Compared with analyzing each molecular trait separately, the joint modeling of multiomics data enjoys several benefits: iProFun experienced enhanced power for detecting significant cis-associations shared across different omics data types, and it also achieved better accuracy in inferring cis-associations unique to certain type(s) of molecular trait(s). For example, unique associations of CNAs/methylations to global/ phospho protein abundances may imply posttranslational regulations.
We applied iProFun to ovarian high-grade serous carcinoma tumor data from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium and identified CNAs and methylations of 500 and 121 genes, respectively, affecting the cis-functional molecular quantitative traits of the corresponding genes. We observed substantial power gain via the joint analysis of iProFun. For example, iProFun identified 117 genes whose CNAs were associated with phosphoprotein abundances by leveraging mRNA expression levels and global protein abundances. By comparison, analyses based on phosphoprotein data alone identified none. A network analysis of these 117 genes revealed the known oncogene AKT1 as a key hub node interacting with many of the rest. In addition, iProFun identified one gene, BIN2, whose DNA methylation has cis-associations with its mRNA expression, global protein, and phosphoprotein abundances. These and other genes identified by iProFun could serve as potential drug targets for ovarian cancer. The initiation, progression, and metastasis of cancer often results from accumulation of DNA-level variations, such as DNA copy number alterations (CNAs) 1 and epigenetic modifications. CNAs involve gains or losses of a large region of tumor DNA that could result in activation of oncogenes or inactivation of tumor suppressors (1)(2)(3). Hypermethylation of CpG islands often results in silencing the expression of DNA repair genes when occurring in the promoter regions and activating oncogenes if in the coding regions (4 -8). Most major cancer types have been systematically profiled for copy number and CpG methylation, and as a result, many CNAs and DNA methylations have been identified and been associated to carcinogenesis and cancer progression (9 -14). However, it remains challenging to pinpoint diagnostic, prognostic, and therapeutic targets from this long list of cancerassociated genes. In particular, it is important to distinguish the driver genes that contribute to oncogenesis and cancer progression from the passengers acquired by random alterations during cancer evolution (2) and changes in gene activities that are the consequences, not causes, of cancer. To address this challenge, previous studies have primarily focused on associating CNAs and DNA methylations to their cis (i.e. local) gene expression levels (15), a form of molecular quantitative trait (QT) that is relatively easier to measure than protein abundances. Significant associations of CNAs and methylations with cis-mRNA expression levels partially reveal the molecular mechanisms of cancer-associated genes.
In addition to mRNA expression levels, it is of great interest to characterize the functional consequences of CNAs and DNA methylations on protein abundances. Global proteins and phosphoproteins are key molecules that carry out most cellular functions and are essential to cancer initiation, tumor progression, and response to therapy. The observed median correlations between mRNA gene expressions and global proteomic abundances, when measured and quantified in tumor tissues, are 0.45, 0.39, and 0.47 in ovarian (16), breast (17), and colorectal tumors (18), respectively. In addition to global protein abundances, phosphorylation often occurs on multiple distinct sites of a given protein, facilitating complex, multilevel regulation that is not reflected at the mRNA expression levels. By investigating the effects of CNAs (19) and separately the effects of DNA methylations (20) on global and phospho proteomic changes, novel insights have been obtained. However, due to a lack of proper analytic tools, few studies have employed an integrative approach to evaluate the impact of CNAs and DNA methylation on multilevel molecular QTs in tumor genomes from a systematic perspective.
Motivated by these challenges and needs, in this work we propose to conduct integrative analysis of multiple types of omics data in order to achieve a systematic and comprehensive understanding of the functional mechanisms of DNAlevel alterations in tumors. We propose a novel integrative analysis tool to screen for proteogenomic functional traits (iProFun) altered by CNAs and DNA methylations. Specifically, we are interested in (1) detecting genes with "cascading effects" on downstream molecular traits, i.e. a gene's DNA alterations have associations with its cis mRNA expression levels, and global and phospho protein abundances and (2) identifying associations unique to certain type(s) of molecular trait(s), in particular unique to global/phospho protein levels.
Several mechanisms could result in association patterns of DNA alterations unique to the protein levels but that may not be reflected on mRNA levels. First, specific proteins may be more stable molecules and have longer half-lives than their corresponding RNAs. For example, the median mRNA half-life is ϳ10 h in human cells (21), whereas certain proteins (e.g. histone proteins) degrade over months (22). Thus, in some instances protein abundances may more faithfully reflect accumulated transcription/translation activities in cells during tumor initiation and development and thus better capture the perturbations due to DNA alterations such as CNAs. Second, some DNA alterations (e.g. methylations or mutations) may result in amino acid sequence changes that alter proteinfolding structures and then the degradation speeds of corresponding protein molecules. On the other hand, the matching changes on mRNA sequences may have quite limited impact on stability of mRNA molecules. Thus, associations between such DNA alteration events and protein abundances without corresponding RNA data support would be observed.
Despite the urgent needs for integrative analysis methods and tools in biomedical research, the integration of data from multiple data types imposes tremendous statistical challenges, such as high dimensionality, complex gene-gene correlations, different scales and distributions among different types of omics data, and complete or partial overlapping of samples across platforms/data types (23,24). To address these challenges, the iProFun method takes as input genomewide summary statistics in assessing associations of CNAs and DNA methylations on each type of molecular trait and allows genes and molecular QTs to be arbitrarily correlated in the joint analysis. The iProFun method estimates the conditional density of each type of trait separately, allowing different scales and distributions among different data types. And, it also allows for sample correlations due to complete or partial overlapping of samples. Compared with the separate analyses of CNA and then methylation on each molecular trait, iProFun leverages data from multiple sources and borrows information across data types. By imposing rigorous assessment of false discovery rates (FDR), we have demonstrated that iProFun is able to largely boost power and maintain low FDR in identifying various types of cis-associations, in particular in the data types with relatively low sample sizes.
We applied iProFun to the high-grade serous ovarian carcinoma (HGSOC) data from the cancer genome atlas (TCGA) and the genome-wide proteomic data measured in clinical proteomic tumor analysis consortium (CPTAC). HGSOC is the leading cause of gynecologic cancer death in the United States (25) for which most women will present with advancedstage disease and ultimately die of their disease within five years (26). Thus, new treatments and an improved understanding of the biological basis of this cancer are desperately required. Using iProFun, we identified a collection of genes whose molecular functional traits at transcriptomic, proteomic, and/or phosphoproteomic levels were altered by somatic CNAs and DNA methylations. Some candidates in this list could serve as potential drug targets.

TCGA-CPTAC Ovarian Cancer Data-
The tumor sample data we analyzed were from 570 adults with HGSOC collected by TCGA. The proteomic and phosphoproteomic data were obtained from the CPTAC Data Portal and processed by the common data analysis pipeline from CPTAC. Both proteome and phosphoproteome data were acquired using iTRAQ (isobaric tags for relative and absolute quantification) protein quantification methods (16). Proteome data were from among 206 samples of 174 unique patients (84 from Pacific Northwest National Laboratory, 122 from Johns Hopkins University, and 32 measured by both centers). Phosphoproteome of 69 patients were measured exclusively by Pacific Northwest National Laboratory. A total of 7,061 global proteins and 10,057 phosphosites from 2,865 phosphoproteins with high quality were considered for analysis. The somatic CNA and mRNA data from the microarray platforms were directly downloaded from CPTAC publication (16), which were summarized by gene in their pipeline. The DNA methylation data measured on the Illumina 27K platform were downloaded from the TCGA Firehose pipeline processed in July 2016 at the Broad Institute (http:// gdac.broadinstitute.org/), with each methylation site taking beta values ʦ [0, 1] with 0 being unmethylated and 1 being fully methylated. Finally, the germline genotyping data were obtained from National Cancer Institute's Genomic Data Commons (27). The mRNA expression levels of 15,121 genes were measured on 569 samples, CNAs of 11,859 genes were measured in 559 samples, and DNA methylations for a total of 25,762 methylation sites from 14,269 genes were measured in 550 samples. More information about the samples and the associated metadata is available online (16,28).
Integrative Analysis Pipeline- Fig. 1 illustrates the integrative analysis pipeline-iProFun-for revealing dynamic cis regulatory patterns in tumors. Briefly, iProFun takes as input the association summary statistics from associating CNAs and methylations of genes to each type of cis-molecular trait, aiming to detect the joint associations of DNA variations and molecular traits in various association patterns. Of particular interest are the genes with "cascading effects" on all cis molecular traits of interest and the genes whose functional regulations are unique at global/phospho protein levels. iProFun can incorporate prior biological knowledge through a filtering procedure and can identify significant genes with calculated posterior probabilities exceeding a threshold while assessing the empirical FDR (eFDR) through permutation. Downstream enrichment analyses were also embedded into our pipeline to allow for more direct interpretations of different association patterns.
Preprocessing-All of the five data types were further preprocessed to allow integrative analysis. Specifically, the HGSOC proteomic data were generated at two independent CPTAC centers at Johns Hopkins University and Pacific Northwest National Laboratory, making it susceptible to center-level batch effect. We corrected center effects by analyzing the 32 overlapping samples processed in both centers, using linear normalization to match measured protein abundances. Then, we filtered genes with missing rate Ն 50% for each of the molecular traits. The methylation data generated using the Illumina 27K platform support the existence of only one or a few methylation sites per gene. Collinearity among sites were low, and thus they were modeled simultaneously in a gene-level regression model (as discussed herein) without concern. In cases where new methylation technologies are applied, such as generating over 450,000 methylation sites, multicollinearity is likely to occur. In this case, we propose summarizing high-dimensional methylation predictors into gene level prior to considering them in the regression.
To account for potential population structures and other major unmeasured confounding factors, when obtaining the summary statistics for associations, we adjusted for the top principal components calculated based on germline genotype data, using principal component analysis provided in PLINK 1.9 (29). Blood-derived DNA samples were used as the primary source of germline genotype data, with solid normal tissue samples used as a surrogate for subjects who were missing blood-derived DNA samples. We restricted the principal component analysis to bi-allelic variants on autosomes that met the following criteria: minor allele frequency Ն 0.05; Hardy-Weinberg equilibrium p value Ն 0.0001; and pairwise linkage disequilibrium r 2 Յ 0.2. Finally, we restricted our analysis to genes with quantitative measurements across all five data types (CNA, methylation, mRNA expression, global protein, and phosphoprotein) to understand their joint association patterns. A total of 676 genes were analyzed in the following steps.
Gene-level Multiple Linear Regression to Obtain Summary Statistics-Consider a total of G genes that passed the preprocessing procedures in all of the five data types; n 1 samples have measurements of mRNA, CNA and methylation, n 2 samples have measurements of global protein, CNA and methylation, n 3 samples have measurements of phosphoprotein, CNA and methylation. We use the following regression models for each type of molecular trait of interest: where i ʦ (1, . . . , n 1 ), j ʦ (1, . . . , n 2 ), and v ʦ (1, . . . , n 3 ) are sample indices, g ʦ (1, . . . , G) is the index for genes, and cov's are the sets of covariates adjusted in the regression analyses. The mRNA, global protein, and CNA have been summarized at the gene level prior to the analysis, and as such, only one variable for each data type is included. Multiple methylation sites may exist for a gene. We use m g to denote the number of methylation sites for gene g and jointly consider all m g methylation sites in the regression. Also, multiple phosphosites might exist for a gene. We conducted a prescreening analysis to select the representing phosphosite from all measured sites of a gene by ANOVA procedure. More specifically, we selected the phosphosite that minimizes the p value from ANOVA F-test by testing the null model (phospho vg ϭ ␥ 0g ϩ ␥ 3g cov v ϩ e vg ) versus the alternative model (phospho vg ϭ ␥ 0g ϩ ␥ 1g CNA vg ϩ ␥ 2g 1 methy v1 ϩ · · · ϩ ␥ 2g mg methy vmg ϩ ␥ 3g cov v ϩ e vg ). In the above sets of regressions, we adjust for the top three genotype-generated principal components for all three regression models. We adjust for only three principal components based on a separate investigation in which we compare the regression results adjusting for up to 15 principal components (see Supplemental Fig. S1).
We use sets of separate regressions in Equation (1) in the integrative analysis pipeline to allow for different samples being measured for different sets of molecular features. In comparison, a joint analysis of data from all five types may have only a limited number of samples with complete measurements on all five data types, while separate analyses of data from subsets of platforms without integration would ignore the potential correlations and connections among different types of genomic features. From this perspective, iProFun is especially appealing in boosting study power for associations with proteomic features because protein and phosphoprotein data are often measured on much smaller sets of samples than that of mRNA data, and as such, power to detect associations with proteins is also much lower. Moreover, the integration of summary statistics in iProFun also allows one to take advantage of existing summary statistics when raw data are not available.
Detecting Joint Associations of DNA Alterations with Multi-omics Traits-With the association summary statistics obtained from Equation (1), we apply an integrative analysis method-Primo-to detect joint associations of DNA variation with multi-omics traits (30). The Primo algorithm is briefly summarized below. Consider a total of G genes, and J sets of summary statistics in assessing the associations of CNAs on J-types of cis-molecular traits of interest (here J ϭ 3) in Equation (1). Let T denote a G ϫ J matrix of t-statistics obtained from regression. For any of the J quantitative traits, a gene can be either associated or not associated with the trait, i.e. its association status is binary. With J ϭ 3 omics traits, there are a total of K ϭ 2 J ϭ 8 possible association patterns for each CNA. Fig.  1 shows the list of all association patterns, as well as the interpretation for each association pattern. For example, ␣ 1g ϭ ␤ 1g ϭ ␥ 1g ϭ 0 denotes that the CNA from g-th gene has no association with any of the three types of traits, while ␣ 1g ϭ ␤ 1g ϭ 0; ␥ 1g 0 and ␣ 1g 0; ␤ 1g 0; ␥ 1g 0 denote that the CNAs that have effects on only phosphoproteins and have cascade effects on all of the three traits, respectively.
For each CNA, there must be one and only one true association pattern. Let k denote the "frequency" of CNAs in the k-th pattern across all genes in the current data. Then the probability of CNA i following pattern k given the t-statistics is given by where D k (⅐) is the conditional joint density function of the set of t-statistics of CNA i on J traits, conditioning on the k-th association pattern. In this equation, both D k (T i ) and k are unknown and need to be estimated. In estimating the conditional joint density D k (T i ), we assume that the t-statistics from different data types are independent, that is, where f j 0 (⅐) and f j 1 (⅐) represent the marginal null and alternative density functions for the association statistics to the j-th trait, respectively. The q kj is an 0 or 1 index to denote the association status of the k-th pattern in data type j. For example, the second pattern (k ϭ 2) (CNA impacts mRNA only) in Fig. 1 This does not assume statistics from different traits to be marginally independent; instead, we use k to capture both biological correlations and sample correlations among molecular traits. When there are overlapping samples from different omics data types, the t-statistics from different data types could be correlated. In the next subsection, we will further propose a permutation strategy to calculate FDR while accounting for potential sample correlation.
We applied the R limma approach (32,33) to estimate the null and alternative density functions f's for the J sets of association statistics. Limma pools information across all genes within each set of statistics and estimates the empirical nulls as t-distributions and the empirical alternatives as scaled t-distributions with an estimated scaling parameter. In applying limma, one needs to specify a priori the estimated proportion of nonnull statistics (i.e. statistics with nonzero associations) in each set of summary statistics. Those nonnull proportion estimates can be obtained from the literature or estimated based on the data. Primo is insensitive to underspecification of the parameters within a reasonable range. In this analysis, we specify that 5% (which is an underestimation) of association statistics are from the alternative for the effects of CNA on each of the three traits.
Once we obtain the estimated D k (T), we can estimate k 's for all association patterns in the current data using the expectation-maximization algorithm (34). When considering the association patterns as mutually exclusive "clusters," the conditional joint density can be viewed as the "class centroid." The estimation of the posterior prob-ability of a CNA following a particular association pattern assesses the relative strength of that pattern compared with other patterns.
Separately, we applied the Primo method to analyze the effects of methylation sites on three types of traits, with similar parameter specifications, and obtained the results for the effects of methylations on cis-molecular traits.
False Discovery Rate Determination-In addition to calculating posterior probability for each association pattern for each gene, we proposed to also calculate the empirical FDR based on permutations. The empirical FDR can serve as an additional significance measure in accounting for the data-dependent correlation structures and sample overlapping among different data types.
To calculate the empirical FDR, we first calculated the posterior probability of a predictor being associated with an outcome by summing over all patterns that are consistent with the association of interest. For example, the posterior probability of a CNA being associated with its cis mRNA expression level was obtained by summing up the posterior probabilities in the following four association patterns-"CNA affecting mRNA only," "CNA affecting both mRNA & global protein," "CNA affecting both mRNA & phophoprotein," and "CNA affecting all three traits," all of which were consistent with CNA being associated with mRNA expression. Then, we permuted our sample m times (e.g. m ϭ 100) to recalculate the summary statistics and use them for calculating empirical FDRs. More specifically, for each molecular trait, we randomly permute the sample label of the trait while keeping the labels of the other two traits. Then, we reran gene-level multiple regressions and Primo analysis and calculated the posterior probability of a predictor being associated with that trait based on the null datasets. Then, for a prespecified posterior probability cutoff value ␣, a gene was considered positive if its posterior probability is Ͼ ␣. We calculated empirical FDR as eFDR ϭ Averaged No. of genes with posterior probability Ͼ ␣ in the permuted datasets No. of genes with posterior probability Ͼ ␣ in original dataset for a prespecified ␣ value. We consider a grid of ␣ values (e.g. among 75% Ͻ ␣ Ͻ 100%). We consider a minimum of 75% posterior probability and an averaged empirical FDR Ͻ10% as the joint significance criteria in declaring significant associations. Additionally, we added the following filtering procedure to incorporate prior biological knowledge. Literature has suggested that CNA amplifications have often been associated with increased molecular quantities, and CNA deletions have been linked with decreased molecular quantities (17,18,28). Meanwhile, variations in methylation might have associations in both directions. For example, hypermethylation in promoter regions may silence DNA repair genes (negative association), but hypermethylation in coding regions may activate oncogenes (positive association). Based on the criteria, only genes with association directions matching biological knowledge (all positive for CNA and same direction for methylation across all traits) were retained and further assessed by empirical FDR.
Empirical Simulation Assessment-To evaluate the performance of iProFun on error control and power with different sample sizes, we performed simulation assessment based on ovarian cancer data and compared the results with an alternative approach that conducted separate regression for each molecular quantitative trait. Specifically, we generated three omic outcomes using real CNA samples for the same number of genes as investigated in ovarian cancer data. In real data, the number of significant CNAs are 500, 457, and 117, and nonsignificant CNAs are 77, 219, and 559, respectively for RNA, protein, and phosphoprotein. For each CNA that was not significantly associated with a molecular trait, we simulated the effect size (coefficient) as zero and variance of regression error as the estimated value

Molecular & Cellular Proteomics 18.14 S55
in data. For each CNA that was significantly identified, the effect size was fixed as half of the estimated coefficient, and then data were simulated centered on this effect size, using the observed variance of the regression error. We assessed the powers as well as true FDRs for sample sizes of 50, 100, and 150, respectively. For each of the sample size cohorts, we simulated 100 datasets. We applied both iProFun and linear regression to test the associations between CNA and those three omic outcomes using t-statistics. With FDR ϭ 0.1, we assessed the powers as well as true FDRs for each sample size. Subset Analysis-We further conducted threefold subset analysis with a random split of the samples to evaluate the replication rates and stability of identified cis associations. Specifically, stratified by the sample availability of each omic data, we randomly split the ovarian cancer data into three subsets, with each having ϳ190 mRNA, 58 global protein, and 23 phosphoprotein samples. We applied iProFun to two subsets of the data for identifications and repeated this procedure three times. We calculated the threefold crossvalidation consensus association sets to obtain the CNAs and DNA methylations that were identified in all three subsets of data. CNAs and DNA methylations in the consensus sets can be stably identified with different and smaller numbers of samples.
Concordant Analysis with TCGA-CPTAC Breast Cancer-It has been independently reported that ovarian and breast cancers can share genetic etiologies and driver mechanisms (35)(36)(37), and a pancancer study on TCGA gynecologic and breast cancers identified shared CNAs across these cancer types, more prevalent than shared DNA methylations profiles (38). With limited multi-omic proteogenomic ovarian cancer data available, we investigated the concordance of CNAs identifications among ovarian and breast cancers as a way to broadly validate the iProFun discoveries. The underlying rationale is that shared etiology often results in CNAs with similar impact on molecular traits, and thus iProFun discoveries are more likely to be valid if they are more concordant than would be expected (random associations) between ovarian and breast cancers.
Specifically, we considered 48 breast cancer samples drawn from TCGA-CPTAC consortium with -omic quantification on the same platforms as ovarian cancer. We conducted the same preprocessing procedure as ovarian cancer and applied iProFun to the samples and identified CNAs and DNA methylations associated with mRNAs, global proteins, and phosphoproteins at FDR Ͻ 10%. We calculated the percentage of breast cancer identified CNAs that were also significant in the ovarian cancer analysis and compared it with the percentage of breast cancer identified CNAs that were nonsignificant in the ovarian cancer analysis. We also investigated CNA-protein associations that were identified by iProFun but were not identified in separate analysis in ovarian cancer to determine if they were coidentified in breast cancer.

Landscape of CNA and DNA Methylation Association Patterns on Functional Molecular
Quantitative Traits-To characterize the impact of CNAs and DNA methylations on mRNA, protein and phosphoprotein abundances in ovarian tumors, we applied iProFun on TCGA and CPTAC data as described in "Methods." The posterior probabilities of eight cis-regulation association patterns (Fig. 1) of each CNA and CpG site methylation were calculated (Supplemental Tables S1 and S2), and the average posterior probabilities across all CNAs and methylation sites are shown in Fig. 2(A). Specifically, the averaged posterior probability is 45.3% for the CNAs to be from the all three cis-regulation association patterns, i.e. CNAs are associated with all three cis-QTs: mRNA expression level, global protein, and phosphoprotein abundances. The similarly common association pattern for CNAs is "mRNA and

Proteomic and Genomic Integrative Analysis Via iProFun
S56 global" (45.1%), with "mRNA only" the third most common category (7.9%). Only 1.6% of CNAs are estimated to be from the "none" association pattern, i.e. no association between CNAs and any of the three cis-QTs. On the other hand, for 1,103 DNA methylation sites of the 676 genes, the averaged posterior probability is 3.3% for the "all-three" association pattern, while 59.5% of posterior probability is from the "none" association pattern. Overall, there is a 36.3% probability for methylation sites to have a cis-regulation with transcripts (18.2% "mRNA only," 16.9% "mRNA and global," and 1.2% "mRNA and phospho"), and 1% for unique cis-association with protein levels (0.3% "global only," 0.4% "phoshpo only," and 0.4% "global and phospho"). It is clear that the effects of CNAs on cis-molecular QTs are much stronger than those of DNA methylations. Enrichment analyses based on posterior probabilities identified that chromosome arms 3p, 8q, and 10q are enriched with cascade genes with CNA associations to all three cis-molecular traits (Supplemental Fig. S2).
After further deriving the empirical FDRs for all cis-associations based on permutation tests (see "Methods"), we selected significant association pairs by requiring (1) empirical FDRs Ͻ 10%, (2) posterior probabilities Ͼ 75%, and (3) positive associations for CNAs with all three QTs and associations with consistent directions for methylations with all three QTs. Fig. 2(B) displays the Venn diagrams of the numbers of genes whose CNAs and/or DNA methylations were significantly associated with their cis-mRNA levels and global or phosphoprotein abundances. A full list of CNAs and DNA methylation sites with significant cis-associations is provided (Supplemental Tables S1 and S2). The Venn diagram of CNAs is comprised of concentric circles, i.e. all CNAs associated with their phosphoproteins were also associated with their global proteins, and all CNAs associated with global proteins were also associated with their mRNA expression levels. Specifically, out of the 676 gene-level CNAs in our analysis, 117 CNAs were identified as cascade CNAs, i.e. the CNA demon-strates significant cis association with all of the three traits (mRNA levels and protein and phosphoprotein abundances); 340 were associated with mRNA levels and global protein abundances but not phosphoprotein abundances; and 43 CNAs were associated with mRNA levels only. The Venn diagram of methylations presented more-complex association patterns. One gene (1 site) has a cascade methylation effect, 27 genes (27 sites) have mRNA and global protein effects, 2 genes (2 sites) have mRNA and phosphoprotein effects, 90 genes (94 sites) have mRNA only effects, and 1 gene (1 site) has global protein-only effects.
Highlighting Key CNAs and DNA Methylations with Biologically Interesting Association Patterns-Network analysis of these 117 cascade CNAs using the STRING V10 database (39) (Supplemental Fig. S5) revealed AKT1 as a key hub node interacting with many other cascade CNAs. The associations between CNA and mRNA expression levels and global protein abundances of CNA were observed in threefold cross-validation consensus set (supplemental Table S5). The gene AKT1, located on 14q32.33, is an important effector of the PI3K/RAS pathway. The methylation site cg10590292 in BIN2 has a cascade methylation effect. The methylation of BIN2 survived the stability check (consensus set) for its association with mRNA expression levels and global protein abundances. The gene BIN2, located on 12q13.13, is related to the innate immune system pathway. In addition, the methylation site cg13859478 in CANX was associated with its global protein abundances but not with its mRNA expression levels; two sites cg25416363 and cg06043190 in RBM15 and EML4, respectively, were significantly associated with their mRNA expression levels and phosphoprotein abundances but not with their global protein levels. Cross-referencing CNAs and methylations, three genes-CDH6, MAP2, and KRT8-had cascade CNA cis-associations and were associated with global or phosphoproteins with methylation results. Detailed information of all five omics data across the 69 samples for

Proteomic and Genomic Integrative Analysis Via iProFun
Molecular & Cellular Proteomics 18.14 S57 these key genes (AKT1, BIN2, CANX, RBM15, EML4 CDH6, MAP2, and KRT8) are presented in Fig. 3. Comparison with a Conventional Method in Real Data and Simulations-We next compared the power and false discovery control of iProFun to existing approaches. iProFun employs empirical assessment of FDR instead of using commonly used FDR control procedures for the following reasons: the q-value procedure (40) and Benjamini

Proteomic and Genomic Integrative Analysis Via iProFun
Molecular & Cellular Proteomics 18.14 S59 genes, none of them holds valid FDR control (Supplemental Fig. S6). Therefore, empirical assessment through permutation provides more accurate assessment and should be applied for error control.
With the empirical error control, Fig. 5(A) plots the number of genes discovered under iProFun and a conventional approach that separately considers each molecular quantitative trait in the analysis. The conventional approach only uses samples that contain the molecular quantitative trait of interest for regression, adjusting for the same covariates as our integrative analysis. Using the same FDR criteria (empirical mean FDR Ͻ 10%), iProFun identifies many more genes than separate analyses by borrowing information across data types. We observed higher power gain in the data types with smaller sample sizes, such as protein and phosphoprotein abundances. We observed substantial power improvement in identifying global protein QT CNA (pQTC) (from 430 to 497 genes), phosphoprotein QT CNA (phQTC) (from 0 to 155 genes), global protein QT methylation (pQTM) (from 0 to 71), and phosphoprotein QT methylation (phQTM) (from 0 to 10 genes), as well as marginal power gain in identifying mRNA expression QT CNA (eQTC) (from 498 to 501) and mRNA expression QT methylation (eQTM) (from 204 to 240). Our integrative analysis pipeline greatly boosts study power for gene identification, especially for the data types with small initial sample sizes.
We further demonstrated the power comparison based on empirical simulations using ovarian cancer data. Both iProFun and separate analyses preserve well-controlled false positive rates, as the simulations allow independence between different genes. However, iProFun is more powerful than separate analyses regardless of the sample sizes, effect sizes, and platforms under consideration (Fig. 5(B)). The power gain ranges from 1% to 94%, with an average 21% power increase.
Concordance with TCGA-CPTAC Breast Cancer-We further applied iProFun to proteogenomics data from the TCGA-CPTAC breast cancer study (17). We focus on 48 samples and 2,212 genes with all CNA, methylation, mRNA, protein, and phosphoprotein measurements in the TCGA-CPTAC breast cancer datasets. Among the 2,212 genes, 593 overlapped with genes considered in the previous ovarian cancer data analysis. For these 593 genes, 526 eQTCs, 445 pQTCs, and 384 phQTCs were identified at FDR Ͻ 10% (Supplemental  Tables S3 and S4). As demonstrated in Fig. 6, among 447 genes with eQTCs in ovarian cancer, 92% were also identified to have eQTCs in breast cancer. On the other hand, the replication rate was significantly lower (79%) for the 146 non-FIG. 6. Concordance of breast cancer CNA identifications among ovarian cancer identified CNAs. It is clear that a CNA is more likely to be identified in breast cancer to be associated with mRNA (p ϭ 0.0001) and global protein (p ϭ 0.0004) if it is already identified by ovarian cancer. eQTCs in ovarian cancer, generating an odds ratio (OR) of 2.9 (p ϭ 0.0001). Similarly, the percentage of breast cancer pQTC genes was 79% and 66%, respectively, among 410 ovarian pQTC genes and 183 non-pQTCs in ovarian cancer (OR ϭ 2.0; p ϭ 0.004); the percentage of breast cancer phQTCs were 72% and 64%, respectively, in 104 and 489 ovarian phQTC and non-phQTC genes (OR ϭ 1.4; p ϭ 0.14). In summary, for all three omics traits, we observed that genes with significant cis-regulations in ovarian cancer were more likely to have significant cis-regulations in breast cancer data than nonsignificant ovarian cancer associations (OR ϭ 1.4 -2.9). The concordance was strongest in eQTCs and weakest in phQTCs.
As mention in the previous section, compared with separate analyses, iProFun demonstrates enhanced power and identified additionally 3 eQTCs, 67 pQTCs, and 155 phQTCs in the ovarian cancer data set ( Fig. 5(A)). To assess the likelihood whether these additional detections represent true associations, we examined the "replication rate" of these pQTCs and phQTCs in breast cancer data. Among the 67 pQTCs and 155 phQTCs uniquely identified by iProFun in ovarian cancer, 58 and 139 genes were included in the breast cancer dataset. Among them, 69% of pQTCs and 72% of phQTCs were also confirmed to have significant cis-regulation in the breast cancer dataset. These high levels of replication rates based on an independent cohort with a different but related cancer type convincingly suggest that additional pQTC and phQTC identifications by iProFun are more likely to be biological relevant regulations than false positives due to data noises. DISCUSSION In this study, we introduced a novel integrative analysis tool, iProFun, to effectively detect proteogenomic functional traits altered by CNAs and DNA methylations by jointly modeling CNA, epigenome, transcriptome, global proteome, and phospho-proteome data. This integrative solution boosts power for detecting significant cis-associations and infers multi-omic association patterns by borrowing information across different omics data types.
We applied iProFun to the HGSOC tumor data from TCGA and CPTAC. HGSOC is the leading cause of gynecologic cancer death in the United States (25). In the United States this year, it is estimated that there will be more than 22,000 new ovarian cancer cases and more than 14,000 deaths (43).
Five-year survival, virtually unchanged for the past 40 years, remains at ϳ45% (44). Treatment failure is primarily due to the development of platinum resistance and cancer recurrence, and yet platinum remains the gold standard treatment. Following first-line use of platinum and in the absence of robust molecular guidance, nonspecific drugs with broad effects are essentially chosen in a formulaic manner for treating recurrences (45,46). These second-line agents are selected without nationally/internationally recognized guidelines and are known to be significantly less effective, having survival im-provements measurable usually in single months, and are associated with significant hematologic and nonhematologic toxicities (47)(48)(49)(50)(51)(52)(53).
Despite the tremendous efforts on reproducible delineation of molecular HGSOC subsets through large-scale next-generation sequencing profilings (28, 54 -56), understanding of the mechanisms of oncogenic alterations of HGSOC is still limited and there is still a lack of therapeutically actionable genomic alterations in tumors (53,57). Novel strategies are needed to gain insights that could lead to new treatment targets. The proposed analysis is an attempt to address this challenge by further integrating proteomics information with genomics information. By leveraging all the available molecular-level information in iProFun, we are able to identify and prioritize important alterations in the genome that have multiple functional consequences. Considering the well-known, low efficiency (at less than 5%) of translating even pharmaselected candidate therapeutic oncology targets into clinical practice, admittedly the targets discovered by iProFun will still need to be further validated and tested. However, the method provides a new way to identify candidates with known functional consequences in tumor for cancer development, progression, and resistance. Results from this analysis may help to nominate or prioritize novel candidate drug targets for ovarian cancer patients.
Specifically, using iProFun, we identified 117 CNAs that impact all levels of molecular QTs, i.e. mRNA, global, and phosphoprotein abundances, our definition of "cascade" ciseffects. This set should be enriched for biologically relevant cancer genes, as CNAs with preserved functional consequences are more likely to be cancer drivers. A network analysis of these 117 genes using the STRING database directed our attention to the gene AKT1, a key hub in the network, which interacts with many other cascade CNA genes (Supplemental Fig. S5). AKT1 is an effector in the PI3K/RAS pathway, which is deregulated in nearly half of all HGSOC cases (28), and down-regulation of its phosphoprotein was found to be associated with poor survival outcomes in the original TCGA-CPTAC ovarian study (16). While the impact of AKT1 copy number alteration was not discussed in the previous ovarian studies (16), iProFun results suggest that downregulation of AKT1 phosphopeptides are associated with DNA copy number losses of the gene. This potentially therapeutically targetable association/pathway, as well as the other 116 cascade CNA cis-associations, however, would not have been detected if only CNA phosphoproteomics data alone would have been analyzed due to the challenge of highdimension and low sample size in such investigations (Fig. 5).
Another interesting finding in iProFun results is the cascade effect of one methylation site cg10590292 within the bridging integrator 2 (BIN2) gene. BIN2, also called breast cancerassociated protein1, encodes a cytoplasmic protein, which influences podosome formation, motility, and phagocytosis via its interaction with the cell membrane and cytoskeleton

Proteomic and Genomic Integrative Analysis Via iProFun
Molecular & Cellular Proteomics 18.14 S61 (58) and relates to the innate immune system pathway. While the role of BIN2 in cancer is presently unknown, associations between up-regulation of BIN2 and favorable survival outcomes have been observed in all cervical, endometrial, breast, and ovarian cancers in TCGA studies (p value ϭ 0.0001, 0.0006, 0.008, and 0.075, respectively (59)). Our analysis further suggests that the expression levels as well as protein abundances of BIN2 were suppressed by DNA methylation in a subset of ovarian patients. Intriguingly, this implies a possible mechanism affecting immune invasion in ovarian tumors. A few other genes with biologically interesting association patterns identified by iProFun include CANX, RBM15, EML4, CDH6, MAP2, and KRT8. All have been previously shown to play different and important roles in cancers, but only CDH6 has been previously linked to ovarian cancer. CDH6 encodes a member of the cadherin superfamily. Cadherins are calcium-dependent cell adhesion proteins that play critical roles in cell differentiation and morphogenesis. CHD6 has been shown to be both highly differentially expressed in ovarian cancer and, taking advantage of its surface expression, demonstrated to be a unique therapeutic target for antibody-drug conjugates (60).
As already noted, in part, one of our reasons for targeting ovarian cancer in our studies was the current relatively bleak landscape for novel therapies. Thus, it is notable that our analyses have identified a number of therapeutic candidates. All of AKT1, KRT8, and MAP2 are druggable genes with approved drugs already on the market with indications for other tumors (61). Our results suggest that integrating multiple-omics data to screen for genes whose DNA alterations have significant impact on functional molecular traits can be a very effective strategy to nominate candidate genes contributing to the disease. While we believe that CNAs and DNA methylations, which play key roles in disease etiology of cancer, should preserve functional consequences, not all genetic alteration events with functional impacts are disease relevant. Thus, after iProFun is performed to nominate disease relevant genes, further investigation leveraging additional information, such as patient outcome data and gene-gene interaction network could further help to pinpoint the most promising candidates. In addition, the observed protein unique associations in our paper may be identified due to various biological reasons as well as power issues. Future investigation, such as comparing half-lives of corresponding mRNA and protein levels in tumor cells, will help to reveal the underlying biological mechanisms for protein unique associations.
Beyond the current studies, where we focused on the functional regulations of somatic CNA and DNA methylation, iProFun provides a general framework that can be easily extended to a wide range of applications. For example, by considering the same molecular quantitative trait from three subtypes as if it were from three omics data types with no overlapping samples, we could identify CNAs and methylations whose functional consequences are shared across sub-types versus CNAs and methylations whose functional consequences are unique to one or some of the subtypes. We could also extend the tool to germline variations, adding additional molecular QTs, and/or trans associations analysis. To provide a balanced perspective, we also note potential limitations of our analysis. First, iProFun is based on a linear regression framework and calculated the posterior probabilities using t-distributions, which might not be directly applicable to analyses that follow other distributions (e.g. 2 , F, or uniform distributions). Second, iProFun requires a relatively large number of genomic features, such as CNAs and DNA methylations, to estimate the density under the alternative distribution and robustly calculate the posterior probabilities. In cases where only a few genes are quantified, iProFun might not provide optimal results. Third, iProFun can be applied to genes that are measured across all data types of interest and, therefore, might define up fewer genes than separate analyses. Future studies could expand iProFun to incorporate more association analysis tools (e.g. provide a p-value-based algorithm) to overcome these limitations.
Software implementing the proposed iProFun, as well as CNA, DNA methylation, mRNA, global, and phosphoprotein data used for this analysis, are available on Github https:// github.com/songxiaoyu/iProFun.

DATA AVAILABILITY
The proteomic and phosphoproteomic data were obtained from the CPTAC Data Portal https://cptac-data-portal. georgetown.edu/cptacPublic/. The somatic CNA and mRNA data from the microarray platforms were downloaded from CPTAC publications (16) and (17). The DNA methylation data were downloaded from the TCGA Firehose pipeline processed in July 2016 at the Broad Institute (http://gdac. broadinstitute.org/). The germline genotyping data were obtained from NCI's Genomic Data Commons https://gdc. cancer.gov/.