
MOGSA: Integrative Single Sample Gene-set Analysis of Multiple Omics Data*

  • Chen Meng
    Affiliations
    Chair of Proteomics and Bioanalytics, Technische Universität München, Freising, Germany

    Bavarian Biomolecular Mass Spectrometry Center (BayBioMS), TUM, Freising, Germany
  • Azfar Basunia
    Affiliations
    Department of Data Science, Division of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215
  • Bjoern Peters
    Affiliations
    La Jolla Institute for Allergy and Immunology, 9420 Athena Circle, La Jolla, California 92037
  • Amin Moghaddas Gholami
    Correspondence
    To whom correspondence may be addressed.
    Footnotes
    §§ Current address: Roche Sequencing Solutions, 1301 Shoreway Road, Suite 300, Belmont, California 94002
    Affiliations
    Chair of Proteomics and Bioanalytics, Technische Universität München, Freising, Germany
  • Bernhard Kuster
    Correspondence
    To whom correspondence may be addressed.
    Affiliations
    Chair of Proteomics and Bioanalytics, Technische Universität München, Freising, Germany

    Bavarian Biomolecular Mass Spectrometry Center (BayBioMS), TUM, Freising, Germany
  • Aedín C. Culhane
    Correspondence
    To whom correspondence may be addressed.
    Affiliations
    Department of Data Science, Division of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215

    Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02215
  • Author Footnotes
    * Funding for this work was provided by DFCI BCB Research Scientist Developmental Funds, National Cancer Institute at the National Institutes of Health [grant numbers P50 CA101942-11, 1U19 AI111224-01, 1U19 AI109755-01] and Department of Defense BCRP [award number W81XWH-15-1-0013]. Views and opinions of, and endorsements by, the authors do not reflect those of the US Army or the Department of Defense.
    This article contains supplemental material.
Open Access. Published: June 26, 2019. DOI: https://doi.org/10.1074/mcp.TIR118.001251
      Gene-set analysis (GSA) summarizes individual molecular measurements into more interpretable pathways or gene-sets and has become an indispensable step in the interpretation of large-scale omics data. However, GSA methods are limited to the analysis of single omics data. Here, we introduce a new computational method termed multi-omics gene-set analysis (MOGSA), a multivariate single sample gene-set analysis method that integrates multiple experimental and molecular data types measured over the same set of samples. The method learns a low dimensional representation of the most variant correlated features (genes, proteins, etc.) across multiple omics data sets, transforms the features onto the same scale and calculates an integrated gene-set score from the most informative features in each data type. MOGSA does not require filtering data to the intersection of features (gene IDs); therefore, all molecular features, including those that lack annotation, may be included in the analysis. Using simulated data, we demonstrate that integrating multiple diverse sources of molecular data increases the power to discover subtle changes in gene-sets and may reduce the impact of unreliable information in any single data type. Using real experimental data, we demonstrate three use-cases of MOGSA. First, we show how to remove a source of noise (technical or biological) in integrative MOGSA of NCI60 transcriptome and proteome data. Second, we apply MOGSA to discover similarities and differences in mRNA, protein and phosphorylation profiles of a small study of stem cell lines and assess the influence of each data type or feature on the total gene-set score. Finally, we apply MOGSA to cluster analysis and show that three molecular subtypes are robustly discovered when copy number variation and mRNA data of 308 bladder cancers from The Cancer Genome Atlas are integrated using MOGSA. MOGSA is available in the Bioconductor R package “mogsa.”

      Increasing numbers of studies report comprehensive molecular profiling using multiple different experimental approaches on the same set of biological samples. These multi-omics studies can potentially yield great insights into the complex molecular machinery of biological systems. High-throughput sequencing allows quantification of global DNA variation and whole transcriptome RNA expression (
      • Metzker M.L.
      Sequencing technologies - the next generation.
      ,
      • Ozsolak F.
      • Milos P.M.
      RNA sequencing: advances, challenges and opportunities.
      ). Mass spectrometry (MS)-based proteomics can identify and quantify most proteins expressed in human tissues or cell lines (
      • Wilhelm M.
      • Schlegl J.
      • Hahne H.
      • Gholami A.M.
      • Lieberenz M.
      • Savitski M.M.
      • Ziegler E.
      • Butzmann L.
      • Gessulat S.
      • Marx H.
      • Mathieson T.
      • Lemeer S.
      • Schnatbaum K.
      • Reimer U.
      • Wenschuh H.
      • Mollenhauer M.
      • Slotta-Huspenina J.
      • Boese J.H.
      • Bantscheff M.
      • Gerstmair A.
      • Faerber F.
      • Kuster B.
      Mass-spectrometry-based draft of the human proteome.
      ). Emerging single cell sequencing technologies enable simultaneous measurement of transcriptomes and protein markers expressed in the same cell, using CITE-seq or REAP-seq (
      • Peterson V.M.
      • Zhang K.X.
      • Kumar N.
      • Wong J.
      • Li L.
      • Wilson D.C.
      • Moore R.
      • McClanahan T.K.
      • Sadekova S.
      • Klappenbach J.A.
      Multiplexed quantification of proteins and transcripts in single cells.
      ,
      • Stoeckius M.
      • Hafemeister C.
      • Stephenson W.
      • Houck-Loomis B.
      • Chattopadhyay P.K.
      • Swerdlow H.
      • Satija R.
      • Smibert P.
      Simultaneous epitope and transcriptome measurement in single cells.
      ). Integrating, interpreting and generating biological hypotheses from such complex data sets is a considerable challenge.
      The abbreviations used are: GSA, gene-set analysis; ANOVA, analysis of variance; AUC, area under the receiver operating characteristic curve; BLCA, bladder cancer; BP, biological process; CC, cellular component; CCA, canonical correlation analysis; CIA, co-inertia analysis; CLT, central limit theorem; CPTAC, Clinical Proteomic Tumor Analysis Consortium; DE, differentially expressed; DEGS, differentially expressed gene-set; EMT, epithelial to mesenchymal transition; GIS, gene influential score; GO, gene ontology; GS, gene-set; GSEA, gene-set enrichment analysis; GSS, gene-set score; MAD, median absolute deviation; MCIA, multiple co-inertia analysis; MF, molecular function; MF, matrix factorization; MFA, multiple factorial analysis; MVA, multivariate analysis; NMM, naïve matrix multiplication; PCA, principal component analysis; ROC, receiver operating characteristic; SVD, singular value decomposition; TCGA, The Cancer Genome Atlas; TF, transcription factor; TFT, transcription factor target; t-SNE, t-Distributed Stochastic Neighbor Embedding.
      Gene-set analysis (GSA) is widely used in the analysis of genome scale data and is often the first step in the biological interpretation of lists of genes or proteins that are differentially expressed between phenotypically distinct groups (
      • Khatri P.
      • Sirota M.
      • Butte A.J.
      Ten years of pathway analysis: current approaches and outstanding challenges.
      ). These methods use external biological information, including gene ontologies, to reduce thousands of genes or proteins into lists of gene-sets that describe cellular pathways, subcellular localization, transcription factors or miRNA targets etc., thus facilitating hypothesis generation.
      Large scale omics studies or single cell studies may have limited a priori knowledge of phenotype groups or may aim to discover new molecular subtypes in a panel of experimental conditions or tissues with complex phenotypes, exemplified by The Cancer Genome Atlas (TCGA) (
      • Cancer Genome Atlas Research Network
      • Weinstein J.N.
      • Collisson E.A.
      • Mills G.B.
      • Shaw K.R.
      • Ozenberger B.A.
      • Ellrott K.
      • Shmulevich I.
      • Sander C.
      • Stuart J.M.
      The Cancer Genome Atlas Pan-Cancer analysis project.
      ) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) (
      • Ellis M.J.
      • Gillette M.
      • Carr S.A.
      • Paulovich A.G.
      • Smith R.D.
      • Rodland K.K.
      • Townsend R.R.
      • Kinsinger C.
      • Mesri M.
      • Rodriguez H.
      • Liebler D.C.
      • Clinical Proteomic Tumor Analysis Consortium
      Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium.
      ). Classical GSA methods that require phenotypically distinct groups (
      • Khatri P.
      • Sirota M.
      • Butte A.J.
      Ten years of pathway analysis: current approaches and outstanding challenges.
      ) have limited application in such cases and several unsupervised, single sample GSA (ssGSA) methods have been developed (
      • Hanzelmann S.
      • Castelo R.
      • Guinney J.
      GSVA: gene set variation analysis for microarray and RNA-seq data.
      ,
      • Barbie D.A.
      • Tamayo P.
      • Boehm J.S.
      • Kim S.Y.
      • Moody S.E.
      • Dunn I.F.
      • Schinzel A.C.
      • Sandy P.
      • Meylan E.
      • Scholl C.
      • Frohling S.
      • Chan E.M.
      • Sos M.L.
      • Michel K.
      • Mermel C.
      • Silver S.J.
      • Weir B.A.
      • Reiling J.H.
      • Sheng Q.
      • Gupta P.B.
      • Wadlow R.C.
      • Le H.
      • Hoersch S.
      • Wittner B.S.
      • Ramaswamy S.
      • Livingston D.M.
      • Sabatini D.M.
      • Meyerson M.
      • Thomas R.K.
      • Lander E.S.
      • Mesirov J.P.
      • Root D.E.
      • Gilliland D.G.
      • Jacks T.
      • Hahn W.C.
      Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1.
      ,
      • Tomfohr J.
      • Lu J.
      • Kepler T.B.
      Pathway level analysis of gene expression using singular value decomposition.
      ,
      • Lee E.
      • Chuang H.Y.
      • Kim J.W.
      • Ideker T.
      • Lee D.
      Inferring pathway activity toward precise disease classification.
      ). These methods do not require prior availability of phenotypic or clinical data. Arguably, one of the most popular approaches is single-sample GSEA that ranks genes according to the empirical cumulative distribution function and calculates a single sample-wise gene-set score by comparing the scores of genes that are inside and outside a gene-set (
      • Barbie D.A.
      • Tamayo P.
      • Boehm J.S.
      • Kim S.Y.
      • Moody S.E.
      • Dunn I.F.
      • Schinzel A.C.
      • Sandy P.
      • Meylan E.
      • Scholl C.
      • Frohling S.
      • Chan E.M.
      • Sos M.L.
      • Michel K.
      • Mermel C.
      • Silver S.J.
      • Weir B.A.
      • Reiling J.H.
      • Sheng Q.
      • Gupta P.B.
      • Wadlow R.C.
      • Le H.
      • Hoersch S.
      • Wittner B.S.
      • Ramaswamy S.
      • Livingston D.M.
      • Sabatini D.M.
      • Meyerson M.
      • Thomas R.K.
      • Lander E.S.
      • Mesirov J.P.
      • Root D.E.
      • Gilliland D.G.
      • Jacks T.
      • Hahn W.C.
      Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1.
      ). A related method, gene-set variation analysis (GSVA), also calculates sample-wise gene-set enrichment as a function of the genes that are inside and outside a gene-set, and also uses a similar Kolmogorov-Smirnov-like rank statistic to assess the enrichment score, but genes are ranked using a kernel estimation of a cumulative density function (
      • Hanzelmann S.
      • Castelo R.
      • Guinney J.
      GSVA: gene set variation analysis for microarray and RNA-seq data.
      ). These single-sample GSA methods are designed for the analysis of a single data set, and do not integrate or calculate a single sample GSA score on multiple data sets simultaneously.
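      The inside-versus-outside comparison shared by these methods can be illustrated with a plain, unweighted Kolmogorov-Smirnov-like running sum for one sample. This is a minimal sketch under simplifying assumptions and is neither ssGSEA nor GSVA: it omits their rank weighting and kernel-based ranking, and the function name ks_like_score is hypothetical.

```python
import numpy as np

def ks_like_score(values, in_set):
    """Unweighted KS-like running-sum score for a single sample.

    values: 1-D array of feature values for one sample.
    in_set: 1-D boolean array, True where the feature belongs to the gene-set.
    """
    values = np.asarray(values, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    n_in = int(in_set.sum())
    n_out = in_set.size - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("gene-set must contain some, but not all, features")
    # Rank features from highest to lowest value.
    order = np.argsort(-values, kind="stable")
    hits = in_set[order]
    # Step up at gene-set members, step down at all other features.
    steps = np.where(hits, 1.0 / n_in, -1.0 / n_out)
    running = np.cumsum(steps)
    # Signed maximum deviation of the running sum from zero.
    return running[np.argmax(np.abs(running))]
```

      A gene-set whose members sit at the top of the ranking scores close to 1, and one whose members sit at the bottom scores close to -1. The score uses only one sample from one data set, which is the limitation the paragraph above points out.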
      Here, we present a novel unsupervised single-sample gene-set analysis method, named “multi-omics GSA” (MOGSA), that calculates an integrated enrichment score using all the information in multiple omics data sets. The method relies on matrix factorization (MF), a family of powerful methods that can be used to learn patterns of biological significance in high dimensional data (
      • Stein-O'Brien G.L.
      • Arora R.
      • Culhane A.C.
      • Favorov A.V.
      • Garmire L.X.
      • Greene C.S.
      • Goff L.A.
      • Li Y.
      • Ngom A.
      • Ochs M.F.
      • Xu Y.
      • Fertig E.J.
      Enter the matrix: factorization uncovers knowledge from omics.
      ) as well as identify and exclude batch effects (
      • Leek J.T.
      • Storey J.D.
      Capturing heterogeneity in gene expression studies by surrogate variable analysis.
      ). Coupled or tensor MF methods can learn latent correlated structure within and between omics data sets (
      • Meng C.
      • Kuster B.
      • Culhane A.C.
      • Gholami A.M.
      A multivariate approach to the integration of multi-omics datasets.
      ,
      • de Tayrac M.
      • Le S.
      • Aubry M.
      • Mosser J.
      • Husson F.
      Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach.
      ,
      • Fagan A.
      • Culhane A.C.
      • Higgins D.G.
      A multivariate analysis approach to the integration of proteomic and gene expression data.
      ,
      • Le Cao K.A.
      • Martin P.G.
      • Robert-Granie C.
      • Besse P.
      Sparse canonical methods for biological data integration: application to a cross-platform study.
      ) and have been applied to the analysis of molecular data from different technology platforms (
      • Culhane A.C.
      • Perriere G.
      • Higgins D.G.
      Cross-platform comparison and visualisation of gene expression data using co-inertia analysis.
      ) and integration of diverse multi-omics data (
      • Meng C.
      • Kuster B.
      • Culhane A.C.
      • Gholami A.M.
      A multivariate approach to the integration of multi-omics datasets.
      ,
      • Fagan A.
      • Culhane A.C.
      • Higgins D.G.
      A multivariate analysis approach to the integration of proteomic and gene expression data.
      ).
      An attractive characteristic of coupled or tensor MF approaches is that they identify global correlated patterns among samples or observations. They can be applied to integrating data from experimental platforms that include known and unknown molecules (for example lipidomics, metabolomics) or molecules that are difficult to map one-to-one between data sets (e.g. transcript variants to proteins). Therefore, these approaches do not require pre-filtering of gene identifiers in each data set to a common intersecting subset of features. Although coupled MF or latent factor methods are powerful, they identify components whose interpretation can be challenging and may require domain knowledge (
      • Meng C.
      • Kuster B.
      • Culhane A.C.
      • Gholami A.M.
      A multivariate approach to the integration of multi-omics datasets.
      ,
      • Abdi H.
      • Williams L.J.
      • Valentin D.
      Multiple factor analysis: principal component analysis for multitable and multiblock data sets.
      ,
      • Meng C.
      • Zeleznik O.A.
      • Thallinger G.G.
      • Kuster B.
      • Gholami A.M.
      • Culhane A.C.
      Dimension reduction techniques for the integrative analysis of multi-omics data.
      ,
      • Tenenhaus A.
      • Tenenhaus M.
      Regularized generalized canonical correlation analysis.
      ). To solve this problem, MOGSA incorporates gene-set annotation in the correlated patterns of molecules resulting from MF and calculates scores for gene-sets in each biological sample, providing simple, accessible biological interpretation. We showed that integrative ssGSA by MOGSA has higher sensitivity and specificity for the detection of differentially expressed gene-sets compared with popular ssGSA approaches when applied to simulated data. To demonstrate result interpretation and application, we applied MOGSA to both small- and large-scale biological data from high throughput experiments.
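      The idea of combining gene-set annotation with the output of MF can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not the implementation in the mogsa package: it stacks the centered data sets without any block weighting, uses a plain SVD, and the function name gene_set_scores and the binary annotation matrices are hypothetical.

```python
import numpy as np

def gene_set_scores(blocks, annotations, n_components):
    """Sketch of an integrated single-sample gene-set score.

    blocks: list of (features_k x samples) arrays measured on the same samples.
    annotations: list of (features_k x gene_sets) 0/1 arrays, one per block.
    Returns a (gene_sets x samples) score matrix.
    """
    # Center every feature across samples, then stack all features.
    centered = [b - b.mean(axis=1, keepdims=True) for b in blocks]
    stacked = np.vstack(centered)
    # Factorize the stacked matrix and keep the leading components.
    u, d, vt = np.linalg.svd(stacked, full_matrices=False)
    r = n_components
    low_rank = (u[:, :r] * d[:r]) @ vt[:r, :]
    # Sum the reconstructed values of the features annotated to each gene-set.
    annot = np.vstack(annotations)
    return annot.T @ low_rank
```

      Features need not be shared between data sets in this sketch: each data set brings its own rows and its own annotation matrix, and unannotated features still shape the components even though they do not enter any gene-set sum.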

      DISCUSSION

      In this manuscript, we introduced a new approach for multi-omics ssGSA, termed MOGSA, that enables the discovery of, for example, biological pathways with correlated profiles across multiple complex data sets. MOGSA uses tensor MF or multivariate latent variable analysis to explore correlated global variance structure across data sets and then extracts the gene-sets or pathways with the highest variance that are most strongly associated with this correlated structure across observations. By combining multiple data types, we can compensate for missing or unreliable information in any single data type, so we may find gene-sets that cannot be detected by single omics data analysis alone (
      • Meng C.
      • Kuster B.
      • Culhane A.C.
      • Gholami A.M.
      A multivariate approach to the integration of multi-omics datasets.
      ).
      MOGSA is fundamentally different from other gene-set enrichment analysis methods which use a ‘within observation summarization’ such as the mean or median of gene expression of genes in a gene-set (
      • Hanzelmann S.
      • Castelo R.
      • Guinney J.
      GSVA: gene set variation analysis for microarray and RNA-seq data.
      ,
      • Barbie D.A.
      • Tamayo P.
      • Boehm J.S.
      • Kim S.Y.
      • Moody S.E.
      • Dunn I.F.
      • Schinzel A.C.
      • Sandy P.
      • Meylan E.
      • Scholl C.
      • Frohling S.
      • Chan E.M.
      • Sos M.L.
      • Michel K.
      • Mermel C.
      • Silver S.J.
      • Weir B.A.
      • Reiling J.H.
      • Sheng Q.
      • Gupta P.B.
      • Wadlow R.C.
      • Le H.
      • Hoersch S.
      • Wittner B.S.
      • Ramaswamy S.
      • Livingston D.M.
      • Sabatini D.M.
      • Meyerson M.
      • Thomas R.K.
      • Lander E.S.
      • Mesirov J.P.
      • Root D.E.
      • Gilliland D.G.
      • Jacks T.
      • Hahn W.C.
      Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1.
      ,
      • Argelaguet R.
      • Velten B.
      • Arnol D.
      • Dietrich S.
      • Zenz T.
      • Marioni J.C.
      • Buettner F.
      • Huber W.
      • Stegle O.
      Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets.
      ). MOGSA has several unique characteristics that make it well suited for generating integrated multi-omics gene-set scores. First, MOGSA uses MFA, a multitable extension of PCA, to reduce the complexity of the original data by transforming high dimensional data to a low dimensional representation of the data on a few orthogonal components (latent variables). The components with the highest eigenvalues capture the most prominent or variant structure that is shared among the different data sets. Keeping the first few components (with high variance) and excluding lower ranked components that may be associated with noise or artifacts (
      • Chang W.-C.
      On using principal components before separating a mixture of two multivariate normal distributions.
      ,
      • Alter O.
      • Brown P.O.
      • Botstein D.
      Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms.
      ) may increase the signal-to-noise ratio and sensitivity of data analysis. In MOGSA, the entire set of features from each platform is decomposed onto a lower dimensional space. The linear combination of feature loadings is used in the calculation of the gene-set scores. Features that contribute low variance contribute little to the score, and thus the dimension reduction within MOGSA provides an intrinsic filtering of noise. The advantages of intrinsic variance filtering of features can be clearly seen when MOGSA is applied to simulated data. Second, data integration of features is achieved at the gene-set level rather than by scoring individual features. This greatly facilitates the biological interpretation among multiple integrated data sets. There is no requirement to pre-filter features in a study or map features from different data sets to a set of common genes. Therefore, MOGSA can be used to compare technology platforms that have different or missing features.
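      The multitable step can be illustrated with the standard MFA recipe: each data set is divided by its first singular value so that no single platform dominates, and the weighted data sets are then concatenated and decomposed. The sketch below assumes features in rows and samples in columns; the function name mfa_components is hypothetical and the code is not taken from the mogsa package.

```python
import numpy as np

def mfa_components(blocks, n_components):
    """MFA-style decomposition of several (features_k x samples) blocks."""
    weighted = []
    for block in blocks:
        # Center each feature across samples.
        centered = block - block.mean(axis=1, keepdims=True)
        # Divide by the first singular value so each block has unit leading variance.
        first_sv = np.linalg.svd(centered, compute_uv=False)[0]
        weighted.append(centered / first_sv)
    stacked = np.vstack(weighted)
    u, d, vt = np.linalg.svd(stacked, full_matrices=False)
    # Proportion of total variance carried by each component (for a scree plot).
    explained = d**2 / np.sum(d**2)
    k = n_components
    return u[:, :k], d[:k], vt[:k], explained
```

      Because of the weighting, rescaling one data set (for example, changing its measurement units) leaves the components unchanged. The rows of the returned loading matrix follow the order of the stacked features, so loadings of features with little shared variance stay close to zero.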
      There is great potential for applying multitable unsupervised GSA approaches for discovery of new subtypes and pathways in integrated data analysis of complex diseases such as cancer. In this study, we applied MOGSA in combination with clustering analysis. Recent studies comparing algorithms for clustering multi-omics data have confirmed the good performance of matrix-factorization-based algorithms in terms of speed and cluster accuracy (
      • Meng C.
      • Helm D.
      • Frejno M.
      • Kuster B.
      moCluster: Identifying joint patterns across multiple omics data sets.
      ,
      • Rappoport N.
      • Shamir R.
      Multi-omic and multi-view clustering algorithms: review and cancer benchmark.
      ,
      • Chauvel C.
      • Novoloaca A.
      • Veyre P.
      • Reynier F.
      • Becker J.
      Evaluation of integrative clustering methods for the analysis of multi-omics data.
      ). This may be because these approaches consider the global variance in the data and as such are complementary to hierarchical or k-means clustering approaches, which focus on the pair-wise distance between observations (
      • Alter O.
      • Brown P.O.
      • Botstein D.
      Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms.
      ,
      • Hastie T.
      • Tibshirani R.
      • Eisen M.B.
      • Alizadeh A.
      • Levy R.
      • Staudt L.
      • Chan W.C.
      • Botstein D.
      • Brown P.
      ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns.
      ,
      • Holter N.S.
      • Mitra M.
      • Maritan A.
      • Cieplak M.
      • Banavar J.R.
      • Fedoroff N.V.
      Fundamental patterns underlying gene expression profiles: simplicity from complexity.
      ,
      • Brazma A.A.C.C.
      Algorithms for gene expression analysis.
      ).
      The number of components is an important input parameter to consider when applying MOGSA to gene-set analysis or cluster discovery. As in PCA, the optimal number of MFA components may be assessed by examining the variance associated with each component. The first component captures the most variance, and the variance associated with each subsequent component decreases monotonically. Scree plots (Fig. 2D, 5A) may be used to visualize whether there is an elbow point in the eigenvalues, allowing one to select the components before the elbow point. Alternatively, one may select the number of components that capture a certain proportion of variance (50%, 70%, etc.). In addition, one may include components that are of biological interest. For example, in the iPS ES example, there is a clear biological meaning in the third component (ES versus iPS cell line). Permutation analysis is a more objective approach to select components in PCA and factor analysis (
      • Franklin S.B.
      • Gibson D.J.
      • Robertson P.A.
      • Pohlmann J.T.
      • Fralish J.S.
      Parallel Analysis: a method for determining significant principal components.
      ). We used a permutation-based method in MOGSA of the NCI60 data set, i.e. the samples in each omics data set are randomly permuted multiple times and a null distribution of the variance associated with each component is calculated from these permuted samples. This method can be used to identify the components representing common structures in multiple omics data sets and requires the least subjective interpretation. Therefore, we implemented this in the MOGSA package.
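      The permutation idea can be sketched as follows. This is an illustrative approximation and not the routine shipped in the package: it centers the data sets without block weighting, permutes the sample order independently within each data set to break the correspondence between data sets, and compares the observed component variances with the permuted ones rank by rank. The function names are hypothetical.

```python
import numpy as np

def component_variances(blocks):
    """Squared singular values of the stacked, feature-centered blocks."""
    stacked = np.vstack([b - b.mean(axis=1, keepdims=True) for b in blocks])
    return np.linalg.svd(stacked, compute_uv=False) ** 2

def permutation_component_test(blocks, n_perm=200, seed=0):
    """Empirical p-value per component from sample permutations."""
    rng = np.random.default_rng(seed)
    observed = component_variances(blocks)
    n_samples = blocks[0].shape[1]
    null = np.empty((n_perm, observed.size))
    for i in range(n_perm):
        # Shuffle sample columns independently in every block.
        shuffled = [b[:, rng.permutation(n_samples)] for b in blocks]
        null[i] = component_variances(shuffled)
    # Fraction of permutations whose variance reaches the observed value.
    p_values = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)
    return observed, p_values
```

      Components with small p-values carry more variance than expected when the data sets are unrelated, which suggests structure shared across data sets; components with large p-values are candidates for exclusion.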
      In the analysis of the BLCA data, we examined components 1 to 12 and showed that there is little gain of information once a minimum number of components with high variance are included (Fig. 6B). Users should also consider that the variance of retained components should not be dominated by one or a few data sets. To facilitate biological interpretation of components, the GSS can be decomposed by component. In the BLCA example, the second and fourth components are largely contributed by CNV, whereas mRNA is more important in defining the third and fifth components. Including five components ensured that both data sets contributed relatively similar variance to the total gene-set score.
      An issue might arise with latent variable analysis if components with large variance capture information unrelated to biological variance, such as technical artifacts or batch effects. In practice, this is rare in MFA, because it focuses on components that capture global correlation among all data sets. Often batch effects are specific to a platform, and thus a component that captures information that is entirely uncorrelated to the global structure will be omitted from the set of highly variant integrated components. Decomposition of gene-sets by components makes it easy to remove unwanted variance, and gene-set scores calculated from selected components, reflecting variance of interest, may lead to better interpretation of interesting biology. However, it is still wise to perform careful batch effect control on individual data sets. To this end, Surrogate Variable Analysis (SVA) (
      • Leek J.T.
      • Storey J.D.
      Capturing heterogeneity in gene expression studies by surrogate variable analysis.
      ) can be used to detect unwanted variance in a single data set where the sample groups are known a priori (e.g. disease versus normal tissues). However, SVA is not optimal in experiments like the BLCA study, where predefined groups are unknown. In multi-omics experiments, a data set with platform-specific batch effects can be detected by calculating pair-wise RV coefficients (
      • Smilde A.K.
      • Kiers H.A.
      • Bijlsma S.
      • Rubingh C.M.
      • van Erk M.J.
      Matrix correlations for high-dimensional data: the modified RV-coefficient.
      ) between data sets. RV coefficients, ranging from 0 (low similarity) to 1 (high similarity), are a generalization of squared Pearson correlation coefficients used to quantify the similarity between two matrices. A data set with consistently lower RV coefficients when compared with other data sets may contain data set-specific biology or a technical batch effect. To include such a data set in MOGSA, the weight or relative importance of that data set to the integrative analysis can be reduced or normalized using STATIS (
      • Abdi H.
      • Williams L.J.
      • Valentin D.
      • Bennani-Dosse M.
      STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling.
      ) (which is implemented in MOGSA). STATIS normalizes data sets to give a greater weight to a data set with variance close to most data sets and a lower weight to a data set deviating from the majority. As a result, the resulting components are more likely to represent all data sets rather than being driven by one outlier data set.
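      For reference, the classical RV coefficient between two data sets measured on the same samples can be computed directly from their sample-by-sample cross-product matrices. The sketch below implements the classical definition only; the modified RV coefficient cited above adjusts it for high-dimensional data, and that adjustment is not shown. The function name rv_coefficient is hypothetical.

```python
import numpy as np

def rv_coefficient(x, y):
    """Classical RV coefficient between two (features x samples) matrices.

    Both matrices must have the same samples in the same column order.
    """
    # Center each feature across samples.
    x = x - x.mean(axis=1, keepdims=True)
    y = y - y.mean(axis=1, keepdims=True)
    # Sample-by-sample cross-product (configuration) matrices.
    sx = x.T @ x
    sy = y.T @ y
    # trace(sx @ sy) equals the elementwise sum because both are symmetric.
    numerator = np.sum(sx * sy)
    denominator = np.sqrt(np.sum(sx * sx) * np.sum(sy * sy))
    return numerator / denominator
```

      Computing this for every pair of data sets gives a small symmetric matrix; a data set whose row of coefficients is consistently low is the kind of outlier the paragraph above describes.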
      When batch effects or other unwanted variance are detected, MOGSA can calculate gene-set scores based on selected components to enrich for specific patterns or exclude attributes. For example, we excluded the first component in the NCI60 cell line data, which was associated with cell doubling time. Another consideration when applying MOGSA is that it is most efficient in detecting gene-sets that have broad correlation patterns among data types. It may fail to discover gene-sets with few genes, particularly if they have low variance on the selected components.
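      Because the low-rank reconstruction is a sum of rank-one terms, a gene-set score built from it splits additively by component, which is what makes it possible to drop an unwanted component. The sketch below assumes a single stacked, unweighted feature matrix and a binary annotation matrix; it is an illustration of the principle, not the package code, and the function name component_contributions is hypothetical.

```python
import numpy as np

def component_contributions(stacked, annot, n_components):
    """Per-component contributions to gene-set scores.

    stacked: (features x samples) matrix of all data sets stacked by rows.
    annot: (features x gene_sets) 0/1 annotation matrix.
    Returns an array of shape (n_components, gene_sets, samples).
    """
    centered = stacked - stacked.mean(axis=1, keepdims=True)
    u, d, vt = np.linalg.svd(centered, full_matrices=False)
    parts = [
        # Rank-one reconstruction from component r, summed within each gene-set.
        annot.T @ (d[r] * np.outer(u[:, r], vt[r]))
        for r in range(n_components)
    ]
    return np.stack(parts)
```

      Summing the returned array over its first axis gives the score from all retained components, whereas summing over every index except 0 gives a score that excludes the first component, analogous to the NCI60 example above.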
      MOGSA relies on a linear multivariate analysis method, which means that the resulting components and feature contributions capture linear correlation structure between multiple omics data sets. A potential nonlinear structure may not be captured by MOGSA. However, linear approximation of omics data sets is computationally efficient and has been successfully applied to find prominent structure in noisy omics data (
      • Meng C.
      • Kuster B.
      • Culhane A.C.
      • Gholami A.M.
      A multivariate approach to the integration of multi-omics datasets.
      ,
      • de Tayrac M.
      • Le S.
      • Aubry M.
      • Mosser J.
      • Husson F.
      Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach.
      ). T-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimension reduction method widely used in omics data analysis (
      • van der Maaten L.
      • Hinton G.
      Visualizing Data using t-SNE.
      ). It preserves polynomial structure in omics data sets and projects the nonlinear variance onto a lower dimensional space. However, t-SNE is not well suited for GSA, because it uses a stochastic procedure and contributions of individual features cannot be explicitly assessed. A more promising extension of MOGSA to account for nonlinear structure is to use a kernel function in the MFA step (
      • Mariette J.
      • Villa-Vialaneix N.
      Unsupervised multiple kernel learning for heterogeneous data integration.
      ). However, extending MOGSA to nonlinear kernels will considerably increase computational time because it dramatically expands the parameter space that needs to be optimized. Therefore, combining nonlinear kernels and MFA must be further evaluated in omics data integration.
      Sample size is also a consideration when applying MOGSA. Although MOGSA is a single sample GSA method, it should not be used when only one sample is measured, because MOGSA normalizes measured values across multiple samples (feature-wise normalization). This is like ssGSEA and GSVA, which use multiple samples to calculate a cumulative density function. In addition, MOGSA should be applied with caution when the sample size is too small (i.e. n < 3) because the feature-wise normalization process tends to be unstable in these cases. The current implementation of MOGSA employs classical MF and is restricted to studies where the samples are measured in all data sets, but missing data would be better accommodated if it were extended to use a Bayesian MF approach, such as group factor analysis, which has been successfully applied to multi-omics data analysis (
      • Argelaguet R.
      • Velten B.
      • Arnol D.
      • Dietrich S.
      • Zenz T.
      • Marioni J.C.
      • Buettner F.
      • Huber W.
      • Stegle O.
      Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets.
      ).
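      The dependence of feature-wise normalization on sample size can be illustrated with a minimal sketch. It assumes that feature-wise normalization means z-scoring each feature across samples, which may differ from the scaling MOGSA actually applies; the function name `zscore_features` and the toy values are hypothetical.

```python
import statistics


def zscore_features(data):
    """Z-score each feature (row) across samples (columns).

    Assumption: this stands in for "feature-wise normalization";
    it is not taken from the MOGSA implementation.
    Returns None for a feature that cannot be scaled.
    """
    normalized = []
    for row in data:
        if len(row) < 2:
            # A standard deviation is undefined for a single sample.
            normalized.append(None)
            continue
        mean = statistics.mean(row)
        sd = statistics.stdev(row)  # sample standard deviation (n - 1)
        if sd == 0:
            # Constant feature: scaling would divide by zero.
            normalized.append(None)
            continue
        normalized.append([(value - mean) / sd for value in row])
    return normalized


# Toy feature-by-sample tables (hypothetical values).
one_sample = [[5.1], [0.3], [12.0]]
two_samples = [[5.1, 5.3], [0.3, 9.0], [12.0, 11.0]]

print(zscore_features(one_sample))
print(zscore_features(two_samples))
```

      With one sample no feature can be scaled, and with two samples every feature collapses to approximately ±0.707 whatever its measured values, so the normalized data retain only the sign of the difference between the two samples.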
      Although the matrix factorization step does not require “ID mapping” in MOGSA, it is still required in the gene-set annotation step. Hence, the features included in gene-set score calculations are those annotated to gene sets, and are thus limited by the incompleteness and inaccuracy of gene-set annotation databases. Currently most databases only include information at the gene or protein level and do not include the transcript level, the site level of PTMs, or metabolites. Therefore, we used protein level annotation to annotate phospho-sites in this study. When more annotations are available, MOGSA can easily incorporate them by defining gene-set annotation matrices based on different levels of information. This would allow recent gene-set collections at the PTM level (
      • Krug K.
      • Mertins P.
      • Zhang B.
      • Hornbeck P.
      • Raju R.
      • Ahmad R.
      • Szucs M.
      • Mundt F.
      • Forestier D.
      • Jane-Valbuena J.
      • Keshishian H.
      • Gillette M.A.
      • Tamayo P.
      • Mesirov J.P.
      • Jaffe J.D.
      • Carr S.A.
      • Mani D.R.
      A curated resource for phosphosite-specific signature analysis.
      ) to be used. Equally, MOGSA can learn or predict the function of unknown or poorly annotated biomolecules using “guilt-by-association” inference, which could be used to extend gene-set databases to multi-omics complexity.
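      Protein level annotation of phospho-sites can be sketched as the construction of a binary feature-by-gene-set matrix. This is an illustrative sketch only: the function `build_annotation_matrix`, the `GENE_site` identifier format, and all gene and gene-set names are hypothetical and are not taken from the MOGSA package or from any annotation database.

```python
def build_annotation_matrix(features, feature_to_gene, gene_sets):
    """Build a binary feature-by-gene-set annotation matrix.

    matrix[i][j] is 1 when the gene mapped to features[i] is a member
    of the j-th gene set (sets are ordered alphabetically), else 0.
    Features without a gene mapping get an all-zero row.
    """
    set_names = sorted(gene_sets)
    matrix = []
    for feature in features:
        gene = feature_to_gene.get(feature)
        matrix.append([1 if gene in gene_sets[name] else 0 for name in set_names])
    return set_names, matrix


# Hypothetical phospho-site identifiers written as GENE_site.
phosphosites = ["GENEA_S12", "GENEA_T45", "GENEB_S7", "GENEC_Y99"]

# Protein level "ID mapping": each site inherits the annotation of its gene.
site_to_gene = {site: site.split("_")[0] for site in phosphosites}

# Toy gene sets defined at the gene level.
gene_sets = {
    "SET_1": {"GENEA", "GENEB"},
    "SET_2": {"GENEB", "GENED"},
}

set_names, annotation = build_annotation_matrix(phosphosites, site_to_gene, gene_sets)
for site, row in zip(phosphosites, annotation):
    print(site, dict(zip(set_names, row)))
```

      In this sketch the two sites of GENEA receive identical rows, which shows how site level information is lost under protein level annotation, and the site of GENEC receives an all-zero row because its gene belongs to no gene set. A site level collection would replace `site_to_gene` with an identity mapping and define the gene sets directly over site identifiers.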
      Finally, MOGSA is computationally efficient when applied to large integrative omics analyses, such as mRNA and CNV data of over 10,000 tumors. Open source, well documented code is provided in the Bioconductor package MOGSA, which includes a detailed vignette tutorial and example data sets. Although we do not include examples applied to multi-omics single cell sequencing technologies that provide simultaneous measurement of transcriptomes and protein markers expressed in the same cell, e.g. CITE-seq or REAP-seq (
      • Peterson V.M.
      • Zhang K.X.
      • Kumar N.
      • Wong J.
      • Li L.
      • Wilson D.C.
      • Moore R.
      • McClanahan T.K.
      • Sadekova S.
      • Klappenbach J.A.
      Multiplexed quantification of proteins and transcripts in single cells.
      ,
      • Stoeckius M.
      • Hafemeister C.
      • Stephenson W.
      • Houck-Loomis B.
      • Chattopadhyay P.K.
      • Swerdlow H.
      • Satija R.
      • Smibert P.
      Simultaneous epitope and transcriptome measurement in single cells.
      ), MOGSA can be applied to integrating, interpreting and generating biological hypotheses from complex high dimensional data sets from bulk sequencing or single cell technologies.

      Data Availability

      NCI60 proteomics data: http://129.187.44.58:7070/NCI60/ or https://www.ebi.ac.uk/pride/archive/projects/PXD005946. NCI60 transcriptomics data: https://discover.nci.nih.gov/cellminer/ (RNA: 5 Platform Gene Transcript (average z score)). IPS ES 4-plex data: supplemental Tables S1, S2, and S5 from http://scor.chem.wisc.edu/data.php. BLCA GISTIC data: https://gdac.broadinstitute.org/ (Bladder urothelial carcinoma). BLCA CNV and mRNA data: https://portal.gdc.cancer.gov/.

      Acknowledgments

      We thank Prof. Joaquim Bellmunt for the insightful discussions about bladder cancer molecular subtypes and treatment. We also thank Martin Frejno, Hannes Hahne and Dominic Helm for reading the manuscript and giving valuable suggestions.

      REFERENCES

        • Metzker M.L.
        Sequencing technologies - the next generation.
        Nat. Rev. Genet. 2010; 11: 31-46
        • Ozsolak F.
        • Milos P.M.
        RNA sequencing: advances, challenges and opportunities.
        Nat. Rev. Genet. 2011; 12: 87-98
        • Wilhelm M.
        • Schlegl J.
        • Hahne H.
        • Gholami A.M.
        • Lieberenz M.
        • Savitski M.M.
        • Ziegler E.
        • Butzmann L.
        • Gessulat S.
        • Marx H.
        • Mathieson T.
        • Lemeer S.
        • Schnatbaum K.
        • Reimer U.
        • Wenschuh H.
        • Mollenhauer M.
        • Slotta-Huspenina J.
        • Boese J.H.
        • Bantscheff M.
        • Gerstmair A.
        • Faerber F.
        • Kuster B.
        Mass-spectrometry-based draft of the human proteome.
        Nature. 2014; 509: 582-587
        • Peterson V.M.
        • Zhang K.X.
        • Kumar N.
        • Wong J.
        • Li L.
        • Wilson D.C.
        • Moore R.
        • McClanahan T.K.
        • Sadekova S.
        • Klappenbach J.A.
        Multiplexed quantification of proteins and transcripts in single cells.
        Nat. Biotechnol. 2017; 35: 936-939
        • Stoeckius M.
        • Hafemeister C.
        • Stephenson W.
        • Houck-Loomis B.
        • Chattopadhyay P.K.
        • Swerdlow H.
        • Satija R.
        • Smibert P.
        Simultaneous epitope and transcriptome measurement in single cells.
        Nat. Methods. 2017; 14: 865-868
        • Khatri P.
        • Sirota M.
        • Butte A.J.
        Ten years of pathway analysis: current approaches and outstanding challenges.
        PLoS Comput. Biol. 2012; 8: e1002375
        • Cancer Genome Atlas Research Network
        • Weinstein J.N.
        • Collisson E.A.
        • Mills G.B.
        • Shaw K.R.
        • Ozenberger B.A.
        • Ellrott K.
        • Shmulevich I.
        • Sander C.
        • Stuart J.M.
        The Cancer Genome Atlas Pan-Cancer analysis project.
        Nat. Genet. 2013; 45: 1113-1120
        • Ellis M.J.
        • Gillette M.
        • Carr S.A.
        • Paulovich A.G.
        • Smith R.D.
        • Rodland K.K.
        • Townsend R.R.
        • Kinsinger C.
        • Mesri M.
        • Rodriguez H.
        • Liebler D.C.
        • Clinical Proteomic Tumor Analysis Consortium
        Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium.
        Cancer Discov. 2013; 3: 1108-1112
        • Hanzelmann S.
        • Castelo R.
        • Guinney J.
        GSVA: gene set variation analysis for microarray and RNA-seq data.
        BMC Bioinformatics. 2013; 14: 7
        • Barbie D.A.
        • Tamayo P.
        • Boehm J.S.
        • Kim S.Y.
        • Moody S.E.
        • Dunn I.F.
        • Schinzel A.C.
        • Sandy P.
        • Meylan E.
        • Scholl C.
        • Frohling S.
        • Chan E.M.
        • Sos M.L.
        • Michel K.
        • Mermel C.
        • Silver S.J.
        • Weir B.A.
        • Reiling J.H.
        • Sheng Q.
        • Gupta P.B.
        • Wadlow R.C.
        • Le H.
        • Hoersch S.
        • Wittner B.S.
        • Ramaswamy S.
        • Livingston D.M.
        • Sabatini D.M.
        • Meyerson M.
        • Thomas R.K.
        • Lander E.S.
        • Mesirov J.P.
        • Root D.E.
        • Gilliland D.G.
        • Jacks T.
        • Hahn W.C.
        Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1.
        Nature. 2009; 462: 108-112
        • Tomfohr J.
        • Lu J.
        • Kepler T.B.
        Pathway level analysis of gene expression using singular value decomposition.
        BMC Bioinformatics. 2005; 6: 225
        • Lee E.
        • Chuang H.Y.
        • Kim J.W.
        • Ideker T.
        • Lee D.
        Inferring pathway activity toward precise disease classification.
        PLoS Comput. Biol. 2008; 4: e1000217
        • Stein-O'Brien G.L.
        • Arora R.
        • Culhane A.C.
        • Favorov A.V.
        • Garmire L.X.
        • Greene C.S.
        • Goff L.A.
        • Li Y.
        • Ngom A.
        • Ochs M.F.
        • Xu Y.
        • Fertig E.J.
        Enter the matrix: factorization uncovers knowledge from omics.
        Trends Genet. 2018; 34: 790-805
        • Leek J.T.
        • Storey J.D.
        Capturing heterogeneity in gene expression studies by surrogate variable analysis.
        PLoS Genet. 2007; 3: 1724-1735
        • Meng C.
        • Kuster B.
        • Culhane A.C.
        • Gholami A.M.
        A multivariate approach to the integration of multi-omics datasets.
        BMC Bioinformatics. 2014; 15: 162
        • de Tayrac M.
        • Le S.
        • Aubry M.
        • Mosser J.
        • Husson F.
        Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach.
        BMC Genomics. 2009; 10: 32
        • Fagan A.
        • Culhane A.C.
        • Higgins D.G.
        A multivariate analysis approach to the integration of proteomic and gene expression data.
        Proteomics. 2007; 7: 2162-2171
        • Le Cao K.A.
        • Martin P.G.
        • Robert-Granie C.
        • Besse P.
        Sparse canonical methods for biological data integration: application to a cross-platform study.
        BMC Bioinformatics. 2009; 10: 34
        • Culhane A.C.
        • Perriere G.
        • Higgins D.G.
        Cross-platform comparison and visualisation of gene expression data using co-inertia analysis.
        BMC Bioinformatics. 2003; 4: 59
        • Abdi H.
        • Williams L.J.
        • Valentin D.
        Multiple factor analysis: principal component analysis for multitable and multiblock data sets.
        Wiley Interdisciplinary Reviews: Computational Statistics. 2013; 5: 149-179
        • Meng C.
        • Zeleznik O.A.
        • Thallinger G.G.
        • Kuster B.
        • Gholami A.M.
        • Culhane A.C.
        Dimension reduction techniques for the integrative analysis of multi-omics data.
        Brief Bioinform. 2016; 17: 628-641
        • Tenenhaus A.
        • Tenenhaus M.
        Regularized generalized canonical correlation analysis.
        Psychometrika. 2011; 76: 257-284
        • Shankavaram U.T.
        • Varma S.
        • Kane D.
        • Sunshine M.
        • Chary K.K.
        • Reinhold W.C.
        • Pommier Y.
        • Weinstein J.N.
        CellMiner: a relational database and query tool for the NCI-60 cancer cell lines.
        BMC Genomics. 2009; 10: 277
        • Gholami A.M.
        • Hahne H.
        • Wu Z.
        • Auer F.J.
        • Meng C.
        • Wilhelm M.
        • Kuster B.
        Global proteome analysis of the NCI-60 cell line panel.
        Cell Rep. 2013; 4: 609-620
        • Schwanhausser B.
        • Busse D.
        • Li N.
        • Dittmar G.
        • Schuchhardt J.
        • Wolf J.
        • Chen W.
        • Selbach M.
        Global quantification of mammalian gene expression control.
        Nature. 2011; 473: 337-342
        • Phanstiel D.H.
        • Brumbaugh J.
        • Wenger C.D.
        • Tian S.
        • Probasco M.D.
        • Bailey D.J.
        • Swaney D.L.
        • Tervo M.A.
        • Bolin J.M.
        • Ruotti V.
        • Stewart R.
        • Thomson J.A.
        • Coon J.J.
        Proteomic and phosphoproteomic comparison of human ES and iPS cells.
        Nat. Methods. 2011; 8: 821-827
        • Zwiener I.
        • Frisch B.
        • Binder H.
        Transforming RNA-Seq data to improve the performance of prognostic gene signatures.
        PLoS ONE. 2014; 9: e85150
        • Wenger C.D.
        • Phanstiel D.H.
        • Lee M.V.
        • Bailey D.J.
        • Coon J.J.
        COMPASS: a suite of pre- and post-search proteomics software tools for OMSSA.
        Proteomics. 2011; 11: 1064-1074
        • Zhu Y.
        • Qiu P.
        • Ji Y.
        TCGA-assembler: open-source software for retrieving and processing TCGA data.
        Nat. Methods. 2014; 11: 599-600
        • Wang K.
        • Singh D.
        • Zeng Z.
        • Coleman S.J.
        • Huang Y.
        • Savich G.L.
        • He X.
        • Mieczkowski P.
        • Grimm S.A.
        • Perou C.M.
        • MacLeod J.N.
        • Chiang D.Y.
        • Prins J.F.
        • Liu J.
        MapSplice: accurate mapping of RNA-seq reads for splice junction discovery.
        Nucleic Acids Res. 2010; 38: e178
        • Li B.
        • Ruotti V.
        • Stewart R.M.
        • Thomson J.A.
        • Dewey C.N.
        RNA-Seq gene expression estimation with read mapping uncertainty.
        Bioinformatics. 2010; 26: 493-500
        • Olshen A.B.
        • Venkatraman E.S.
        • Lucito R.
        • Wigler M.
        Circular binary segmentation for the analysis of array-based DNA copy number data.
        Biostatistics. 2004; 5: 557-572
        • Mermel C.H.
        • Schumacher S.E.
        • Hill B.
        • Meyerson M.L.
        • Beroukhim R.
        • Getz G.
        GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers.
        Genome Biol. 2011; 12: R41
        • Monti S.
        • Tamayo P.
        • Mesirov J.
        • Golub T.
        Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.
        Machine Learning. 2003; 52: 28
        • Wilkerson M.D.
        • Hayes D.N.
        ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking.
        Bioinformatics. 2010; 26: 1572-1573
        • Senbabaoglu Y.
        • Michailidis G.
        • Li J.Z.
        Critical limitations of consensus clustering in class discovery.
        Sci. Rep. 2014; 4: 6207
        • Tibshirani R.
        • Walther G.
        Cluster Validation by Prediction Strength.
        J. Computational Graphical Statistics. 2005; 14: 511-528
        • Sjodahl G.
        • Lauss M.
        • Lovgren K.
        • Chebil G.
        • Gudjonsson S.
        • Veerla S.
        • Patschan O.
        • Aine M.
        • Ferno M.
        • Ringner M.
        • Mansson W.
        • Liedberg F.
        • Lindgren D.
        • Hoglund M.
        A molecular taxonomy for urothelial carcinoma.
        Clin. Cancer Res. 2012; 18: 3377-3386
        • Liberzon A.
        • Subramanian A.
        • Pinchback R.
        • Thorvaldsdottir H.
        • Tamayo P.
        • Mesirov J.P.
        Molecular signatures database (MSigDB) 3.0.
        Bioinformatics. 2011; 27: 1739-1740
        • Argelaguet R.
        • Velten B.
        • Arnol D.
        • Dietrich S.
        • Zenz T.
        • Marioni J.C.
        • Buettner F.
        • Huber W.
        • Stegle O.
        Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets.
        Mol. Syst. Biol. 2018; 14: e8124
        • Busold C.H.
        • Winter S.
        • Hauser N.
        • Bauer A.
        • Dippon J.
        • Hoheisel J.D.
        • Fellenberg K.
        Integration of GO annotations in Correspondence Analysis: facilitating the interpretation of microarray data.
        Bioinformatics. 2005; 21: 2424-2429
        • Leek J.T.
        • Scharpf R.B.
        • Bravo H.C.
        • Simcha D.
        • Langmead B.
        • Johnson W.E.
        • Geman D.
        • Baggerly K.
        • Irizarry R.A.
        Tackling the widespread and critical impact of batch effects in high-throughput data.
        Nat. Rev. Genet. 2010; 11: 733-739
        • Aran D.
        • Sirota M.
        • Butte A.J.
        Systematic pan-cancer analysis of tumour purity.
        Nat. Commun. 2015; 6: 8971
        • McDavid A.
        • Finak G.
        • Gottardo R.
        The contribution of cell cycle to heterogeneity in single-cell RNA-seq data.
        Nat. Biotechnol. 2016; 34: 591-593
        • Kenny P.A.
        • Lee G.Y.
        • Myers C.A.
        • Neve R.M.
        • Semeiks J.R.
        • Spellman P.T.
        • Lorenz K.
        • Lee E.H.
        • Barcellos-Hoff M.H.
        • Petersen O.W.
        • Gray J.W.
        • Bissell M.J.
        The morphologies of breast cancer cell lines in three-dimensional assays correlate with their profiles of gene expression.
        Mol. Oncol. 2007; 1: 84-96
        • Knowles M.A.
        • Hurst C.D.
        Molecular biology of bladder cancer: new insights into pathogenesis and clinical diversity.
        Nat. Rev. Cancer. 2015; 15: 25-41
        • Robertson A.G.
        • Kim J.
        • Al-Ahmadie H.
        • Bellmunt J.
        • Guo G.
        • Cherniack A.D.
        • Hinoue T.
        • Laird P.W.
        • Hoadley K.A.
        • Akbani R.
        • Castro M.A.A.
        • Gibb E.A.
        • Kanchi R.S.
        • Gordenin D.A.
        • Shukla S.A.
        • Sanchez-Vega F.
        • Hansel D.E.
        • Czerniak B.A.
        • Reuter V.E.
        • Su X.
        • de Sa Carvalho B.
        • Chagas V.S.
        • Mungall K.L.
        • Sadeghi S.
        • Pedamallu C.S.
        • Lu Y.
        • Klimczak L.J.
        • Zhang J.
        • Choo C.
        • Ojesina A.I.
        • Bullman S.
        • Leraas K.M.
        • Lichtenberg T.M.
        • Wu C.J.
        • Schultz N.
        • Getz G.
        • Meyerson M.
        • Mills G.B.
        • McConkey D.J.
        • TCGA Research Network
        • Weinstein J.N.
        • Kwiatkowski D.J.
        • Lerner S.P.
        Comprehensive molecular characterization of muscle-invasive bladder cancer.
        Cell. 2017; 171: 540-556.e525
        • Damrauer J.S.
        • Hoadley K.A.
        • Chism D.D.
        • Fan C.
        • Tiganelli C.J.
        • Wobker S.E.
        • Yeh J.J.
        • Milowsky M.I.
        • Iyer G.
        • Parker J.S.
        • Kim W.Y.
        Intrinsic subtypes of high-grade bladder cancer reflect the hallmarks of breast cancer biology.
        Proc. Natl. Acad. Sci. U.S.A. 2014; 111: 3110-3115
        • Choi W.
        • Porten S.
        • Kim S.
        • Willis D.
        • Plimack E.R.
        • Hoffman-Censits J.
        • Roth B.
        • Cheng T.
        • Tran M.
        • Lee I.L.
        • Melquist J.
        • Bondaruk J.
        • Majewski T.
        • Zhang S.
        • Pretzsch S.
        • Baggerly K.
        • Siefker-Radtke A.
        • Czerniak B.
        • Dinney C.P.
        • McConkey D.J.
        Identification of distinct basal and luminal subtypes of muscle-invasive bladder cancer with different sensitivities to frontline chemotherapy.
        Cancer Cell. 2014; 25: 152-165
        • Lindgren D.
        • Frigyesi A.
        • Gudjonsson S.
        • Sjodahl G.
        • Hallden C.
        • Chebil G.
        • Veerla S.
        • Ryden T.
        • Mansson W.
        • Liedberg F.
        • Hoglund M.
        Combined gene expression and genomic profiling define two intrinsic molecular subtypes of urothelial carcinoma and gene signatures for molecular grading and outcome.
        Cancer Res. 2010; 70: 3463-3472
        • Biton A.
        • Bernard-Pierrot I.
        • Lou Y.
        • Krucker C.
        • Chapeaublanc E.
        • Rubio-Perez C.
        • Lopez-Bigas N.
        • Kamoun A.
        • Neuzillet Y.
        • Gestraud P.
        • Grieco L.
        • Rebouissou S.
        • de Reynies A.
        • Benhamou S.
        • Lebret T.
        • Southgate J.
        • Barillot E.
        • Allory Y.
        • Zinovyev A.
        • Radvanyi F.
        Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes.
        Cell Rep. 2014; 9: 1235-1245
        • Chang W.-C.
        On using principal components before separating a mixture of two multivariate normal distributions.
        J. Roy. Statistical Soc. 1983; 32: 267-275
        • Alter O.
        • Brown P.O.
        • Botstein D.
        Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms.
        Proc. Natl. Acad. Sci. U.S.A. 2003; 100: 3351-3356
        • Meng C.
        • Helm D.
        • Frejno M.
        • Kuster B.
        moCluster: Identifying joint patterns across multiple omics data sets.
        J. Proteome Res. 2016; 15: 755-765
        • Rappoport N.
        • Shamir R.
        Multi-omic and multi-view clustering algorithms: review and cancer benchmark.
        Nucleic Acids Res. 2018; 46: 10546-10562
        • Chauvel C.
        • Novoloaca A.
        • Veyre P.
        • Reynier F.
        • Becker J.
        Evaluation of integrative clustering methods for the analysis of multi-omics data.
        Brief Bioinform. 2019; bbz015 [Epub ahead of print]
        • Hastie T.
        • Tibshirani R.
        • Eisen M.B.
        • Alizadeh A.
        • Levy R.
        • Staudt L.
        • Chan W.C.
        • Botstein D.
        • Brown P.
        ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns.
        Genome Biol. 2000; 1
        • Holter N.S.
        • Mitra M.
        • Maritan A.
        • Cieplak M.
        • Banavar J.R.
        • Fedoroff N.V.
        Fundamental patterns underlying gene expression profiles: simplicity from complexity.
        Proc. Natl. Acad. Sci. U.S.A. 2000; 97: 8409-8414
        • Brazma A.
        • Culhane A.C.
        Algorithms for gene expression analysis.
        Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. John Wiley & Sons, New York City, NY 2005
        • Franklin S.B.
        • Gibson D.J.
        • Robertson P.A.
        • Pohlmann J.T.
        • Fralish J.S.
        Parallel Analysis: a method for determining significant principal components.
        J. Vegetation Sci. 1995; 6: 99-106
        • Smilde A.K.
        • Kiers H.A.
        • Bijlsma S.
        • Rubingh C.M.
        • van Erk M.J.
        Matrix correlations for high-dimensional data: the modified RV-coefficient.
        Bioinformatics. 2009; 25: 401-405
        • Abdi H.
        • Williams L.J.
        • Valentin D.
        • Bennani-Dosse M.
        STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling.
        Wiley Interdisciplinary Reviews: Computational Statistics. 2012; 4: 124-167
        • van der Maaten L.
        • Hinton G.
        Visualizing Data using t-SNE.
        Journal of Machine Learning Research. 2008; 9: 2579-2605
        • Mariette J.
        • Villa-Vialaneix N.
        Unsupervised multiple kernel learning for heterogeneous data integration.
        Bioinformatics. 2018; 34: 1009-1015
        • Krug K.
        • Mertins P.
        • Zhang B.
        • Hornbeck P.
        • Raju R.
        • Ahmad R.
        • Szucs M.
        • Mundt F.
        • Forestier D.
        • Jane-Valbuena J.
        • Keshishian H.
        • Gillette M.A.
        • Tamayo P.
        • Mesirov J.P.
        • Jaffe J.D.
        • Carr S.A.
        • Mani D.R.
        A curated resource for phosphosite-specific signature analysis.
        Mol. Cell. Proteomics. 2019; 18: 576-593