MOGSA: Integrative Single Sample Gene-set Analysis of Multiple Omics Data

Gene-set analysis (GSA) summarizes information from molecule to gene-set level to facilitate biological interpretation of molecular profiling experiments. We present a statistical framework for single sample GSA of multiple 'omic data. MOGSA learns the most variant biomolecules and integrates data sets to generate gene set scores (GSS) for each sample. It extracts the contribution of individual data sets and biomolecules to GSS. MOGSA is rigorously benchmarked with simulated and real biological data and is implemented in the Bioconductor package “mogsa.”


Input data and gene-set annotation matrix
The inputs to MOGSA are pairs of multiple matrices (XK, GK). XK, denotes a set of K matrices X1, ..., Xk, … XK, are input matrices of omics data, where rows are features (e.g. genes, proteins) and columns are observations (e.g. cell lines, disease tissue samples). Matrix Xk have pk rows and all omics data have the sample number of n columns. Each of the omics matrices X1, …, XK has a corresponding gene-set annotation matrix, G1, ..., Gk, …, GK. The gene-set annotation matrix Gk is a pk×m binary incidence matrix of gene to gene-set membership associations, where m is the number of gene-sets. The element gk [i,j] in Gk has the value 1 if the ith feature is a member of the gene-set j and 0 otherwise. Gk is constructed using predefined gene-set information such as the Gene Ontology [1, 2], GeneSigDb [3] or MSigDB [4].

MOGSA step 1 multivariate integration
The first step of the MOGSA involves data integration with a multiple table MF method. In this study, we use multiple factorial analysis (MFA) because of its simplicity and computational efficiency. MFA can be viewed as a generalization of principal component analysis (PCA) for a multi-table problem [5]. The first step of MFA is to normalize each individual data set so that variance of their first principal components has the same. Next, the normalized individual matrices are concatenate to a grand matrix. Finally, a regular PCA is used to decompose the grand matrix to derive components that represent the most prominent structure in multiple input matrices. We describe MFA using the nomenclature used in [5].
When integrating multiple data matrices, one must decide if all datasets should have equal weight, or if some data are "more important", for example those with higher quality, fewer features, higher variance, etc.
Simple tensor decomposition approaches, or PCA on a concatenated matrix, give every dataset equal weight and results are often dominated by the matrix (or matrices) with the large variance or most features. To correct for this, MFA weights datasets by dividing each by their first eigenvalue. The weight of each matrix is expressed as  is the first singular value of data matrix Xk. For convenience, the weights of matrices are stored in a diagonal matrix A, whose diagonal elements are The transpose of a matrix is denoted by superscript T . 1k T is a vector of 1 in the length of pk. As a result, A is a p×p diagonal matrix, the diagonal elements of A representing the weight of features in X1, ..., Xk.
Similarly, the weight of each observation is an n×n diagonal matrix, M. In the present study, we use mii=1/n, namely, all observations are equally weighted.
We then transpose and concatenate all Xk to a complete p×n matrix ( ): X is transpose so that P is an n×r matrix, Q is a p×r matrix, Δ is an r×r square matrix. The maximum number of r is the rank of X. The matrix storing components of MFA, F, are given by where F has the same dimension as P. In the PCA framework, the matrix P contains the PCs or latent variables. We also call it sample space in this paper because every row in P corresponds to a sample in X.
The matrix Q is the loading matrix or feature space as every row in P corresponds to a feature. Because X is a concatenation of multiple matrices, the feature space matrices Q is also a concatenation of multiple Qk matrices, namely,

MOGSA step 2 project gene-set annotation matrix as supplementary data
Different gene-sets have different candidate genes, therefore, in order to facilitate the comparison of geneset score across gene-sets, we normalized the gene-set annotation matrix so that the sum of each column in where ] , [ j i g is the elements on the ith row and jth column in the normalized gene-set annotation matrix Ĝ The gene-set score can be calculated using either unnormalized or normalized gene-set annotation, but we will use the normalized version to describe the method.
Next, we project the annotation matrix as supplementary data to generate the gene-set space matrix Wk (m×r) [6], which is calculated as a product of the normalized gene annotation matrix and loading matrix.
The overall gene-set space W (m×r matrix) could also be expressed as the sum of individual ̂, that is,

MOGSA step 3 reconstruction of gene-set-observation matrix
The main output of MOGSA is a gene-set score (GSS) matrix, denoted by Y, whose rows are m gene-sets and columns are n observations. It is calculated as where Q [R] and P [R] are the gene space and observation space within top R components.
] [ R Δ is the diagonal matrix containing top R singular values. As a result, X [R] is the reconstruction of X using top R components.
In practice, it is interesting to know which dataset or component contribute more to the overall gene-set score. Therefore, we decompose gene-set scores with respect to data sets and components. The GSS matrix for dataset Xk and component r is calculated as In practice, only the components with greatest variances (highest eigenvalues) should be retained in the analysis. If all components are retained, the result would be similar or exactly the same as naïve matrix multiplication (NMM; see later).

Evaluation of the significance of gene-set scores (calculating p-values)
The p-value associated with each GSS could be calculated used central limited theorem (CLT). The expression (7) and (10) say that, for each observation, a gene-set score could be viewed as the weighted mean of gene expression (in the reconstructed expression values X [R] ) of genes in a particular gene-set.
If the candidate genes in a gene-set are randomly drawn from all features in X [R] (null hypothesis), the distribution of the means of selected genes given by CTL is, x  is the sampling standard deviation of means,  is the standard deviation of the column in X [R] , h is the number of candidate genes mapped to X in a geneset and is the finite population correction factor (p is the number of features in X). The finite population correction factor is used as each gene was only selected once in one gene-set. Of note, CLT only states that the mean of "selected genes" follows a normal distribution but does not rely on a normality assumption in input data sets. Therefore, it can be used with categorical or count data, where the categorical values are normalized as in correspondence analysis, resulting in a chi-square distribution [7].

Gene influential score
Gene-sets are composed of genes, therefore, it is also important to evaluate the contribution of each feature to the GSS. The genes with large contribution could be view as "driver" genes in a gene-set. In MOGSA, feature contribution, denoted by gene influential score (GIS), is calculated via a leave-one-out procedure.
The GSS of gene-set i, , for all the observations are [ i G is the gene-set annotation vector for gene-set i. Correspondingly, the gene-set score for ith gene- G is the gene-set annotation vector for gene-set i but without gene g. The influence of the gene g is measured by ) ( stands for the function of calculating standard deviation. For convenience, the feature influential score then is rescaled, such that the gene with maximum influence always equals 1. In general, a positive whereas a gene with a negative value tends to have a negative correlation.

Other GSA methods and Naïve Matrix multiplication (NMM)
Single gene-set method, including GSVA and ssGSEA methods were implemented using the R/Bioconductor package GSVA [8]. Default settings were used for these methods. Naïve gene-set score Ynaive was calculated through matrix multiplication (NMM).
Therefore, the result of NMM is exactly the same as MOGSA if all of the axes are retained.

Stability analysis of MOGSA components
The stability of MOGSA components was evaluated based on the NCI-60 and BLCA datasets in a sampleand feature-wise fashion. In the sample-wise stability analysis of the NCI-60 dataset, we used a leave-oneout procedure. In each analysis, one cell line was excluded from the panel and MOGSA was applied to the reduced dataset. The resulting components were compared with the ones calculated from the complete dataset using the absolute value of Pearson correlation coefficients. The resulting correlation matrix can be found in Table S1.
We also observed that the highly correlated components may be ordered differently when excluding a cell line. For example, when the ovarian cell line OVCAR-4 was excluded, component 6 and 7 were swapped ( Figure S17). In addition, a high score of a sample in a component does not mean the component calculated from excluding that sample will have a worse correlation with original component ( Figure S17).
The feature-wise stability analysis was conducted by applying MOGSA on feature-wise reduced matrix. In these analyses, we randomly excluded 10%, 20%, 30%, 40% and 50% of all features in both transcriptomic and proteomic datasets. The resulting components were compared with the ones calculated from the complete dataset. The correlation coefficients were not lower than 0.94 until component 14, which confirmed the components calculated based on less features are extremely stable.
In the sample-wise stability analysis of the BLCA dataset, all 308 patients were divided into 22 groups (14 in each) and one group was excluded each time. In the feature-wise analysis, 10% to 50% of features from both datasets were excluded. The components resulting from the sample-or feature-wise reduced datasets were evaluated in the same way as for the NCI-60 dataset. We observed that the top 5 components do not change much (absolute value of correlation coefficient > 0.99) upon exclusion of samples or features. Figure S1 -A diagram showing the data simulation data. One dataset contains a matrix triplet (data 1, data 2 and data 3). Each triplet contains 1,000 features and 30 observations. The 30 observations were divided into six clusters, 5 observations in each cluster. The 1,000 features are assigned to 20 gene-sets (each gene-set had 50 genes), coded in the gene set annotation matrix. 100 triplets were simulated in this analysis. Figure S2 -Determining the number of components that capture the correlated structure between NCI60 transcriptome and proteome data A random sampling method was used to determine the number of components capturing significant correlated structure and between transcriptome and proteome. To this end, the cell lines labels were randomly shuffled in both transcriptomic and proteomic data and the variance of components were calculated from the randomly labels data. We preformed this process 20 times in order to estimate the null distribution of variances associated with each component and found that the variance of top three components are significantly higher than the null distribution. In Figure S12, we show that component 1 was significantly correlated with cell doubling time.

Figure S3 -Distribution of iPS ES datasets before and after filtering
In mRNA data set, most of the genes with RPKM values were removed, resulting in a distribution closer to a normal distribution. Whereas filtering protein and phospho-site data almost have not changed the distribution, because only the low intensity proteins that exclusively detected in a small number of samples were excluded. Left column: distribution before filtering. Right column: distribution after filtering.

Figure S5 -Distribution of BLCA datasets before and after filtering
In CNV dataset, the sharp peak centered as 0 indicates most of genes have very small copy number changes. Filtering out gene has low median absolute deviations (MADs) results in a distribution having lower density in the center, i.e. less genes with unchanged CNV. The low abundant mRNAs were filtered. Left column: distribution before filtering. Right column: distribution after filtering.

Figure S6 -PCA of (A) CNV and (B) transcriptomic data of BLCA tumors (n=308)
Each panel shows a scree plot of the variance captured by the first 10 components and a plot of the first two components (PC1, PC2). Tumors are colored by molecular subtype; C1 (red), C2 (green) and C3 (blue). The first two components of the CNV decomposition distinguishes these 3 subtypes. The first eigenvalue (square of singular value) of transcriptomic and CNV data were 0.0004 and 0.0003 respectively, which values were used to calculate the weight of each dataset in MFA.

Figure S7 -Five components identifies robust top-ranking gene sets
Gene set scores (GSS) were calculated when 1 to 12 components were included. In every cases (1-12 components), gene sets were ranked from high to low according to the number of patients in which their GSSs was significant. Next, top N (N=10, 20, 40, 100

Figure S10 -Determining the number of K in KNN algorithm (Use in calculating prediction Strength)
Cross-validation were used to optimize the optimal number of K in the KNN classifier. We evaluated odd numbers K from 1 to 17. The performance of classifier were measured with prediction accuracy (y-axis). There is not a K clearly better than the others.

Figure S11 -Prediction strength using different K in KNN classifier.
All K suggest that three subtype is the robust number of subtype in the integrated BLCA datasets.         The rows (gene sets) of the heatmaps are clustered so that the gene sets with similar GSS scores across patients are grouped. Columns are ordered according to BLCA tumor molecular subtype (C1, C2 and C3). Gene sets formed three broad clusters (those significant in C1, C1 and C2 or C3 and other tumors). Significant gene sets in C1 were associated with apoptosis, G protein coupled proteins, extracellular function, muscle development and Immune response. Gene sets significant in both C1 and C2 were mostly associated with the cell cycle, DNA repair and replication. Gene sets significant in C3 patients were associated with the mitochondria.

Figure S22-BLCA subtype C2 has more instability and higher numbers of mutation events
The figure shows the numbers of homozygous or heterozygous deletions, low and high level gains in addition to total CNV events in the genome of BLCA patients (n=308).

Figure S23 -The distribution of gene set scores in different BLCA molecular subtypes
Boxplot of gene sets scores in figure 5 C and D. Immune processes, Regulation of apoptosis, and cytosketal gene sets were upregulated in C1. C2 was characterized by downregulation of ETS1 and IRF targets, Gprotein coupled receptors pathways and increased DNA related pathway (possibly associated with increased genome instability). C3 had lower expression of cell cycle and DNA replication genes compared to C1 and C2.    Figure 5C) Plots are labelled with the gene set name and the number features (genes) in each gene set which is shown in parenthesis. The raw GSS is sum of the contributions of genes in a gene set. Therefore, the scale of nonnormalized gene set scores (y-axis) are different. Gene sets with more genes will have higher scores so that GSS will not be comparable within a study. To solve this problem, MOGSA normalizes raw GSS by gene set size. Normalized GSS are reported throughout this article.