Contents

Introduction

This tutorial is a guide on how to run and use the COSMONET R package. It is available from http://bioinfo.na.iac.cnr.it/cosmonet/.

COSMONET depends on different R packages such as APML0, survival, ggpurb, igraph, pheamap.

Installation

COSMONET can be installed through the following commands in the R console.

# ## From the command line
# ### 1. Open a terminal window
# ### 2. Go to the directory that contains your package directory.
# ### 3. Install the package
# R CMD INSTALL COSMONET_0.1.tar.gz
#
# ## From local source file
# ### Go to the link http://bioinfo.na.iac.cnr.it/cosmonet/ and download the package from the Repository - R package
# install.packages("COSMONET_0.1.0.tar.gz", repos = NULL, type="source")

Using COSMONET to investigate cancer data

The aim of this section is to provide the users with a general view of the package, including the main functions, and to describe the package’s outputs. First, we upload the COSMONET R package.

library(COSMONET)

For illustration purposes, we use two examples of cancer data that are included in our package. The data processed by COSMONET are microarray from GEO database and RNA-seq data from GDC data portal. The aim of the study is to identify the genes and pathways that play a crucial role in the severity of the disease related to the survival of cancer patients.

Users can download the normalized data and survival information directly from the Repository of the website (Repository/Data/Download Data).

Example 1

For microarray data, we consider expression measurements of two different breast cancer datasets from GEO database: * GSE2034 dataset as training set (Affymetrix Human Genome U133A Array); * GSE2990 dataset as testing set (Affymetrix Human Genome U133A Array).

RMA (Robust Multichip Average) function from affy R package for within arrays normalization and normalizeBetweenArrays function from limma R package for between arrays normalization are used for the normalization of each single dataset. The annotation is made by using biomart R package after removing redundant probe sets, i.e. probe sets that measure different regions of the same gene. Generally, the pre-processing of the data are done by the users. The summary data is showed in the table below:

Summary of the breast data used for our analysis
GEO Accession Platform Nr. of genes Nr. of samples Survival data
GSE2034 Affymetrix HG-U133A 13229 286 Relapse Free Survival (RFS)
GSE2990 Affymetrix HG-U133A 13229 187 Relapse Free Survival (RFS)

In this study, the survival information is relapse-free survival, i.e., the length of time after primary treatment for a cancer ends that the patient survives without any signs or symptoms of that cancer. In other words, measure the relapse-free survival means to see how well a new treatment works. Also called DFS disease-free survival, and RFS.

To investigate the datasets, we first load the data in an R session and then, we perform the normalization between the training and testing set by using NormalizeBetweenData() function.

# Load the normalized data
load("data/BreastData_Normalized.RData")

# Load survival information: RFS - time (months) and status (alive=0 or death=1)
load("data/BreastData_Survival.RData")

Functional Linkage Matrix and Co-Expression Matrix

To incorporate the a-priori biological information in the penalized regression model, we need to create the network matrix. In our analysis, to construct the molecular graph, we use CreateNetworkMatrix() function. The disease code associated with breast cancer is (DOID:1612.

# Load the tissue specific data
library(tictoc)

tic("Run time")
load("data/BreastData_TissueSpecificEdge.rda")
NetworkMatrix <- CreateNetworkMatrix(
  x1.norm,
  repositoryDisease,
  diseaseID = "DOID:1612",
  topGenes = 500,
  tissueSpecificEdge,
  output = TRUE,
  message = TRUE
)
## [1] "Your list contains 500 genes. 500 out of 500 are used in the network."
## [1] "There are 43865 edges left."
toc()
## Run time: 297.62 sec elapsed
# Network matrix
Omega <- as.matrix(NetworkMatrix$Omega)
dim(Omega)
## [1] 13229 13229

Variable screenings for Cox model

In this section we present the application of the three screening techniques. We use the function ScreeningMethod() to identify the screened genes.

  1. BioMedical Driven (BMD)-screening
# Selection of the BMD-screened genes
cancer.genes <- NetworkMatrix$DiseaseGenes[,2]
screenVars.BMD <- ScreeningMethod(
  x1.norm,
  cancer.genes,
  screening="bmd",
  ranking=NULL,
  thresh=NULL
)
## [1] "Number of BMD-screened genes: 437"
  1. DAta-driven (DAD-screening)
# Ranking of genes that are highly correlated with the patient's survival
load("data/BreastData_RankingGenes.RData")
head(rankingGenes)
##   original.index  symbol marg.coeff       beta        pvals
## 1           2449  CHRNA9  10.437125 -10.437125 2.030443e-02
## 2           3012   CXCL6   9.836375  -9.836375 2.122943e-02
## 3           9444 RACGAP1   8.915556   8.915556 1.484894e-07
## 4           7834   NR2E3   8.896399  -8.896399 8.931928e-05
## 5           2494   CKAP2   8.294733   8.294733 5.502541e-07
## 6           7414    MZB1   8.282109  -8.282109 2.372354e-04

The ranking of genes is obtained applying MarginalCoxRanking() function. Users can download the file of the ranked genes directly from the website (Repository/Data/Download Data).

# Plots
par(mfrow = c(1, 2))
plot(rankingGenes$marg.coeff, rankingGenes$pvals, col = "blue", pch = 20, cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
# Add the threshold line related to the 50th marginal coefficient (equal to 6.851106)
abline(v=rankingGenes[floor(dim(x1.norm)[1]/log(dim(x1.norm)[1])),3], col="red",lwd=2, lty=2)
plot(rankingGenes$pvals, 1:dim(x1.norm)[2], col = "blue", pch = 20, cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
# Add the threshold line equal to 50 (i.e., the first 50 genes)
abline(h=floor(dim(x1.norm)[1]/log(dim(x1.norm)[1])), col="red",lwd=2, lty=2)
Plots based on the $p$-values and marginal coefficients of the ranked cancer genes.

Figure 1: Plots based on the \(p\)-values and marginal coefficients of the ranked cancer genes

These plots give an idea of the number of screened genes that can be involved in the analysis to give significance to the results (in the first plot the significant genes are positioned to the right of the dashed red line; in the second one the genes are are below the dashed red line). In particular, we set the threshold equal to \(\lfloor n/ log n \rfloor=50\), i.e., the 50 first genes as DAD-screened genes.

# Selection of the DAD-screened genes
screenVars.DAD <- ScreeningMethod(
  x1.norm,
  cancer.genes,
  screening="dad",
  ranking=rankingGenes,
  thresh=floor(dim(x1.norm)[1]/log(dim(x1.norm)[1]))
)
## [1] "Number of DAD-screened genes: 50"
  1. BioMedical Driven + DAta Driven Screening (BMD+DAD-screening)
# Union of BMD- and DAD-screeened genes
screenVars.BMD.DAD <- ScreeningMethod(
  x1.norm,
  cancer.genes,
  screening="bmd+dad",
  ranking=rankingGenes,
  thresh=floor(dim(x1.norm)[1]/log(dim(x1.norm)[1]))
)
## [1] "Number of BMD+DAD-screened genes: 549"

We consider as resulting set the union of the BMD and DAD-screened genes.

Network-Regularized Cox Regression

Now we are able to apply the penalized methods using the subset of screened variables in order to identify the active regression coefficients and the optimal cutoff (based on the prognostic index) on the training set. We can get reproducible results by fixing \(alpha\) (network influence) and the elements of each folds in the 5-fold cross validation. We perform the complete procedure applying CosmonetTraining() and SelectOptimalCutoff() functions.

# Set alpha and lambda parameters
alpha <- 0.5
# lambda <- 10^(seq(0,-2,-0.05))
# For reproducibility of the results
library(caret)
set.seed(2021)
flds <- createFolds(y1$time, k = 5, list = TRUE, returnTrain = FALSE)
foldids <- rep(1,dim(y1)[1])
foldids[flds$Fold2] = 2
foldids[flds$Fold3] = 3
foldids[flds$Fold4] = 4
foldids[flds$Fold5] = 5
# BMD-screening + variable selection
# pdf(file="results/Train_KM_BMD_median.pdf",width=15, height=7)
tic("Run time")
fitTraining.BMD <- CosmonetTraining(
  10,
  x1.norm,
  y1,
  screenVars.BMD,
  family="Cox",
  penalty="Net",
  Omega=Omega,
  alpha=alpha,
  nlambda = 50,
  foldid = foldids,
  selOptLambda="min",
  optCutpoint="median"
)
Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 2: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 3: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

toc()
## Run time: 11.303 sec elapsed
# dev.off()
# Regression coefficients
beta.BMD <- fitTraining.BMD$beta
# Number of non-zero beta
sum(beta.BMD!=0)
## [1] 40
# write.table(beta.BMD, file = "results/beta_BMD_breast.txt", row.names = T, col.names = F)

# Optimal parameters
pars.BMD <- data.frame(alpha=0.5,opt.lambda=fitTraining.BMD$opt.lambda,opt.cutoff=fitTraining.BMD$opt.cutoff)
pars.BMD
##   alpha opt.lambda opt.cutoff
## 1   0.5    0.11899    -7.3697
# Summary tables
df.train.BMD <- fitTraining.BMD$df
head(df.train.BMD)
##     sample        PI time status 50%
## 1 GSM37048 -8.358512   87      0   2
## 2 GSM36945 -8.352637   92      0   2
## 3 GSM36984 -8.342822  138      0   2
## 4 GSM36852 -8.267668  107      0   2
## 5 GSM36883 -8.254200   88      0   2
## 6 GSM36845 -8.223623   54      0   2
summary.BMD <- fitTraining.BMD$summary
summary.BMD
##     Low.Risk High.Risk  cutoff  p.value
## 50%      143       143 -7.3697 2.82e-14
# write.table(summary.BMD, file = "results/summary_BMD_breast.txt", row.names = T, col.names = T)
# DAD-screening + variable selection
# pdf(file="results/Train_KM_DAD_survCutpoint.pdf",width=15, height=7)
tic("Run time")
fitTraining.DAD <- CosmonetTraining(
  10,
  x1.norm,
  y1,
  screenVars.DAD,
  family="Cox",
  penalty="Net",
  Omega=Omega,
  alpha=alpha,
  nlambda = 50,
  foldid = foldids,
  selOptLambda="min",
  optCutpoint="survCutpoint"
)
Training Phase (DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 4: Training Phase (DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 5: Training Phase (DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

toc()
## Run time: 1.624 sec elapsed
# dev.off()
# Regression coefficients
beta.DAD <- fitTraining.DAD$beta
# Number of non-zero beta
sum(beta.DAD!=0)
## [1] 32
# write.table(beta.DAD, file = "results/beta_DAD_breast.txt", row.names = T, col.names = F)

# Optimal parameters
pars.DAD <- data.frame(alpha=0.5,opt.lambda=fitTraining.DAD$opt.lambda,opt.cutoff=fitTraining.DAD$opt.cutoff)
pars.DAD
##   alpha opt.lambda opt.cutoff
## 1   0.5 0.04740321  -10.15184
# Summary tables
df.train.DAD <- fitTraining.DAD$df
head(df.train.DAD)
##     sample        PI time status groupRisk id
## 1 GSM36933 -13.83397  110      0         2  1
## 2 GSM36817 -13.68776  113      0         2  2
## 3 GSM36851 -13.37413  100      0         2  3
## 4 GSM36915 -13.31251   97      0         2  4
## 5 GSM36863 -13.23986   98      0         2  5
## 6 GSM36794 -13.04705   87      0         2  6
summary.DAD <- fitTraining.DAD$summary
summary.DAD
##     cutpoint statistic
## PI -10.15184  10.04115
# write.table(summary.DAD, file = "results/summary_DAD_breast.txt", row.names = T, col.names = T)
# BMD+DAD-screening + variable selection
# pdf(file="results/Train_KM_BMD_DAD_median.pdf",width=15, height=7)
tic("Run time")
fitTraining.BMD.DAD <- CosmonetTraining(
  10,
  x1.norm,
  y1,
  screenVars.BMD.DAD,
  family="Cox",
  penalty="Net",
  Omega=Omega,
  alpha=alpha,
  nlambda = 50,
  foldid = foldids,
  selOptLambda="min",
  optCutpoint="median"
)
Training Phase (BMD+DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 6: Training Phase (BMD+DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD+DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 7: Training Phase (BMD+DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

toc()
## Run time: 10.647 sec elapsed
# dev.off()
# Regression coefficients
beta.BMD.DAD <- fitTraining.BMD.DAD$beta
sum(beta.BMD.DAD!=0)
## [1] 67
# write.table(beta.BMD.DAD, file = "results/beta_BMD_DAD_breast.txt", row.names = T, col.names = F)

# Optimal parameters
pars.BMD.DAD <- data.frame(alpha=0.5,opt.lambda=fitTraining.BMD.DAD$opt.lambda,opt.cutoff=fitTraining.BMD.DAD$opt.cutoff)
pars.BMD.DAD
##   alpha opt.lambda opt.cutoff
## 1   0.5 0.08331165  -23.89154
# Summary tables
df.train.BMD.DAD <- fitTraining.BMD.DAD$df
head(df.train.BMD.DAD)
##     sample        PI time status 50%
## 1 GSM36868 -26.13637  123      0   2
## 2 GSM36863 -26.03310   98      0   2
## 3 GSM37059 -25.95335   83      0   2
## 4 GSM36984 -25.90174  138      0   2
## 5 GSM36780 -25.86187   84      0   2
## 6 GSM36821 -25.83046   88      0   2
summary.BMD.DAD <- fitTraining.BMD.DAD$summary
summary.BMD.DAD
##     Low.Risk High.Risk    cutoff p.value
## 50%      143       143 -23.89154       0
# write.table(summary.BMD.DAD, file = "results/summary_BMD_DAD_breast.txt", row.names = T, col.names = T)

Survival analysis

We perform the testing phase by using CosmonetTesting() and ValidationTest() functions to validate the survival model.

# Validation test for BMD-screening + variable selection
opt.cutoff.BMD <- fitTraining.BMD$opt.cutoff
# pdf(file="results/Test_KM_BMD_breast.pdf",width=15, height=7)
tic("Run time")
fitTesting.BMD <- CosmonetTesting(
  x2.norm,
  y2,
  screenVars.BMD,
  beta.BMD,
  opt.cutoff.BMD
)
Testing Phese (BMD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI.

Figure 8: Testing Phese (BMD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI

toc()
## Run time: 0.812 sec elapsed
# dev.off()

# Summary table
df.test.BMD <- fitTesting.BMD$df
head(df.test.BMD)
##       sample        PI       time status groupRisk
## 44  GSM65361 -8.539579  88.273973      0         2
## 182 GSM65875 -8.457235 107.243836      0         2
## 169 GSM65862 -8.424892 142.323288      0         2
## 41  GSM65358 -8.384166  66.969863      0         2
## 25  GSM65342 -8.350312   2.169863      0         2
## 185 GSM65878 -8.348892   8.810959      1         2
# Validation test for DAD-screening + variable selection
opt.cutoff.DAD <- fitTraining.DAD$opt.cutoff
# pdf(file="results/Test_KM_DAD_breast.pdf",width=15, height=7)
tic("Run time")
fitTesting.DAD <- CosmonetTesting(
  x2.norm,
  y2,
  screenVars.DAD,
  beta.DAD,
  opt.cutoff.DAD
)
Testing Phase (DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI.

Figure 9: Testing Phase (DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI

toc()
## Run time: 0.79 sec elapsed
# dev.off()

df.test.DAD <- fitTesting.DAD$df
head(df.test.DAD)
##       sample        PI      time status groupRisk
## 102 GSM65794 -12.51275 103.95616      0         2
## 107 GSM65799 -12.40898  87.94521      0         2
## 172 GSM65865 -12.40538 144.72329      0         2
## 150 GSM65843 -12.29438  84.23014      1         2
## 14  GSM65331 -11.85027 114.01644      0         2
## 178 GSM65871 -11.82383 104.84384      0         2
# Validation test for BMD+DAD-screening + variable selection
opt.cutoff.BMD.DAD <- fitTraining.BMD.DAD$opt.cutoff
# pdf(file="results/Test_KM_BMD_DAD_breast.pdf",width=15, height=7)
tic("Run time")
fitTesting.BMD.DAD <- CosmonetTesting(
  x2.norm,
  y2,
  screenVars.BMD.DAD,
  beta.BMD.DAD,
  opt.cutoff.BMD.DAD
)
Testing Phase (BMD+DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI.

Figure 10: Testing Phase (BMD+DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI

toc()
## Run time: 0.775 sec elapsed
# dev.off()

df.test.BMD.DAD <- fitTesting.BMD.DAD$df
head(df.test.BMD.DAD)
##       sample        PI     time status groupRisk
## 169 GSM65862 -26.11466 142.3233      0         2
## 172 GSM65865 -25.99728 144.7233      0         2
## 178 GSM65871 -25.71374 104.8438      0         2
## 171 GSM65864 -25.70033 120.5260      1         2
## 29  GSM65346 -25.55507  18.6411      0         2
## 161 GSM65854 -25.54970 147.1890      0         2

Venn diagram plotting

We use venn diagrams to show the relation between low-risk groups (or high-risk groups) of each screening method by considering the testing set. We use VennPlot() function.

# Low-risk groups
lowRisk.BMD <- as.character(df.test.BMD$sample[which(df.test.BMD$groupRisk==2)])
lowRisk.DAD <- as.character(df.test.DAD$sample[which(df.test.DAD$groupRisk==2)])
lowRisk.BMD.DAD <- as.character(df.test.BMD.DAD$sample[which(df.test.BMD.DAD$groupRisk==2)])

# High-risk groups
highRisk.BMD <- as.character(df.test.BMD$sample[which(df.test.BMD$groupRisk==1)])
highRisk.DAD <- as.character(df.test.DAD$sample[which(df.test.DAD$groupRisk==1)])
highRisk.BMD.DAD <- as.character(df.test.BMD.DAD$sample[which(df.test.BMD.DAD$groupRisk==1)])
# Display Venn Diagram
vennPlot1 <- VennPlot(lowRisk.BMD, lowRisk.DAD, lowRisk.BMD.DAD, "LR-Group")
Venn diagramm of low-risk groups on testing set

Figure 11: Venn diagramm of low-risk groups on testing set

## NULL
# Display Venn Diagram
vennPlot2 <- VennPlot(highRisk.BMD, highRisk.DAD, highRisk.BMD.DAD, "HR-Group")
Venn diagramm of high-risk groups on testing set

Figure 12: Venn diagramm of high-risk groups on testing set

## NULL

Heat-map visualization

We use HeatmapSurvGroup() function to visualize the heat-map of the gene expression matrix on the training and testing set, respectively, by taking into account the screened genes and the prognostic information (high- and low-risk group).

Heat-maps of the BMD-genes on training and testing set

Figure 13: Heat-maps of the BMD-genes on training and testing set

Heat-maps of the DAD-genes on training and testing set

Figure 14: Heat-maps of the DAD-genes on training and testing set

 Heat-maps of the BMD+DAD-genes on training and testing set

Figure 15: Heat-maps of the BMD+DAD-genes on training and testing set

Figure 13-15 show the Z-score expression matrix of the BMD-, DAD- and BMD+DAD-genes in the training and testing, respectively. The genes in the horizontal direction are clustered in the same order in both sets. Z-scores of the gene expression measurements are used. Z-scores larger than 3.5 were set to 3.5 and Z-scores smaller than -3.5 were set to -3.5. Dark pink color indicates a high level of expression in breast cancer and green color indicates a low level of expression. The patients are divided in high-risk and low-risk groups (i.e., patients with bad and good prognosis).

Correlation between PIs

We perform a correlation analysis on the training and testing set to further investigate the relation between prognostic indices PIs. We use corrPlot() function.

# Renames columns
colnames(df.train.BMD)[5] <- "groupRisk.BMD"
colnames(df.train.DAD)[5] <- "groupRisk.DAD"
colnames(df.train.BMD.DAD)[5] <- "groupRisk.BMD.DAD"

# Correlation plots of prognostic index PI
corrPlot1 <- CorrPlot(df1 = df.train.BMD,
                      df2 = df.train.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "DAD",
                      set = "Training set"
                      )
corrPlot2 <- CorrPlot(df1 = df.train.BMD,
                      df2 = df.train.BMD.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.BMD.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "BMD.DAD",
                      set = "Training set"
                      )

# Visualization
ggarrange(corrPlot1,corrPlot2, nrow = 1, ncol = 2, common.legend = T, font.label=list(color="black",size=14))
PI-correlation plots on training set

Figure 16: PI-correlation plots on training set

# ggsave(file="results/corr_train_breast.pdf",width=15, height=7)
# Rename columns
colnames(df.test.BMD)[5] <- "groupRisk.BMD"
colnames(df.test.DAD)[5] <- "groupRisk.DAD"
colnames(df.test.BMD.DAD)[5] <- "groupRisk.BMD.DAD"

# Correlation plots of prognostic index PI
corrPlot1 <- CorrPlot(df1 = df.test.BMD,
                      df2 = df.test.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "DAD",
                      set = "Testing set"
                      )
corrPlot2 <- CorrPlot(df1 = df.test.BMD,
                      df2 = df.test.BMD.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.BMD.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "BMD.DAD",
                      set = "Testing set"
                      )

# Visualization
ggarrange(corrPlot1,corrPlot2, nrow = 1, ncol = 2, common.legend = T, font.label=list(color="black",size=14))
PI-correlation plots on testing set

Figure 17: PI-correlation plots on testing set

# ggsave(file="results/corr_test_breast.pdf",width=15, height=7)

Pathway Analysis

COSMONET includes an RMarkdown script to generate an interactive dashboard of KEGG pathways networks. We report the network created running the proposed model using the BMD+DAD-screened genes. The dashboard contains two tabs, one for the not-isolated genes and one for the full network. Each node corresponds to a gene and the edges represent the KEGG pathways shared by the linked genes. Different node colors are used to show a correlation measure between the genes and the disease under investigation as follows:

  • Red nodes represent the top 500 tissue-specific genes according to the HumanBase tool (mapped up genes);
  • Blue nodes represent the bottom 501 tissue-specific genes according to the HumanBase tool (mapped down genes);
  • White nodes represent the genes that have not been yet identified as tissue-specific in the HumanBase database (not mapped genes).

Therefore, red nodes represent the BMD contribution to the signature and blue and white nodes the DAD contribution, not yet captured in the BMD list. Note that some of the red genes might also be retrieved from the data-driven screening. However, in this context, we want to underline and make sense of the novel information not yet considered.

We implemented the GenerateNetwork() function to visualize the interactive networks. The colours of the nodes and the ID of the disease to investigate can be changed by the user passing the new values as function’s parameters in RUNCosmonetDashboard.Rmd loaded in the same directory as the vignetta.

# Build a data frame composed by two columns: gene names and beta coefficients
ListOfGenes <- data.frame(Symbol=rownames(beta.BMD.DAD), coefficients=beta.BMD.DAD[,1])
rownames(ListOfGenes) <-1:dim(ListOfGenes)[1]
# write.table(ListOfGenes, "ListOfGenes.txt", sep = "\t", row.names = F, col.names = T)

tic("Run time")
networkBreast <- GenerateNetwork(
  ListOfGenes = ListOfGenes,
  header = TRUE,
  diseaseID = "DOID:1612",
  nodesCols = c('#EFCFD4','#D9E3FC','#FFFFFF')
)
toc()

The function will automatically open a new html page to visualise the dashboard. An example of the final dashboard is shown in the figure below (and it can be accessed at the following link).

knitr::include_graphics("COSMONET_NETWORK.pdf")
Subnetwork related to the application of the BMD+DAD-screening and network method

Figure 18: Subnetwork related to the application of the BMD+DAD-screening and network method

The GenerateNetwork() function can also be used as a standalone tool to create a KEGG pathways network using an external list of genes and beta coefficients provided by the user.

Example 2

For RNASeq data we consider TCGA-LAUD (Lung adenocarcinoma) data from the GDC Data Portal. The data are first downloaded from LinkedOmics portal where they are already pre-processed and normalized (Illumina HiSeq platform, Gene-level, RPKM). Then, we split randomly the entire dataset between training and testing sets (\(50\%\) train and \(50\%\) test) in order to perform the between normalization. We use splitData() and NormalizeBetweenData() functions.

The summary of the data is showed in the table below:

Summary of the lung data used for our analysis
GDC Accession Platform Nr. of genes Nr. of samples Survival data
TCGA-LAUD Illumina HiSeq 19988 492 Overall Survival
TCGA-LAUD - train Illumina HiSeq 19988 246 Overall Survival
TCGA-LAUD - test Illumina HiSeq 19988 246 Overall Survival

Users can download the data directly from the website (Repository/Data/Download Data).

# Go to the directory that contains the data
# Load normalized data
load("data/LungData_Normalized.RData")

# Load survival information: RFS - time (months) and status (alive=0 or death=1)
load("data/LungData_Survival.RData")

Functional Linkage Matrix and Co-Expression Matrix

As before, to include the a-priori biological network in the model, we build the network matrix using CreateNetworkMatrix(). The disease code is DOID:1324.

# Load tissue specific data
tic("Run time")
load("data/LungData_TissueSpecificEdge.rda")
NetworkMatrix <- CreateNetworkMatrix(
  x1.norm,
  repositoryDisease,
  diseaseID = "DOID:1324",
  topGenes = 500,
  tissueSpecificEdge,
  output = TRUE,
  message = TRUE
)
## [1] "Your list contains 499 genes. 499 out of 499 are used in the network."
## [1] "There are 26310 edges left."
toc()
## Run time: 408.493 sec elapsed
# Network matrix
Omega <- as.matrix(NetworkMatrix$Omega)
dim(Omega)
## [1] 19988 19988

Variable screenings for Cox model

In this section, we illustrate the application of the variable screening approaches. We use ScreeningMethod() function.

  1. BioMedical Driven (BMD)-screening
# Selection of the BMD-screened genes
cancer.genes <- NetworkMatrix$DiseaseGenes[,2]
screenVars.BMD <- ScreeningMethod(
  x1.norm,
  cancer.genes,
  screening="bmd",
  ranking=NULL,
  thresh=NULL
)
## [1] "Number of BMD-screened genes: 482"
  1. DAta-driven (DAD-screening)
# Ranking of genes that are highly correlated with the patient's survival
load("data/LungData_RankingGenes.RData")
head(rankingGenes)
##   original.index    symbol marg.coeff       beta     pvals
## 1          11841     NOBOX  155.24804 -155.24804 0.9953518
## 2          12291    OR2A12   84.26860  -84.26860 0.9953640
## 3          12319     OR2J2   84.10697  -84.10697 0.9950270
## 4           2263 C20orf185   83.42011  -83.42011 0.9961920
## 5           9797 LOC285370   75.07351  -75.07351 0.9957503
## 6          12387    OR52B4   67.00438  -67.00438 0.9945741

Users can download the file of the ranked genes directly from the website (Repository/Data/Download Data).

# Plots
par(mfrow = c(1, 2))
plot(rankingGenes$marg.coeff, rankingGenes$pvals, col = "blue", pch = 20, cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
# Add the threshold line related to the 50th marginal coefficient (equal to 6.851106)
abline(v=rankingGenes[floor(dim(x1.norm)[1]/log(dim(x1.norm)[1])),3], col="red",lwd=2, lty=2)
index <- colnames(x1.norm)%in%rankingGenes$symbol
plot(rankingGenes$pvals[index], 1:dim(x1.norm[,index])[2], col = "blue", pch = 20, cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
# Add the threshold line equal to 50 (i.e., the first 50 genes)
abline(h=floor(dim(x1.norm[,index])[1]/log(dim(x1.norm[,index])[1])), col="red",lwd=2, lty=2)
Plots based on the $p$-values and marginal coefficients of the ranked cancer genes.

Figure 19: Plots based on the \(p\)-values and marginal coefficients of the ranked cancer genes

For this esxample, we set the threshold equal to 70. In other words, we consider the first 70 genes as DAD-screened genes.

# Selection of the DAD-screened genes
screenVars.DAD <- ScreeningMethod(
  x1.norm,
  cancer.genes,
  screening="dad",
  ranking=rankingGenes,
  thresh= 70 # floor(dim(x1.norm)[1]/log(dim(x1.norm)[1]))
)
## [1] "Number of DAD-screened genes: 70"
  1. BioMedical Driven + DAta Driven Screening (BMD+DAD-screening)
# Union of BMD- and DAD-screeened genes
screenVars.BMD.DAD <- ScreeningMethod(
  x1.norm,
  cancer.genes,
  screening="bmd+dad",
  ranking=rankingGenes,
  thresh= 70 # floor(dim(x1.norm)[1]/log(dim(x1.norm)[1]))
)
## [1] "Number of BMD+DAD-screened genes: 567"

We consider as resulting set the union of the BMD and DAD-screened genes.

Network-Regularized Cox Regression

In this section, we perform the variable selection setting \(\alpha=0.1\).

# Set alpha and lambda parameters
alpha <- 0.1
# Perform five-fold cross-validation to control the randomness for reproducibility of the results
library(caret)
set.seed(2021)
flds <- createFolds(y1$time, k = 5, list = TRUE, returnTrain = FALSE)
foldids <- rep(1,dim(y1)[1])
foldids[flds$Fold2] = 2
foldids[flds$Fold3] = 3
foldids[flds$Fold4] = 4
foldids[flds$Fold5] = 5
# BMD-screening + variable selection
# pdf(file="results/Train_KM_BMD_median.pdf",width=15, height=7)
tic("Run time")
fitTraining.BMD <- CosmonetTraining(
  10,
  x1.norm,
  y1,
  screenVars.BMD,
  family="Cox",
  penalty="Net",
  Omega=Omega,
  alpha=alpha,
  nlambda = 50,
  foldid = foldids,
  selOptLambda="min",
  optCutpoint="median"
)
Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 20: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 21: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

toc()
## Run time: 6.928 sec elapsed
# dev.off()
# Regression coefficients
beta.BMD <- fitTraining.BMD$beta
# Number of non-zero beta
sum(beta.BMD!=0)
## [1] 25
# write.table(beta.BMD, "results1/beta_BMD_lung.txt", row.names = T, col.names = F)

# Optimal parameters
pars.BMD <- data.frame(alpha=0.1,opt.lambda=fitTraining.BMD$opt.lambda,opt.cutoff=fitTraining.BMD$opt.cutoff)
pars.BMD
##   alpha opt.lambda opt.cutoff
## 1   0.1  0.7717434   2.379353
# Summary tables
df.train.BMD <- fitTraining.BMD$df
head(df.train.BMD)
##         sample       PI time status 50%
## 1 TCGA.86.A4P8 1.804026  805      0   2
## 2 TCGA.75.6206 1.928616 2590      0   2
## 3 TCGA.S2.AA1A 1.935255  513      0   2
## 4 TCGA.97.8552 1.956918  626      0   2
## 5 TCGA.MP.A4TH 1.966103  741      0   2
## 6 TCGA.55.8206 1.966330  888      0   2
summary.BMD <- fitTraining.BMD$summary
summary.BMD
##     Low.Risk High.Risk   cutoff   p.value
## 50%      123       123 2.379353 0.0001381
# write.table(summary.BMD, "results1/summary_BMD_lung_median.txt", row.names = T, col.names = T)
# DAD-screening + variable selection
# pdf(file="results/Train_KM_DAD_survCutpoint.pdf",width=15, height=7)
tic("Run time")
fitTraining.DAD <- CosmonetTraining(
  10,
  x1.norm,
  y1,
  screenVars.DAD,
  family="Cox",
  penalty="Net",
  Omega=Omega,
  alpha=alpha,
  nlambda = 50,
  foldid = foldids,
  selOptLambda="min",
  optCutpoint="survCutpoint"
)
Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 22: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 23: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

toc()
## Run time: 2.08 sec elapsed
# dev.off()
# Regression coefficients
beta.DAD <- fitTraining.DAD$beta
# Number of non-zero beta
sum(beta.DAD!=0)
## [1] 59
# write.table(beta.DAD, "results1/beta_DAD_lung.txt", row.names = T, col.names = F)

# Optimal parameters
pars.DAD <- data.frame(alpha=0.1,opt.lambda=fitTraining.DAD$opt.lambda,opt.cutoff=fitTraining.DAD$opt.cutoff)
pars.DAD
##   alpha opt.lambda opt.cutoff
## 1   0.1 0.04620194    5.65117
# Summary tables
df.train.DAD <- fitTraining.DAD$df
head(df.train.DAD)
##         sample         PI time status groupRisk id
## 1 TCGA.NJ.A4YQ -0.7283181 1432      0         2  1
## 2 TCGA.55.6987  0.4168007 2137      0         2  2
## 3 TCGA.78.7163  1.4052194 7248      0         2  3
## 4 TCGA.38.4626  1.6160155 3674      0         2  4
## 5 TCGA.55.6969  2.0784045 1239      0         2  5
## 6 TCGA.75.6206  2.2249807 2590      0         2  6
summary.DAD <- fitTraining.DAD$summary
summary.DAD
##    cutpoint statistic
## PI  5.65117  9.767933
# write.table(summary.DAD, "results1/summary_DAD_lung_survCutpoint.txt", row.names = T, col.names = T)
# (BMD+DAD-screening + variable selection
# pdf(file="results/Train_KM_BMD_DAD_minPValue.pdf",width=15, height=7)
tic("Run time")
fitTraining.BMD.DAD <- CosmonetTraining(
  10,
  x1.norm,
  y1,
  screenVars.BMD.DAD,
  family="Cox",
  penalty="Net",
  Omega=Omega,
  alpha=alpha,
  nlambda = 50,
  foldid = foldids,
  selOptLambda="min",
  optCutpoint="minPValue"
)
Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 24: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 25: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 26: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 27: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 28: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 29: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 30: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 31: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 32: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 33: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 34: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 35: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 36: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot.

Figure 37: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot

toc()
## Run time: 20.772 sec elapsed
# dev.off()
beta.BMD.DAD <- fitTraining.BMD.DAD$beta
sum(beta.BMD.DAD!=0)
## [1] 54
# write.table(beta.BMD.DAD, "results1/beta_BMD_DAD_lung.txt", row.names = T, col.names = F)

# Optimal parameters
pars.BMD.DAD <- data.frame(alpha=0.1,opt.lambda=fitTraining.BMD.DAD$opt.lambda,opt.cutoff=fitTraining.BMD.DAD$opt.cutoff)
pars.BMD.DAD
##   alpha opt.lambda opt.cutoff
## 1   0.1   0.531967   3.071756
# Summary tables
df.train.BMD.DAD <- fitTraining.BMD.DAD$df
head(df.train.BMD.DAD)
##         sample       PI time status 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75%
## 1 TCGA.55.8097 1.871829  476      0   2   2   2   2   2   2   2   2   2   2   2
## 2 TCGA.86.A4P8 1.883202  805      0   2   2   2   2   2   2   2   2   2   2   2
## 3 TCGA.55.8206 1.971239  888      0   2   2   2   2   2   2   2   2   2   2   2
## 4 TCGA.50.8457 1.993326 1125      0   2   2   2   2   2   2   2   2   2   2   2
## 5 TCGA.86.8056 2.016668  139      0   2   2   2   2   2   2   2   2   2   2   2
## 6 TCGA.MP.A4TH 2.018201  741      0   2   2   2   2   2   2   2   2   2   2   2
##   80%
## 1   2
## 2   2
## 3   2
## 4   2
## 5   2
## 6   2
summary.BMD.DAD <- fitTraining.BMD.DAD$summary
summary.BMD.DAD
##     Low.Risk High.Risk   cutoff   p.value
## 25%       62       184 2.559594 4.306e-08
## 30%       74       172 2.611505 2.410e-09
## 35%       86       160 2.657183 2.095e-10
## 40%       98       148 2.720584 4.211e-11
## 45%      111       135 2.776697 2.694e-10
## 50%      123       123 2.823742 5.474e-10
## 55%      135       111 2.883202 1.031e-11
## 60%      147        99 2.967608 4.441e-16
## 65%      160        86 3.034267 3.331e-15
## 70%      172        74 3.071756 1.110e-16
## 75%      184        62 3.169746 0.000e+00
## 80%      196        50 3.225793 2.220e-16
# write.table(summary.BMD.DAD, "results1/summary_BMD_DAD_lung_minPValue.txt", row.names = T, col.names = T)

Survival analysis

We perform the survival analysis using CosmonetTesting().

# Validation test for BMD-screening + variable selection
opt.cutoff.BMD <- fitTraining.BMD$opt.cutoff
# pdf(file="results/Test_KM_BMD_lung_median.pdf",width=15, height=7)
tic("Run time")
fitTesting.BMD <- CosmonetTesting(
  x2.norm,
  y2,
  screenVars.BMD,
  beta.BMD,
  opt.cutoff.BMD
)
Testing Phese (BMD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI.

Figure 38: Testing Phese (BMD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI

toc()
## Run time: 0.879 sec elapsed
# dev.off()

# Summary table
df.test.BMD <- fitTesting.BMD$df
head(df.test.BMD)
##           sample       PI time status groupRisk
## 29  TCGA.44.6148 1.792995  704      0         2
## 52  TCGA.49.AARR 1.883773 4992      0         2
## 64  TCGA.50.5942 1.885215 1847      0         2
## 228 TCGA.L4.A4E6 1.902645  435      0         2
## 192 TCGA.91.8497 1.908321  434      1         2
## 111 TCGA.55.8512 1.920400  607      1         2
# Validation test for DAD-screening + variable selection
opt.cutoff.DAD <- fitTraining.DAD$opt.cutoff
# pdf(file="results/Test_KM_DAD_lung_survCutpoint.pdf",width=15, height=7)
tic("Run time")
fitTesting.DAD <- CosmonetTesting(
  x2.norm,
  y2,
  screenVars.DAD,
  beta.DAD,
  opt.cutoff.DAD
)
Testing Phase (DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI.

Figure 39: Testing Phase (DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI

toc()
## Run time: 0.868 sec elapsed
# dev.off()

df.test.DAD <- fitTesting.DAD$df
head(df.test.DAD)
##           sample         PI time status groupRisk
## 65  TCGA.50.5946 -9.1809947 1617      0         2
## 115 TCGA.55.A490 -8.9224818   99      1         2
## 233 TCGA.L9.A7SV -4.8845769  565      0         2
## 236 TCGA.MN.A4N4 -0.8783555 1175      0         2
## 70  TCGA.50.7109 -0.5137000  308      1         2
## 80  TCGA.55.6968  0.3502241 1293      1         2
# Validation test for BMD+DAD-screening + variable selection
opt.cutoff.BMD.DAD <- fitTraining.BMD.DAD$opt.cutoff
# pdf(file="results/Test_KM_BMD_DAD_lung_minPValue.pdf",width=15, height=7)
tic("Run time")
fitTesting.BMD.DAD <- CosmonetTesting(
  x2.norm,
  y2,
  screenVars.BMD.DAD,
  beta.BMD.DAD,
  opt.cutoff.BMD.DAD
)
Testing Phase (BMD+DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI.

Figure 40: Testing Phase (BMD+DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI

toc()
## Run time: 0.977 sec elapsed
# dev.off()

df.test.BMD.DAD <- fitTesting.BMD.DAD$df
head(df.test.BMD.DAD)
##           sample       PI time status groupRisk
## 233 TCGA.L9.A7SV 1.679135  565      0         2
## 64  TCGA.50.5942 1.860430 1847      0         2
## 224 TCGA.99.AA5R 1.958567  658      0         2
## 228 TCGA.L4.A4E6 1.969429  435      0         2
## 26  TCGA.44.5645 1.984508  852      0         2
## 218 TCGA.97.A4M2 2.043961  624      0         2

Venn diagram plotting

We use VennPlot() function to show the relation between low-risk groups (or high-risk groups).

# High-risk groups
highRisk.BMD <- as.character(df.test.BMD$sample[which(df.test.BMD$groupRisk==1)])
highRisk.DAD <- as.character(df.test.DAD$sample[which(df.test.DAD$groupRisk==1)])
highRisk.BMD.DAD <- as.character(df.test.BMD.DAD$sample[which(df.test.BMD.DAD$groupRisk==1)])

# Low-risk groups
lowRisk.BMD <- as.character(df.test.BMD$sample[which(df.test.BMD$groupRisk==2)])
lowRisk.DAD <- as.character(df.test.DAD$sample[which(df.test.DAD$groupRisk==2)])
lowRisk.BMD.DAD <- as.character(df.test.BMD.DAD$sample[which(df.test.BMD.DAD$groupRisk==2)])
# Display Venn Diagram
vennPlot1 <- VennPlot(lowRisk.BMD, lowRisk.DAD, lowRisk.BMD.DAD, "LR-Group")
Venn diagramm of low-risk groups

Figure 41: Venn diagramm of low-risk groups

## NULL
# Display Venn Diagram
vennPlot2 <- VennPlot(highRisk.BMD, highRisk.DAD, highRisk.BMD.DAD, "HR-Group")
Venn diagramm of high-risk groups

Figure 42: Venn diagramm of high-risk groups

## NULL

Heat-map visualization

We use HeatmapSurvGroup() function to visualize the heat-map of the gene expression matrix on the training and testing dataset.

Heat-maps of the BMD-genes on training and testing set

Figure 43: Heat-maps of the BMD-genes on training and testing set

Heat-maps of the DAD-genes on training and testing set

Figure 44: Heat-maps of the DAD-genes on training and testing set

 Heat-maps of the BMD+DAD-genes on training and testing set

Figure 45: Heat-maps of the BMD+DAD-genes on training and testing set

Figure 42-44 show the Z-score expression matrix of the BMD-, DAD- and BMD+DAD-genes in the training and testing, respectively.

Correlation between PIs

We use corrPlot() function to perform the PI-correlation analysis on the training and testing set.

# Renames columns
colnames(df.train.BMD)[5] <- "groupRisk.BMD"
colnames(df.train.DAD)[5] <- "groupRisk.DAD"
colnames(df.train.BMD.DAD)[5] <- "groupRisk.BMD.DAD"

# Correlation plots of prognostic index PI
corrPlot1 <- CorrPlot(df1 = df.train.BMD,
                      df2 = df.train.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "DAD",
                      set = "Training set"
                      )
corrPlot2 <- CorrPlot(df1 = df.train.BMD,
                      df2 = df.train.BMD.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.BMD.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "BMD.DAD",
                      set = "Training set"
                      )

# Visualization
ggarrange(corrPlot1,corrPlot2, nrow = 1, ncol = 2, common.legend = T)
PI-correlation plots on training set

Figure 46: PI-correlation plots on training set

# ggsave(file="results/corr_train_lung.pdf",width=15, height=7)
# Rename columns
colnames(df.test.BMD)[5] <- "groupRisk.BMD"
colnames(df.test.DAD)[5] <- "groupRisk.DAD"
colnames(df.test.BMD.DAD)[5] <- "groupRisk.BMD.DAD"

# Correlation plots of prognostic index PI
corrPlot1 <- CorrPlot(df1 = df.test.BMD,
                      df2 = df.test.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "DAD",
                      set = "Testing set"
                      )
corrPlot2 <-CorrPlot(df1 = df.test.BMD,
                      df2 = df.test.BMD.DAD,
                      groupRisk1 = "groupRisk.BMD",
                      groupRisk2 = "groupRisk.BMD.DAD",
                      method="spearman",
                      name1 = "BMD",
                      name2 = "BMD.DAD",
                      set = "Testing set"
                      )

# Visualization
ggarrange(corrPlot1,corrPlot2, nrow = 1, ncol = 2, common.legend = T)
PI-correlation plots on testing set

Figure 47: PI-correlation plots on testing set

ggsave(file="results/corr_test_lung.pdf",width=15, height=7)

Pathway Analysis

Finally, the function GenerateNetwork() can be used to visualise the KEGG pathways shared between the gene signatures selected by the screening methods.

The example below generates the dashboard to investigate the KEGG pathways for the genes selected by the BMD+DAD screening.

# # Build a data frame composed by two columns: gene names and beta coefficients
# ListOfGenes <- data.frame(Symbol=rownames(beta.BMD.DAD), coefficients=beta.BMD.DAD[,1])
# rownames(ListOfGenes) <-1:dim(ListOfGenes)[1]
# # write.table(ListOfGenes, "ListOfGenes.txt", sep = "\t", row.names = F, col.names = T)
#
# tic("Run time")
# networkLung <- GenerateNetwork(
#   ListOfGenes = ListOfGenes,
#   header = TRUE,
#   diseaseID = "DOID:1324",
#   nodesCols = c('#EFCFD4','#D9E3FC','#FFFFFF')
# )
# toc()

Conclusions

COSMONET is a powerful package that can be used to identify new cancer biomarkers through screening and Cox-network approaches. Here, we presented two examples to show simple applications of the package. As reported in Iuliano et al., 2018, the framework implemented in COSMONET provides many more advantages than the ones shown above. Among others: (i) the screening-network procedures outperforms classical penalized Cox-regression methods, allowing to select sub-network signatures, (ii) the DAD-screening has good performance in absence of any previous information but it is sub optimal with respect to the BMD-screening, and (iii) BMD+DAD screening improves both BMD and DAD the performance of COSMONET. These results may improve prediction accuracy and guide clinicians in choosing specific therapies for better survival and in avoiding unnecessary treatments.

Session info

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] visNetwork_2.1.0    randomcoloR_1.1.0.1 graph_1.70.0       
##  [4] BiocGenerics_0.38.0 RColorBrewer_1.1-2  patchwork_1.1.1    
##  [7] pheatmap_1.0.12     ggplotify_0.1.0     VennDiagram_1.7.1  
## [10] futile.logger_1.4.3 miscset_1.1.0       gridExtra_2.3      
## [13] survminer_0.4.9     ggpubr_0.4.0        survival_3.2-13    
## [16] pbmcapply_1.5.0     plyr_1.8.6          APML0_0.10         
## [19] Matrix_1.3-4        caret_6.0-90        lattice_0.20-45    
## [22] ggplot2_3.3.5       magic_1.5-9         abind_1.4-5        
## [25] igraph_1.2.9        biomaRt_2.48.3      tictoc_1.0.1       
## [28] COSMONET_0.1.0      BiocStyle_2.20.2   
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2             tidyselect_1.1.1       htmlwidgets_1.5.4     
##   [4] RSQLite_2.2.8          AnnotationDbi_1.54.1   flexdashboard_0.5.2   
##   [7] Rtsne_0.15             pROC_1.18.0            devtools_2.4.3        
##  [10] maxstat_0.7-25         munsell_0.5.0          codetools_0.2-18      
##  [13] future_1.23.0          withr_2.4.3            colorspace_2.0-2      
##  [16] Biobase_2.52.0         filelock_1.0.2         highr_0.9             
##  [19] knitr_1.36             rstudioapi_0.13        stats4_4.1.1          
##  [22] ggsignif_0.6.3         listenv_0.8.0          labeling_0.4.2        
##  [25] GenomeInfoDbData_1.2.6 KMsurv_0.1-5           bit64_4.0.5           
##  [28] farver_2.1.0           rprojroot_2.0.2        parallelly_1.29.0     
##  [31] vctrs_0.3.8            generics_0.1.1         lambda.r_1.2.4        
##  [34] ipred_0.9-12           xfun_0.28              BiocFileCache_2.0.0   
##  [37] R6_2.5.1               markdown_1.1           GenomeInfoDb_1.28.4   
##  [40] bitops_1.0-7           cachem_1.0.6           gridGraphics_0.5-1    
##  [43] assertthat_0.2.1       scales_1.1.1           nnet_7.3-16           
##  [46] gtable_0.3.0           globals_0.14.0         processx_3.5.2        
##  [49] timeDate_3043.102      rlang_0.4.12           splines_4.1.1         
##  [52] rstatix_0.7.0          ModelMetrics_1.2.2.2   broom_0.7.10          
##  [55] BiocManager_1.30.16    yaml_2.2.1             reshape2_1.4.4        
##  [58] backports_1.4.0        gridtext_0.1.4         tools_4.1.1           
##  [61] lava_1.6.10            usethis_2.1.3          bookdown_0.24         
##  [64] ellipsis_0.3.2         jquerylib_0.1.4        sessioninfo_1.2.1     
##  [67] Rcpp_1.0.7             progress_1.2.2         zlibbioc_1.38.0       
##  [70] purrr_0.3.4            RCurl_1.98-1.5         ps_1.6.0              
##  [73] prettyunits_1.1.1      rpart_4.1-15           cowplot_1.1.1         
##  [76] S4Vectors_0.30.2       zoo_1.8-9              cluster_2.1.2         
##  [79] exactRankTests_0.8-34  fs_1.5.1               magrittr_2.0.1        
##  [82] data.table_1.14.2      futile.options_1.0.1   mvtnorm_1.1-3         
##  [85] pkgload_1.2.4          hms_1.1.1              evaluate_0.14         
##  [88] xtable_1.8-4           XML_3.99-0.8           IRanges_2.26.0        
##  [91] testthat_3.1.1         compiler_4.1.1         tibble_3.1.6          
##  [94] V8_3.6.0               crayon_1.4.2           htmltools_0.5.2       
##  [97] mgcv_1.8-38            tidyr_1.1.4            ggtext_0.1.1          
## [100] lubridate_1.8.0        DBI_1.1.1              formatR_1.11          
## [103] dbplyr_2.1.1           MASS_7.3-54            rappdirs_0.3.3        
## [106] car_3.0-12             cli_3.1.0              gower_0.2.2           
## [109] pkgconfig_2.0.3        km.ci_0.5-2            recipes_0.1.17        
## [112] xml2_1.3.3             foreach_1.5.1          bslib_0.3.1           
## [115] XVector_0.32.0         prodlim_2019.11.13     yulab.utils_0.0.4     
## [118] stringr_1.4.0          callr_3.7.0            digest_0.6.29         
## [121] Biostrings_2.60.2      rmarkdown_2.11         survMisc_0.5.5        
## [124] curl_4.3.2             lifecycle_1.0.1        nlme_3.1-153          
## [127] jsonlite_1.7.2         carData_3.0-4          desc_1.4.0            
## [130] fansi_0.5.0            pillar_1.6.4           KEGGREST_1.32.0       
## [133] fastmap_1.1.0          httr_1.4.2             pkgbuild_1.2.1        
## [136] glue_1.5.1             remotes_2.4.2          png_0.1-7             
## [139] iterators_1.0.13       bit_4.0.4              class_7.3-19          
## [142] stringi_1.7.6          sass_0.4.0             blob_1.2.2            
## [145] memoise_2.0.1          dplyr_1.0.7            future.apply_1.8.1