COSMONET 0.1.0
This tutorial is a guide on how to run and use the COSMONET R package. It is available from http://bioinfo.na.iac.cnr.it/cosmonet/.
COSMONET depends on different R packages such as APML0, survival, ggpurb, igraph, pheamap.
COSMONET can be installed through the following commands in the R console.
# ## From the command line
# ### 1. Open a terminal window
# ### 2. Go to the directory that contains your package directory.
# ### 3. Install the package
# R CMD INSTALL COSMONET_0.1.tar.gz
#
# ## From local source file
# ### Go to the link http://bioinfo.na.iac.cnr.it/cosmonet/ and download the package from the Repository - R package
# install.packages("COSMONET_0.1.0.tar.gz", repos = NULL, type="source")
COSMONET to investigate cancer dataThe aim of this section is to provide the users with a general view of the package, including the main functions, and to describe the package’s outputs. First, we upload the COSMONET R package.
library(COSMONET)
For illustration purposes, we use two examples of cancer data that are included in our package. The data processed by COSMONET are microarray from GEO database and RNA-seq data from GDC data portal. The aim of the study is to identify the genes and pathways that play a crucial role in the severity of the disease related to the survival of cancer patients.
Users can download the normalized data and survival information directly from the Repository of the website (Repository/Data/Download Data).
For microarray data, we consider expression measurements of two different breast cancer datasets from GEO database: * GSE2034 dataset as training set (Affymetrix Human Genome U133A Array); * GSE2990 dataset as testing set (Affymetrix Human Genome U133A Array).
RMA (Robust Multichip Average) function from affy R package for within arrays normalization and normalizeBetweenArrays function from limma R package for between arrays normalization are used for the normalization of each single dataset. The annotation is made by using biomart R package after removing redundant probe sets, i.e. probe sets that measure different regions of the same gene. Generally, the pre-processing of the data are done by the users. The summary data is showed in the table below:
| GEO Accession | Platform | Nr. of genes | Nr. of samples | Survival data |
|---|---|---|---|---|
| GSE2034 | Affymetrix HG-U133A | 13229 | 286 | Relapse Free Survival (RFS) |
| GSE2990 | Affymetrix HG-U133A | 13229 | 187 | Relapse Free Survival (RFS) |
In this study, the survival information is relapse-free survival, i.e., the length of time after primary treatment for a cancer ends that the patient survives without any signs or symptoms of that cancer. In other words, measure the relapse-free survival means to see how well a new treatment works. Also called DFS disease-free survival, and RFS.
To investigate the datasets, we first load the data in an R session and then, we perform the normalization between the training and testing set by using NormalizeBetweenData() function.
# Load the normalized data
load("data/BreastData_Normalized.RData")
# Load survival information: RFS - time (months) and status (alive=0 or death=1)
load("data/BreastData_Survival.RData")
To incorporate the a-priori biological information in the penalized regression model, we need to create the network matrix. In our analysis, to construct the molecular graph, we use CreateNetworkMatrix() function. The disease code associated with breast cancer is (DOID:1612.
# Load the tissue specific data
library(tictoc)
tic("Run time")
load("data/BreastData_TissueSpecificEdge.rda")
NetworkMatrix <- CreateNetworkMatrix(
x1.norm,
repositoryDisease,
diseaseID = "DOID:1612",
topGenes = 500,
tissueSpecificEdge,
output = TRUE,
message = TRUE
)
## [1] "Your list contains 500 genes. 500 out of 500 are used in the network."
## [1] "There are 43865 edges left."
toc()
## Run time: 297.62 sec elapsed
# Network matrix
Omega <- as.matrix(NetworkMatrix$Omega)
dim(Omega)
## [1] 13229 13229
In this section we present the application of the three screening techniques. We use the function ScreeningMethod() to identify the screened genes.
# Selection of the BMD-screened genes
cancer.genes <- NetworkMatrix$DiseaseGenes[,2]
screenVars.BMD <- ScreeningMethod(
x1.norm,
cancer.genes,
screening="bmd",
ranking=NULL,
thresh=NULL
)
## [1] "Number of BMD-screened genes: 437"
# Ranking of genes that are highly correlated with the patient's survival
load("data/BreastData_RankingGenes.RData")
head(rankingGenes)
## original.index symbol marg.coeff beta pvals
## 1 2449 CHRNA9 10.437125 -10.437125 2.030443e-02
## 2 3012 CXCL6 9.836375 -9.836375 2.122943e-02
## 3 9444 RACGAP1 8.915556 8.915556 1.484894e-07
## 4 7834 NR2E3 8.896399 -8.896399 8.931928e-05
## 5 2494 CKAP2 8.294733 8.294733 5.502541e-07
## 6 7414 MZB1 8.282109 -8.282109 2.372354e-04
The ranking of genes is obtained applying MarginalCoxRanking() function. Users can download the file of the ranked genes directly from the website (Repository/Data/Download Data).
# Plots
par(mfrow = c(1, 2))
plot(rankingGenes$marg.coeff, rankingGenes$pvals, col = "blue", pch = 20, cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
# Add the threshold line related to the 50th marginal coefficient (equal to 6.851106)
abline(v=rankingGenes[floor(dim(x1.norm)[1]/log(dim(x1.norm)[1])),3], col="red",lwd=2, lty=2)
plot(rankingGenes$pvals, 1:dim(x1.norm)[2], col = "blue", pch = 20, cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
# Add the threshold line equal to 50 (i.e., the first 50 genes)
abline(h=floor(dim(x1.norm)[1]/log(dim(x1.norm)[1])), col="red",lwd=2, lty=2)
Figure 1: Plots based on the \(p\)-values and marginal coefficients of the ranked cancer genes
These plots give an idea of the number of screened genes that can be involved in the analysis to give significance to the results (in the first plot the significant genes are positioned to the right of the dashed red line; in the second one the genes are are below the dashed red line). In particular, we set the threshold equal to \(\lfloor n/ log n \rfloor=50\), i.e., the 50 first genes as DAD-screened genes.
# Selection of the DAD-screened genes
screenVars.DAD <- ScreeningMethod(
x1.norm,
cancer.genes,
screening="dad",
ranking=rankingGenes,
thresh=floor(dim(x1.norm)[1]/log(dim(x1.norm)[1]))
)
## [1] "Number of DAD-screened genes: 50"
# Union of BMD- and DAD-screeened genes
screenVars.BMD.DAD <- ScreeningMethod(
x1.norm,
cancer.genes,
screening="bmd+dad",
ranking=rankingGenes,
thresh=floor(dim(x1.norm)[1]/log(dim(x1.norm)[1]))
)
## [1] "Number of BMD+DAD-screened genes: 549"
We consider as resulting set the union of the BMD and DAD-screened genes.
Now we are able to apply the penalized methods using the subset of screened variables in order to identify the active regression coefficients and the optimal cutoff (based on the prognostic index) on the training set. We can get reproducible results by fixing \(alpha\) (network influence) and the elements of each folds in the 5-fold cross validation. We perform the complete procedure applying CosmonetTraining() and SelectOptimalCutoff() functions.
# Set alpha and lambda parameters
alpha <- 0.5
# lambda <- 10^(seq(0,-2,-0.05))
# For reproducibility of the results
library(caret)
set.seed(2021)
flds <- createFolds(y1$time, k = 5, list = TRUE, returnTrain = FALSE)
foldids <- rep(1,dim(y1)[1])
foldids[flds$Fold2] = 2
foldids[flds$Fold3] = 3
foldids[flds$Fold4] = 4
foldids[flds$Fold5] = 5
# BMD-screening + variable selection
# pdf(file="results/Train_KM_BMD_median.pdf",width=15, height=7)
tic("Run time")
fitTraining.BMD <- CosmonetTraining(
10,
x1.norm,
y1,
screenVars.BMD,
family="Cox",
penalty="Net",
Omega=Omega,
alpha=alpha,
nlambda = 50,
foldid = foldids,
selOptLambda="min",
optCutpoint="median"
)
Figure 2: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot
Figure 3: Training Phase (BMD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot
toc()
## Run time: 11.303 sec elapsed
# dev.off()
# Regression coefficients
beta.BMD <- fitTraining.BMD$beta
# Number of non-zero beta
sum(beta.BMD!=0)
## [1] 40
# write.table(beta.BMD, file = "results/beta_BMD_breast.txt", row.names = T, col.names = F)
# Optimal parameters
pars.BMD <- data.frame(alpha=0.5,opt.lambda=fitTraining.BMD$opt.lambda,opt.cutoff=fitTraining.BMD$opt.cutoff)
pars.BMD
## alpha opt.lambda opt.cutoff
## 1 0.5 0.11899 -7.3697
# Summary tables
df.train.BMD <- fitTraining.BMD$df
head(df.train.BMD)
## sample PI time status 50%
## 1 GSM37048 -8.358512 87 0 2
## 2 GSM36945 -8.352637 92 0 2
## 3 GSM36984 -8.342822 138 0 2
## 4 GSM36852 -8.267668 107 0 2
## 5 GSM36883 -8.254200 88 0 2
## 6 GSM36845 -8.223623 54 0 2
summary.BMD <- fitTraining.BMD$summary
summary.BMD
## Low.Risk High.Risk cutoff p.value
## 50% 143 143 -7.3697 2.82e-14
# write.table(summary.BMD, file = "results/summary_BMD_breast.txt", row.names = T, col.names = T)
# DAD-screening + variable selection
# pdf(file="results/Train_KM_DAD_survCutpoint.pdf",width=15, height=7)
tic("Run time")
fitTraining.DAD <- CosmonetTraining(
10,
x1.norm,
y1,
screenVars.DAD,
family="Cox",
penalty="Net",
Omega=Omega,
alpha=alpha,
nlambda = 50,
foldid = foldids,
selOptLambda="min",
optCutpoint="survCutpoint"
)
Figure 4: Training Phase (DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot
Figure 5: Training Phase (DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot
toc()
## Run time: 1.624 sec elapsed
# dev.off()
# Regression coefficients
beta.DAD <- fitTraining.DAD$beta
# Number of non-zero beta
sum(beta.DAD!=0)
## [1] 32
# write.table(beta.DAD, file = "results/beta_DAD_breast.txt", row.names = T, col.names = F)
# Optimal parameters
pars.DAD <- data.frame(alpha=0.5,opt.lambda=fitTraining.DAD$opt.lambda,opt.cutoff=fitTraining.DAD$opt.cutoff)
pars.DAD
## alpha opt.lambda opt.cutoff
## 1 0.5 0.04740321 -10.15184
# Summary tables
df.train.DAD <- fitTraining.DAD$df
head(df.train.DAD)
## sample PI time status groupRisk id
## 1 GSM36933 -13.83397 110 0 2 1
## 2 GSM36817 -13.68776 113 0 2 2
## 3 GSM36851 -13.37413 100 0 2 3
## 4 GSM36915 -13.31251 97 0 2 4
## 5 GSM36863 -13.23986 98 0 2 5
## 6 GSM36794 -13.04705 87 0 2 6
summary.DAD <- fitTraining.DAD$summary
summary.DAD
## cutpoint statistic
## PI -10.15184 10.04115
# write.table(summary.DAD, file = "results/summary_DAD_breast.txt", row.names = T, col.names = T)
# BMD+DAD-screening + variable selection
# pdf(file="results/Train_KM_BMD_DAD_median.pdf",width=15, height=7)
tic("Run time")
fitTraining.BMD.DAD <- CosmonetTraining(
10,
x1.norm,
y1,
screenVars.BMD.DAD,
family="Cox",
penalty="Net",
Omega=Omega,
alpha=alpha,
nlambda = 50,
foldid = foldids,
selOptLambda="min",
optCutpoint="median"
)
Figure 6: Training Phase (BMD+DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot
Figure 7: Training Phase (BMD+DAD-screening): cross-validation plot , Kaplan-Meier curves and distribution plot
toc()
## Run time: 10.647 sec elapsed
# dev.off()
# Regression coefficients
beta.BMD.DAD <- fitTraining.BMD.DAD$beta
sum(beta.BMD.DAD!=0)
## [1] 67
# write.table(beta.BMD.DAD, file = "results/beta_BMD_DAD_breast.txt", row.names = T, col.names = F)
# Optimal parameters
pars.BMD.DAD <- data.frame(alpha=0.5,opt.lambda=fitTraining.BMD.DAD$opt.lambda,opt.cutoff=fitTraining.BMD.DAD$opt.cutoff)
pars.BMD.DAD
## alpha opt.lambda opt.cutoff
## 1 0.5 0.08331165 -23.89154
# Summary tables
df.train.BMD.DAD <- fitTraining.BMD.DAD$df
head(df.train.BMD.DAD)
## sample PI time status 50%
## 1 GSM36868 -26.13637 123 0 2
## 2 GSM36863 -26.03310 98 0 2
## 3 GSM37059 -25.95335 83 0 2
## 4 GSM36984 -25.90174 138 0 2
## 5 GSM36780 -25.86187 84 0 2
## 6 GSM36821 -25.83046 88 0 2
summary.BMD.DAD <- fitTraining.BMD.DAD$summary
summary.BMD.DAD
## Low.Risk High.Risk cutoff p.value
## 50% 143 143 -23.89154 0
# write.table(summary.BMD.DAD, file = "results/summary_BMD_DAD_breast.txt", row.names = T, col.names = T)
We perform the testing phase by using CosmonetTesting() and ValidationTest() functions to validate the survival model.
# Validation test for BMD-screening + variable selection
opt.cutoff.BMD <- fitTraining.BMD$opt.cutoff
# pdf(file="results/Test_KM_BMD_breast.pdf",width=15, height=7)
tic("Run time")
fitTesting.BMD <- CosmonetTesting(
x2.norm,
y2,
screenVars.BMD,
beta.BMD,
opt.cutoff.BMD
)
Figure 8: Testing Phese (BMD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI
toc()
## Run time: 0.812 sec elapsed
# dev.off()
# Summary table
df.test.BMD <- fitTesting.BMD$df
head(df.test.BMD)
## sample PI time status groupRisk
## 44 GSM65361 -8.539579 88.273973 0 2
## 182 GSM65875 -8.457235 107.243836 0 2
## 169 GSM65862 -8.424892 142.323288 0 2
## 41 GSM65358 -8.384166 66.969863 0 2
## 25 GSM65342 -8.350312 2.169863 0 2
## 185 GSM65878 -8.348892 8.810959 1 2
# Validation test for DAD-screening + variable selection
opt.cutoff.DAD <- fitTraining.DAD$opt.cutoff
# pdf(file="results/Test_KM_DAD_breast.pdf",width=15, height=7)
tic("Run time")
fitTesting.DAD <- CosmonetTesting(
x2.norm,
y2,
screenVars.DAD,
beta.DAD,
opt.cutoff.DAD
)
Figure 9: Testing Phase (DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI
toc()
## Run time: 0.79 sec elapsed
# dev.off()
df.test.DAD <- fitTesting.DAD$df
head(df.test.DAD)
## sample PI time status groupRisk
## 102 GSM65794 -12.51275 103.95616 0 2
## 107 GSM65799 -12.40898 87.94521 0 2
## 172 GSM65865 -12.40538 144.72329 0 2
## 150 GSM65843 -12.29438 84.23014 1 2
## 14 GSM65331 -11.85027 114.01644 0 2
## 178 GSM65871 -11.82383 104.84384 0 2
# Validation test for BMD+DAD-screening + variable selection
opt.cutoff.BMD.DAD <- fitTraining.BMD.DAD$opt.cutoff
# pdf(file="results/Test_KM_BMD_DAD_breast.pdf",width=15, height=7)
tic("Run time")
fitTesting.BMD.DAD <- CosmonetTesting(
x2.norm,
y2,
screenVars.BMD.DAD,
beta.BMD.DAD,
opt.cutoff.BMD.DAD
)
Figure 10: Testing Phase (BMD+DAD-screening): (A) Kaplan-Maier curves and risk table and (B) distribution of the risk score PI
toc()
## Run time: 0.775 sec elapsed
# dev.off()
df.test.BMD.DAD <- fitTesting.BMD.DAD$df
head(df.test.BMD.DAD)
## sample PI time status groupRisk
## 169 GSM65862 -26.11466 142.3233 0 2
## 172 GSM65865 -25.99728 144.7233 0 2
## 178 GSM65871 -25.71374 104.8438 0 2
## 171 GSM65864 -25.70033 120.5260 1 2
## 29 GSM65346 -25.55507 18.6411 0 2
## 161 GSM65854 -25.54970 147.1890 0 2
We use venn diagrams to show the relation between low-risk groups (or high-risk groups) of each screening method by considering the testing set. We use VennPlot() function.
# Low-risk groups
lowRisk.BMD <- as.character(df.test.BMD$sample[which(df.test.BMD$groupRisk==2)])
lowRisk.DAD <- as.character(df.test.DAD$sample[which(df.test.DAD$groupRisk==2)])
lowRisk.BMD.DAD <- as.character(df.test.BMD.DAD$sample[which(df.test.BMD.DAD$groupRisk==2)])
# High-risk groups
highRisk.BMD <- as.character(df.test.BMD$sample[which(df.test.BMD$groupRisk==1)])
highRisk.DAD <- as.character(df.test.DAD$sample[which(df.test.DAD$groupRisk==1)])
highRisk.BMD.DAD <- as.character(df.test.BMD.DAD$sample[which(df.test.BMD.DAD$groupRisk==1)])
# Display Venn Diagram
vennPlot1 <- VennPlot(lowRisk.BMD, lowRisk.DAD, lowRisk.BMD.DAD, "LR-Group")
Figure 11: Venn diagramm of low-risk groups on testing set
## NULL
# Display Venn Diagram
vennPlot2 <- VennPlot(highRisk.BMD, highRisk.DAD, highRisk.BMD.DAD, "HR-Group")
Figure 12: Venn diagramm of high-risk groups on testing set
## NULL
We use HeatmapSurvGroup() function to visualize the heat-map of the gene expression matrix on the training and testing set, respectively, by taking into account the screened genes and the prognostic information (high- and low-risk group).
Figure 13: Heat-maps of the BMD-genes on training and testing set
Figure 14: Heat-maps of the DAD-genes on training and testing set
Figure 15: Heat-maps of the BMD+DAD-genes on training and testing set
Figure 13-15 show the Z-score expression matrix of the BMD-, DAD- and BMD+DAD-genes in the training and testing, respectively. The genes in the horizontal direction are clustered in the same order in both sets. Z-scores of the gene expression measurements are used. Z-scores larger than 3.5 were set to 3.5 and Z-scores smaller than -3.5 were set to -3.5. Dark pink color indicates a high level of expression in breast cancer and green color indicates a low level of expression. The patients are divided in high-risk and low-risk groups (i.e., patients with bad and good prognosis).
We perform a correlation analysis on the training and testing set to further investigate the relation between prognostic indices PIs. We use corrPlot() function.
# Renames columns
colnames(df.train.BMD)[5] <- "groupRisk.BMD"
colnames(df.train.DAD)[5] <- "groupRisk.DAD"
colnames(df.train.BMD.DAD)[5] <- "groupRisk.BMD.DAD"
# Correlation plots of prognostic index PI
corrPlot1 <- CorrPlot(df1 = df.train.BMD,
df2 = df.train.DAD,
groupRisk1 = "groupRisk.BMD",
groupRisk2 = "groupRisk.DAD",
method="spearman",
name1 = "BMD",
name2 = "DAD",
set = "Training set"
)
corrPlot2 <- CorrPlot(df1 = df.train.BMD,
df2 = df.train.BMD.DAD,
groupRisk1 = "groupRisk.BMD",
groupRisk2 = "groupRisk.BMD.DAD",
method="spearman",
name1 = "BMD",
name2 = "BMD.DAD",
set = "Training set"
)
# Visualization
ggarrange(corrPlot1,corrPlot2, nrow = 1, ncol = 2, common.legend = T, font.label=list(color="black",size=14))
Figure 16: PI-correlation plots on training set
# ggsave(file="results/corr_train_breast.pdf",width=15, height=7)
# Rename columns
colnames(df.test.BMD)[5] <- "groupRisk.BMD"
colnames(df.test.DAD)[5] <- "groupRisk.DAD"
colnames(df.test.BMD.DAD)[5] <- "groupRisk.BMD.DAD"
# Correlation plots of prognostic index PI
corrPlot1 <- CorrPlot(df1 = df.test.BMD,
df2 = df.test.DAD,
groupRisk1 = "groupRisk.BMD",
groupRisk2 = "groupRisk.DAD",
method="spearman",
name1 = "BMD",
name2 = "DAD",
set = "Testing set"
)
corrPlot2 <- CorrPlot(df1 = df.test.BMD,
df2 = df.test.BMD.DAD,
groupRisk1 = "groupRisk.BMD",
groupRisk2 = "groupRisk.BMD.DAD",
method="spearman",
name1 = "BMD",
name2 = "BMD.DAD",
set = "Testing set"
)
# Visualization
ggarrange(corrPlot1,corrPlot2, nrow = 1, ncol = 2, common.legend = T, font.label=list(color="black",size=14))
Figure 17: PI-correlation plots on testing set
# ggsave(file="results/corr_test_breast.pdf",width=15, height=7)
COSMONET includes an RMarkdown script to generate an interactive dashboard of KEGG pathways networks. We report the network created running the proposed model using the BMD+DAD-screened genes. The dashboard contains two tabs, one for the not-isolated genes and one for the full network. Each node corresponds to a gene and the edges represent the KEGG pathways shared by the linked genes. Different node colors are used to show a correlation measure between the genes and the disease under investigation as follows:
Therefore, red nodes represent the BMD contribution to the signature and blue and white nodes the DAD contribution, not yet captured in the BMD list. Note that some of the red genes might also be retrieved from the data-driven screening. However, in this context, we want to underline and make sense of the novel information not yet considered.
We implemented the GenerateNetwork() function to visualize the interactive networks. The colours of the nodes and the ID of the disease to investigate can be changed by the user passing the new values as function’s parameters in RUNCosmonetDashboard.Rmd loaded in the same directory as the vignetta.
# Build a data frame composed by two columns: gene names and beta coefficients
ListOfGenes <- data.frame(Symbol=rownames(beta.BMD.DAD), coefficients=beta.BMD.DAD[,1])
rownames(ListOfGenes) <-1:dim(ListOfGenes)[1]
# write.table(ListOfGenes, "ListOfGenes.txt", sep = "\t", row.names = F, col.names = T)
tic("Run time")
networkBreast <- GenerateNetwork(
ListOfGenes = ListOfGenes,
header = TRUE,
diseaseID = "DOID:1612",
nodesCols = c('#EFCFD4','#D9E3FC','#FFFFFF')
)
toc()
The function will automatically open a new html page to visualise the dashboard. An example of the final dashboard is shown in the figure below (and it can be accessed at the following link).
knitr::include_graphics("COSMONET_NETWORK.pdf")