deconvBenchmarking is a versatile R package that facilitates comprehensive evaluation of deconvolution methods using bulk data simulated under various strategies. Built with a focus on realistic bulk simulation, it enables unbiased benchmarking of deconvolution methods and provides a set of key features, described throughout this tutorial, that allow for easy and efficient deconvolution benchmarking.
Given the flexibility of the benchmarking framework, many packages are not included as package dependencies. For full access to all the methods within the package, we suggest that users refer to the recommended package list and ensure that all the necessary packages are installed.
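As a quick convenience check, the sketch below reports which suggested packages are already available; the package names are collected from the list_*() outputs shown later in this vignette, and the exact installation source (CRAN, Bioconductor, or GitHub) differs per package, so please consult the recommended package list for installation instructions.
suggested = c('scran','limma','GSVA','debCAM','TOAST','EpiDISH','nnls','e1071',
              'preprocessCore','FactoMineR','msigdbr','SingleCellExperiment',
              'MuSiC','SCDC','immunedeconv','LinDeconSeq','linseed',
              'InstaPrism','BayesPrism','TCGAbiolinks')
installed = vapply(suggested, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message('Missing suggested packages: ', paste(suggested[!installed], collapse = ', '))
  # CRAN/Bioconductor packages can usually be installed with BiocManager::install();
  # GitHub-hosted packages (e.g. SCDC, InstaPrism) require remotes::install_github()
}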
library(deconvBenchmarking)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Loading required package: tidyr
#> Loading required package: tibble
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#> expand, pack, unpack
#> Loading required package: pbapply
#> Loading required package: fitdistrplus
#> Loading required package: MASS
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
#> Loading required package: survival
#> Loading required package: gtools
#> Loading required package: stringr
#> Loading required package: caret
#> Loading required package: ggplot2
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:survival':
#>
#> cluster
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:dplyr':
#>
#> combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#> dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#> grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#> rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#> union, unique, unsplit, which.max, which.min
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: data.table
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> Loading required package: reshape2
#>
#> Attaching package: 'reshape2'
#> The following objects are masked from 'package:data.table':
#>
#> dcast, melt
#> The following object is masked from 'package:tidyr':
#>
#> smiths
Below we will utilize a publicly available single-cell RNA-seq expression dataset to explore both the bulk simulation strategies and the benchmarking framework offered in the deconvBenchmarking package. Specifically, this scRNA-seq dataset comes from melanoma patients (GEO accession GSE115978; Jerby-Arnon et al. 2018), the same dataset we used in our benchmarking work (Hu and Chikina 2023).
pdata = fread('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE115nnn/GSE115978/suppl/GSE115978_cell.annotations.csv.gz')
tpm = fread('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE115nnn/GSE115978/suppl/GSE115978_tpm.csv.gz')
tpm = tpm %>% column_to_rownames('V1')
tpm_unlog=2^tpm-1 # the expression was originally in log scale, unlog it
pdata = pdata %>% column_to_rownames('cells')
core_cells = rownames(pdata)[pdata$cell.types!='?'] # remove unknown cell-types
tpm_unlog = tpm_unlog[,core_cells]
pdata = pdata[core_cells,]
Explore the basic information of the single-cell data. Specifically, we are interested in the library normalization of the expression matrix, the available cell-type and patient annotations, and the cell-type composition.
head(colSums(tpm_unlog)) # scRNA data with library normalization already
#> cy78_CD45_neg_1_B04_S496_comb
#> 1e+05
#> cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb
#> 1e+05
#> CY88_5_B10_S694_comb
#> 1e+05
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb
#> 1e+05
#> cy78_CD45_neg_3_H06_S762_comb
#> 1e+05
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb
#> 1e+05
head(pdata) # cell.type and patient_id information available for each cell
#> samples cell.types
#> cy78_CD45_neg_1_B04_S496_comb Mel78 Mal
#> cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb Mel79 Mal
#> CY88_5_B10_S694_comb Mel88 Mal
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb Mel79 Mal
#> cy78_CD45_neg_3_H06_S762_comb Mel78 Mal
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb Mel79 Mal
#> treatment.group Cohort
#> cy78_CD45_neg_1_B04_S496_comb post.treatment Tirosh
#> cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb treatment.naive Tirosh
#> CY88_5_B10_S694_comb post.treatment Tirosh
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb treatment.naive Tirosh
#> cy78_CD45_neg_3_H06_S762_comb post.treatment Tirosh
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb treatment.naive Tirosh
#> no.of.genes no.of.reads
#> cy78_CD45_neg_1_B04_S496_comb 8258 357919
#> cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb 2047 5727
#> CY88_5_B10_S694_comb 5375 139218
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb 5648 73996
#> cy78_CD45_neg_3_H06_S762_comb 7409 380341
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb 6988 92485
table(pdata$cell.types) # 9 cell types annotated
#>
#> B.cell CAF Endo. Macrophage Mal NK T.CD4
#> 818 106 104 420 2018 92 856
#> T.CD8 T.cell
#> 1759 706
To guide bulkSimulator() with predetermined cell-type fractions for aggregation, we offer a useful fracSimulator_Beta() function. This function enables realistic cell-type fraction simulation using the beta distribution. Use help(fracSimulator_Beta) for more details.
In this tutorial we will create a fraction weight table containing 100 samples. The diagnostic plot below illustrates that the fractions we simulate for the malignant cells (Mal) retain a distribution comparable to that of the input scRNA data.
simulated_frac = fracSimulator_Beta(scMeta = pdata,
n = 100,
colnames_of_sample = 'samples',
colnames_of_cellType = 'cell.types',
fixed_cell_type = 'Mal',
min.frac = 0.01)
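As a manual counterpart to the diagnostic plot, we can also contrast the simulated Mal fractions with the per-patient Mal fractions observed in the scRNA data. This is a plain ggplot2 sketch, not a package function, and it assumes simulated_frac is a matrix with one column per cell type, matching the simulated fraction tables shown later.
observed_mal = sapply(split(pdata$cell.types, pdata$samples), function(x) mean(x == 'Mal'))
frac_cmp = rbind(data.frame(frac = observed_mal, source = 'scRNA (per patient)'),
                 data.frame(frac = simulated_frac[,'Mal'], source = 'simulated'))
ggplot(frac_cmp, aes(x = frac, color = source)) +
  geom_density() +
  xlab('Mal fraction') +
  theme_bw()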
We provide a bulkSimulator() function for bulk simulation with various strategies. The methods argument enables the user to select from the 7 distinct simulation methods we provide.
Available bulk simulation methods can be queried with the list_bulkSimulator() function. Specifically, stepping through the homo, semi, and heter (or heter_sampleIDfree) simulations, the heterogeneity level within samples increases, with the final heter being very similar to real biological bulk samples. These simulated bulk data can be effectively employed to assess the impact of heterogeneity on various deconvolution methods. The other three methods are sourced from previously published benchmarking work or resources that offer bulk simulation capabilities.
list_bulkSimulator() also returns the corresponding function names to execute each method separately, as well as the packages required for each method.
list_bulkSimulator()
#> method function_to_call suggested_packages
#> 1 homo bulkSimulator_homo
#> 2 semi bulkSimulator_semi
#> 3 heter bulkSimulator_heter
#> 4 heter_sampleIDfree bulkSimulator_heter_sampleIDfree scran
#> 5 favilaco bulkSimulator_favilaco
#> 6 immunedeconv bulkSimulator_immunedeconv immunedeconv
#> 7 SCDC bulkSimulator_SCDC SCDC
In the following example, we build a list of bulk-simulation objects generated from all 7 methods provided. Note that the ‘favilaco’ method does not require a predetermined fraction weight table; instead, it generates the weights directly during simulation. Similarly, the ‘SCDC’ method only supports a predefined fraction weight table when all cell types are present in all biological samples; otherwise, it will raise errors. For both of these methods, we will only pass the nbulk parameter to determine the number of samples to be generated.
Use help(bulkSimulator) for more detailed information about the parameters and their usage.
simulated.res = bulkSimulator(methods = c('homo','semi','heter','heter_sampleIDfree','favilaco','immunedeconv','SCDC'),
scExpr = tpm_unlog,
scMeta = pdata,
colnames_of_cellType = 'cell.types',
colnames_of_sample = 'samples',
simulated_frac = simulated_frac,
# argument for 'semi' and 'heter_sampleIDfree' only
heter_cell_type = 'Mal',
# argument for 'heter_sampleIDfree' only
dirichlet_cs_par = 0.05, min.subcluster.size = 10,
# argument for 'favilaco' and 'SCDC' only
nbulk = 100,
n.core = 16)
The simulated bulk expression profiles are stored in a list, named by the simulation method:
names(simulated.res)
#> [1] "homo" "semi" "heter"
#> [4] "heter_sampleIDfree" "favilaco" "immunedeconv"
#> [7] "SCDC"
Each simulation object contains both the simulated bulk expression and the corresponding cell-type fraction table:
names(simulated.res$homo)
#> [1] "simulated_bulk" "simulated_frac"
simulated.res$homo$simulated_bulk[1:5,1:5]
#> simulated1 simulated2 simulated3 simulated4 simulated5
#> C9orf152 0.03225649 0.02270095 0.01231601 0.01023282 0.01867693
#> ELMO2 50.51284714 58.55703064 45.01181022 41.86111511 49.28709699
#> CREB3L1 1.08347499 1.67154708 1.19301488 0.31713822 1.86339398
#> PNMA1 35.83870873 37.36863668 33.80917117 45.02049255 33.71307638
#> MMP2 21.27101158 65.80600681 26.85940844 13.29984195 60.39544863
head(simulated.res$homo$simulated_frac)
#> Mal B.cell Macrophage NK T.CD4 T.CD8
#> [1,] 0.008170341 0.29711195 0.14014202 0.003288018 0.08627478 0.32420716
#> [2,] 0.176263659 0.19282543 0.01311334 0.005873293 0.18420626 0.32091902
#> [3,] 0.295717683 0.18178891 0.08257291 0.019307961 0.16306604 0.18260040
#> [4,] 0.726467826 0.08583150 0.01048725 0.001040942 0.04611261 0.07531340
#> [5,] 0.088333855 0.09275774 0.04901411 0.013796195 0.06896494 0.23462041
#> [6,] 0.192535295 0.27349112 0.01856578 0.042621953 0.18103902 0.09587791
#> T.cell Endo. CAF
#> [1,] 0.10398426 0.0105950536 0.02622641
#> [2,] 0.03960296 0.0008726330 0.06632340
#> [3,] 0.04621689 0.0049599745 0.02376922
#> [4,] 0.04262230 0.0002124054 0.01191176
#> [5,] 0.24217009 0.1521129673 0.05822970
#> [6,] 0.04193278 0.0801674524 0.07376870
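A couple of quick base-R checks (not package functions) confirm the expected structure: a genes-by-samples expression matrix, and per-sample fractions that are expected to sum to one.
dim(simulated.res$homo$simulated_bulk)            # genes x simulated samples
range(rowSums(simulated.res$homo$simulated_frac)) # each row of fractions should sum to ~1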
In our recent benchmarking work (Hu and Chikina 2023), we highlighted the importance of realistic bulk simulation and revealed how it can affect the deconvolution results of various methods. In this package, we provide a set of visualization tools to assess the realism of the bulk simulations in terms of biological variance, pairwise similarity and gene correlations.
To start with, we need a baseline bulk expression profile to compare with. For this purpose, we can employ the aggregated cells from the same patient as an approximation of real biological data, as it encompasses the necessary sample-level variance. We provide a simulator_Statistics() function to calculate the summary statistics for later visual assessment.
simulator_Statistics_res = simulator_Statistics(bulkSimulator_res = simulated.res,
baseline_expr = deconvBenchmarking:::build_ref_matrix(Expr = tpm_unlog,cell_type_labels = pdata$samples),
n.core = 16)
The resulting simulator_Statistics_res object comprises a list of summary statistics for each simulated bulk expression and the baseline expression. By default, gene-correlation and sample-correlation are calculated using the top 300 variable genes from the baseline expression. The simulator_Statistics() function also allows for user-defined genes for comparison. Use help(simulator_Statistics) for more details.
names(simulator_Statistics_res)
#> [1] "summary_stats" "gene_correlation" "sample_correlation"
Specifically, the summary_stats table contains gene-level summary statistics:
head(simulator_Statistics_res$summary_stats)
#> mean var cv MAD frac_of_zero gene class
#> 1 0.02366835 0.0002092568 0.61118379 0.01540419 0.03 C9orf152 homo
#> 2 5.55192595 0.0517970677 0.04099293 0.22218668 0.00 ELMO2 homo
#> 3 0.83144059 0.4493479952 0.80623228 0.73748906 0.07 CREB3L1 homo
#> 4 5.25374046 0.0723455290 0.05119614 0.23455803 0.00 PNMA1 homo
#> 5 4.93805293 0.8272436203 0.18418785 0.96870915 0.00 MMP2 homo
#> 6 5.26328424 0.0997377636 0.06000300 0.34329713 0.00 TMEM216 homo
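The summary_stats table can also be explored directly; for example, a quick look at the distribution of the coefficient of variation (CV) per simulation strategy. This is a plain ggplot2 sketch, not one of the package's plotting helpers.
ggplot(simulator_Statistics_res$summary_stats, aes(x = class, y = log10(cv + 1e-3))) +
  geom_boxplot(outlier.size = 0.3) +
  xlab('') + ylab('log10 CV') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))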
For a direct comparison of gene-level variance:
plot_variance_comparison(simulator_Statistics_res,desired_order = c('homo','favilaco','immunedeconv','semi','heter_sampleIDfree','heter','SCDC')) +
viridis::scale_color_viridis(limits=c(0,12000))
Pairwise-similarities between samples
plot_PairwiseSampleCorrelation(simulator_Statistics_res)
Gene-correlations
By default, the top 300 most variable genes from the baseline expression
are used to calculate gene-correlations.
plot_PairwiseGeneCorrelation(simulator_Statistics_res,nrow_panel = 1,desired_order = c('baseline','homo','favilaco','immunedeconv','semi','heter_sampleIDfree','heter','SCDC'))
In addition, we provide a simulator_averageCV() function that can summarize variance levels over biological pathways. This function can either take a gene list provided by the user or directly retrieve the hallmark gene sets from the msigdbr package (note that the msigdbr package must be installed for this feature to work).
hallmark_variance = simulator_averageCV(simulator_Statistics_res)
#> genelist not provided, will retrieve gene sets from msigdbr collections (default category: hallmark gene sets)
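To summarize variance over a custom set of pathways instead, a hedged sketch is shown below: the argument name genelist is inferred from the message above, and the expected format (a named list of gene vectors) is an assumption; check help(simulator_averageCV) for the exact interface.
my_sets = list(cell_cycle = c('MKI67','TOP2A','CCNB1'),   # illustrative gene sets only
               interferon = c('STAT1','IRF1','ISG15'))
custom_variance = simulator_averageCV(simulator_Statistics_res, genelist = my_sets)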
With this information, we can compare variance at the pathway level.
library(ComplexHeatmap)
library(circlize)
avg_cv_wide=spread(hallmark_variance,class,avg_cv) %>% column_to_rownames('genelist_name')
h = Heatmap(avg_cv_wide %>% as.matrix(),
col=colorRamp2(seq(0,0.65,0.05),viridis::viridis(14)),
column_names_gp = grid::gpar(fontsize = 8),
row_names_gp = grid::gpar(fontsize = 7),
show_row_dend = F,
heatmap_legend_param = list(title = "Average CV", title_gp = gpar(fontsize = 8),
labels_gp = gpar(fontsize = 8),
legend_direction = "horizontal"))
draw(h,heatmap_legend_side = "bottom")
bulkSimulator() parameters
So far we have shown the basic usage of the bulkSimulator() function, mostly with the default parameters. In real practice, some of the default parameters can be customized to align with specific needs and preferences. Below we provide an overview of the essential parameters for each simulation method, followed by a short sketch illustrating some of them. Note that all arguments for each method are integrated into the overarching bulkSimulator() function.
homo: the ncells_perSample parameter determines the number of cells to pool together to generate a simulated bulk sample. Use help(bulkSimulator_homo) for details.
semi: single cells of the heter_cell_type are constrained to originate from the same biological sample, while for other cell types, single cells are aggregated randomly without any consideration of their origin. use_chunk determines whether all the cells or a random portion of cells will be aggregated for the heter_cell_type. Use help(bulkSimulator_semi) for details.
heter: use_chunk determines whether all the cells or a random portion of cells will be aggregated, and min_chunkSize indicates the minimum number of cells to aggregate for each cell type. Use help(bulkSimulator_heter) for details.
heter_sampleIDfree: heterogeneity is introduced through sub-clusters of the heter_cell_type, while for the remaining cell-type components, a random number of sub-clusters is included. max.num.cs determines the maximum number of sub-clusters to include for a given cell type. If users have pre-defined sub-cluster information for each cell type in the metadata, they can specify the colnames_of_subcluster parameter to indicate this information. Otherwise, sub-clustering information for each cell type will be automatically generated with the scran::quickCluster() function, with the min.subcluster.size parameter being passed to scran::quickCluster() as the min.size parameter. dirichlet_cs_par is another hyper-parameter determining the dispersion level of sub-cluster fractions. Use help(bulkSimulator_heter_sampleIDfree) for details.
Parameters associated with other bulk simulation methods can be queried with help(), or refer to the source packages of these methods.
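For instance, a hedged sketch of calling one of the individual simulator functions with these parameters; the argument names are taken from the list above and from the integrated bulkSimulator() call, but defaults and exact signatures may differ, so consult help(bulkSimulator_heter).
heter_custom = bulkSimulator_heter(scExpr = tpm_unlog,
                                   scMeta = pdata,
                                   colnames_of_cellType = 'cell.types',
                                   colnames_of_sample = 'samples',
                                   simulated_frac = simulated_frac,
                                   use_chunk = TRUE,    # aggregate a random chunk of cells rather than all cells
                                   min_chunkSize = 20)  # minimum number of cells aggregated per cell type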
In the above section, we presented the essential bulkSimulator() parameters associated with each method. For easy comparison and visual understanding of how these parameters affect the simulation process, we offer the following examples.
Consider bulk simulation with the heter_sampleIDfree method, where we input different min.subcluster.size and max.num.cs parameters.
heter_sampleIDfree_sim1 = bulkSimulator_heter_sampleIDfree(scExpr = tpm_unlog,
scMeta = pdata,
colnames_of_cellType = 'cell.types',
simulated_frac = simulated_frac,
heter_cell_type = 'Mal',
dirichlet_cs_par = 0.05,
min.subcluster.size = 20)
heter_sampleIDfree_sim2 = bulkSimulator_heter_sampleIDfree(scExpr = tpm_unlog,
scMeta = pdata,
colnames_of_cellType = 'cell.types',
simulated_frac = simulated_frac,
heter_cell_type = 'Mal',
dirichlet_cs_par = 0.05,
min.subcluster.size = 10,
max.num.cs = 3)
sim_obj = list(heter_sampleIDfree_sim1 = heter_sampleIDfree_sim1,
heter_sampleIDfree_sim2 = heter_sampleIDfree_sim2)
Get summary statistics for each simulated bulk expression
sim_obj_Statistics = simulator_Statistics(bulkSimulator_res = sim_obj,
baseline_expr = deconvBenchmarking:::build_ref_matrix(Expr = tpm_unlog,cell_type_labels = pdata$samples))
#> Calculating summary statistics: mean, variance, cv and MAD
#> Calculating gene correlations
#> Calculating sample correlations
Visualization of the summary statistics:
Pairwise-similarities between samples
plot_PairwiseSampleCorrelation(sim_obj_Statistics)
Comparison of gene-level variance with respect to baseline expression.
plot_variance_comparison(sim_obj_Statistics) +
viridis::scale_color_viridis(limits=c(0,12000))
Gene-correlations
plot_PairwiseGeneCorrelation(sim_obj_Statistics,nrow_panel = 1,desired_order = c('baseline','heter_sampleIDfree_sim1','heter_sampleIDfree_sim2'))
In the above example, heter_sampleIDfree_sim2 shows relatively higher variance in terms of pairwise similarities between samples. This observation is reasonable since heter_sampleIDfree_sim2 employs a smaller min.subcluster.size and max.num.cs, indicating that it aggregates over more distinct sub-clusters within each cell type. As a consequence, the simulated samples exhibit higher variance.
The curated collection of bulk simulation methods presented above is seamlessly integrated into our benchmarking framework, enabling comprehensive evaluation of deconvolution methods under various simulation strategies.
We provide a useful benchmarking_init() function that takes an scRNA profile as input and generates an object intended for downstream deconvolution benchmarking, which contains the components listed below.
benchmarking_obj = benchmarking_init(scExpr = tpm_unlog,
scMeta = pdata,
colnames_of_cellType = 'cell.types',
colnames_of_sample = 'samples',
# argument for fracSimulator_Beta()
nbulk = 100,
fixed_cell_type = 'Mal',
# argument for training/testing splitting:
training_ratio = 0.5,
split_by = "cell",
# arguments for bulk simulation
bulkSimulator_methods = c('homo','semi','heter','heter_sampleIDfree','favilaco','immunedeconv','SCDC'),
# argument for semi/heter_sampleIDfree bulk simulation method
heter_cell_type = 'Mal',
dirichlet_cs_par = 0.05,
min.subcluster.size = 10,
# argument for marker constructions
refMarkers_methods = c('limma','scran'),
# arguments for signature matrices construction
refMatrix_methods = c('raw','limma','scran'),
# arguments to include tcga
include_tcga = T,
tcga_abbreviation = 'SKCM',
purity_methods = c('ESTIMATE', 'ABSOLUTE', 'LUMP', 'IHC', 'CPE'))
Specifically, benchmarking_init() is an integrated function that incorporates the functionalities of several individual functions (fraction simulation, training/testing splitting, bulk simulation, and marker/signature-matrix construction), aggregating their respective parameters. For details about each parameter set linked to these functions, use help(benchmarking_init) or call help() on each individual function.
names(benchmarking_obj)
#> [1] "splitting" "bulk_simulation_obj" "marker_list"
#> [4] "sigMarix_list"
By setting include_tcga = T, the function will automatically download the TCGA expression profile from the Xena browser and retrieve TCGA purity estimates from the TCGAbiolinks package and GDC. This information can later be compared with the deconvolution results as an additional evaluation of deconvolution performance.
names(benchmarking_obj$bulk_simulation_obj)
#> [1] "homo" "semi" "heter"
#> [4] "heter_sampleIDfree" "favilaco" "immunedeconv"
#> [7] "SCDC" "tcga_SKCM"
head(benchmarking_obj$bulk_simulation_obj$tcga_SKCM$simulated_frac)
#> ESTIMATE ABSOLUTE LUMP IHC CPE ABSOLUTE_GDC
#> TCGA-EE-A2GJ-06A 0.7449 NA 0.7161 0.95 0.7012 0.70
#> TCGA-EE-A2GI-06A 0.7761 NA 0.5858 0.95 0.7065 0.54
#> TCGA-WE-A8ZM-06A 0.9227 NA 0.9652 0.80 0.9055 0.85
#> TCGA-DA-A1IA-06A 0.9457 NA 0.9153 0.95 0.9352 0.94
#> TCGA-D3-A51H-06A 0.4881 NA 0.1088 0.75 0.3154 0.10
#> TCGA-XV-A9VZ-01A 0.9383 NA 0.8616 0.75 0.8721 0.67
To list the supported reference construction methods:
list_refMarix(show_description = T)
#> method function_to_call suggested_packages
#> 1 raw refMatrix_raw
#> 2 limma refMatirx_limma limma
#> 3 scran refMatirx_scran BayesPrism
#> 4 markerList refMatirx_markerList
#> description
#> 1 Construct signature matrix using average expression across cell types
#> 2 Construct signature matrix using markers derived from limma DE analysis
#> 3 Construct signature matrix using markers derived from scran DE analysis
#> 4 Construct signature matrix with a pre-calculated marker list
list_refMarkers(show_description = T)
#> method function_to_call suggested_packages
#> 1 limma refMarkers_limma limma
#> 2 scran refMarkers_scran BayesPrism
#> 3 sigMatrixList refMarkers_sigMatrixList
#> description
#> 1 Obtain cell-type specific markers with limma DE analysis
#> 2 Obtain cell-type specific markers with scran DE analysis
#> 3 Obtain cell-type specific markers from a list of signature matrices
We provide a useful benchmarking_deconv() function for deconvolution on the benchmarking_obj with user-selected methods. Specifically, benchmarking_deconv() is pre-implemented with deconvolution methods from four distinct categories: reference-free, regression-based, marker-based, and Bayesian methods. Moreover, we provide the option to run deconvolution methods from the immunedeconv package as well.
deconvResults = benchmarking_deconv(benchmarking_obj,
marker_based_methods = c('firstPC','gsva','debCAM','TOAST'),
# arguments for regression-based methods
regression_based_methods = c('nnls','cibersort','MuSiC','wRLM','RPC'),
cibersort_path = './', # argument for cibersort method only
scExpr = tpm_unlog, # arguments for MuSiC method only
scMeta = pdata,
colnames_of_cellType = 'cell.types',
colnames_of_sample = 'samples',
refFree_methods = c('linseed','debCAM'),
# arguments for Bayesian-based methods
Bayesian_methods = 'InstaPrism',
key = 'malignant', # this argument together with 'colnames_of_sample' is highly recommended for Bayesian-based methods
immunedeconv_methods = c('mcp_counter','epic','quantiseq','abis','consensus_tme','estimate'),
tcga_abbreviation = 'skcm'
)
Parameters associated with individual categories of deconvolution methods can be queried using help() on the corresponding functions (listed in the function_to_call column below).
To list available deconvolution methods and the dependent packages required for their execution:
list_deconv_Bayesian(show_description = T)
#> method function_to_call suggested_packages
#> 1 InstaPrism deconv_Bayesian_InstaPrism InstaPrism
#> 2 BayesPrism deconv_Bayesian_BayesPrism BayesPrism
#> description
#> 1 Bayesian-based deconvolution with InstaPrism
#> 2 Bayesian-based deconvolution with BayesPrism
list_deconv_marker(show_description = T)
#> method function_to_call suggested_packages
#> 1 firstPC deconv_marker_firstPC FactoMineR
#> 2 gsva deconv_marker_gsva GSVA
#> 3 debCAM deconv_marker_debCAM debCAM
#> 4 TOAST deconv_marker_TOAST TOAST
#> description
#> 1 Marker-based deconvolution with first principal score (abundance estimate only)
#> 2 Marker-based deconvolution with GSVA (abundance estimate only)
#> 3 Marker-based deconvolution with debCAM::AfromMarkers()
#> 4 Marker-based deconvolution with TOAST::MDeconv()
list_deconv_regression(show_description = T)
#> method function_to_call suggested_packages
#> 1 nnls deconv_regression_nnls nnls
#> 2 cibersort deconv_regression_cibersort e1071/parallel/preprocessCore
#> 3 MuSiC deconv_regression_MuSiC MuSiC/SingleCellExperiment
#> 4 wRLM deconv_regression_wRLM LinDeconSeq
#> 5 RPC deconv_RPC EpiDISH
#> description
#> 1 Regression-based deconvolution with Non-Negative Least Squares
#> 2 Regression-based deconvolution with CIBERSORT
#> 3 Regression-based deconvolution with MuSiC
#> 4 Regression-based deconvolution with weighted robust linear regression
#> 5 Regression-based deconvolution with robust partial correlations
list_deconv_refFree(show_description = T)
#> method function_to_call suggested_packages
#> 1 linseed deconv_refFree_linseed linseed
#> 2 debCAM deconv_refFree_debCAM debCAM
#> description
#> 1 Reference free deconvolution with linseed
#> 2 Reference free deconvolution with debCAM
The output from the benchmarking_deconv() function is a list containing deconvolution results for each simulated bulk expression:
names(deconvResults)
#> [1] "homo" "semi" "heter"
#> [4] "heter_sampleIDfree" "favilaco" "immunedeconv"
#> [7] "SCDC" "tcga_skcm"
Specifically, within each bulk expression, ground truth fraction and deconvolution results are stored in a list
names(deconvResults$homo)
#> [1] "deconvRes" "true_frac"
deconvRes stores deconvolution results obtained from all the evaluated methods, named using the format “deconvolution category_deconvolution method”, followed by the reference generation or updating method when applicable.
names(deconvResults$homo$deconvRes)
#> [1] "MarkerBased_firstPC_limma" "MarkerBased_firstPC_scran"
#> [3] "MarkerBased_gsva_limma" "MarkerBased_gsva_scran"
#> [5] "MarkerBased_debCAM_limma" "MarkerBased_debCAM_scran"
#> [7] "MarkerBased_TOAST_limma" "MarkerBased_TOAST_scran"
#> [9] "RefBased_nnls_raw" "RefBased_nnls_limma"
#> [11] "RefBased_nnls_scran" "RefBased_cibersort_limma"
#> [13] "RefBased_cibersort_scran" "RefBased_MuSiC_raw"
#> [15] "RefBased_MuSiC_limma" "RefBased_MuSiC_scran"
#> [17] "RefBased_wRLM_limma" "RefBased_wRLM_scran"
#> [19] "RefBased_RPC_limma" "RefBased_RPC_scran"
#> [21] "RefFree_linseed" "RefFree_debCAM"
#> [23] "Bayesian_InstaPrism_initial" "Bayesian_InstaPrism_updatedMal"
#> [25] "Bayesian_InstaPrism_updatedAll" "mcp_counter"
#> [27] "epic" "quantiseq"
#> [29] "abis" "consensus_tme"
#> [31] "estimate"
Note that MuSiC does not directly utilize the numeric values from the generated reference matrices; only the gene list from those reference matrices is used as the ‘marker’ input. RefBased_MuSiC_raw indicates that all genes are used as the ‘marker’ input, which is the default setting for MuSiC.
head(deconvResults$homo$deconvRes$RefBased_MuSiC_raw)
#> Mal T.CD8 CAF Macrophage NK T.CD4
#> simulated1 0.01389339 0.37244705 0.016777565 0.152799672 0.011397918 0.09545100
#> simulated2 0.17758368 0.34149168 0.055507699 0.003250587 0.011796850 0.11873110
#> simulated3 0.28983812 0.19693217 0.017852358 0.078655927 0.011706062 0.15124552
#> simulated4 0.69256470 0.07831563 0.003761517 0.005851545 0.001105379 0.05058576
#> simulated5 0.10862561 0.25300086 0.060397608 0.041044451 0.014977472 0.14071364
#> simulated6 0.20108719 0.12363975 0.064310758 0.015741896 0.028311708 0.15503925
#> T.cell B.cell Endo.
#> simulated1 0.05156056 0.2722830 0.01338986
#> simulated2 0.09948854 0.1788448 0.01330501
#> simulated3 0.06314639 0.1762775 0.01434597
#> simulated4 0.05632267 0.0784363 0.03305650
#> simulated5 0.14159196 0.1084621 0.13118633
#> simulated6 0.07064584 0.2643750 0.07684858
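Before the formal evaluation, individual results can be inspected directly against the ground truth. Below is a plain ggplot2 sketch (not a package function) plotting true versus estimated Mal fractions for one method; it assumes true_frac shares the cell-type columns and sample order of the estimates, as suggested by the per-cell-type tables shown later.
est = deconvResults$homo$deconvRes$RefBased_MuSiC_raw
true_frac = deconvResults$homo$true_frac
ggplot(data.frame(true = true_frac[,'Mal'], estimate = est[,'Mal']),
       aes(x = true, y = estimate)) +
  geom_point(size = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  xlab('true Mal fraction') + ylab('estimated Mal fraction') +
  theme_bw()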
With the deconvolution results obtained, we can finally evaluate the performance with the benchmarking_evalu() function:
performance = benchmarking_evalu(deconvResults)
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for homo >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for semi >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for heter >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for heter_sampleIDfree >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for favilaco >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for immunedeconv >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for SCDC >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for tcga_skcm >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Again, the performance object is separated for each simulated bulk expression:
names(performance)
#> [1] "homo" "semi" "heter"
#> [4] "heter_sampleIDfree" "favilaco" "immunedeconv"
#> [7] "SCDC" "tcga_skcm"
For a straightforward assessment of the deconvolution performance in terms of per-cell type correlation, we can:
performance$homo$maxCor[,1:4]
#> MarkerBased_firstPC_limma MarkerBased_firstPC_scran
#> B.cell 0.99 0.99
#> CAF 0.94 0.95
#> Endo. 0.98 0.99
#> Macrophage 0.99 0.99
#> Mal 1.00 1.00
#> NK 0.59 0.63
#> T.CD4 NA 0.85
#> T.CD8 0.96 0.96
#> T.cell NA NA
#> MarkerBased_gsva_limma MarkerBased_gsva_scran
#> B.cell 0.88 0.89
#> CAF 0.86 0.88
#> Endo. 0.87 0.89
#> Macrophage 0.77 0.82
#> Mal 0.85 0.93
#> NK 0.58 0.59
#> T.CD4 NA 0.70
#> T.CD8 0.79 0.83
#> T.cell NA NA
Or in terms of per-cell-type RMSE (root mean square error) values:
performance$homo$RMSE[,1:4]
#> MarkerBased_firstPC_limma MarkerBased_firstPC_scran
#> B.cell 6.31 6.36
#> CAF 5.93 6.02
#> Endo. 5.97 6.47
#> Macrophage 6.30 6.75
#> Mal 6.72 6.66
#> NK 5.84 5.67
#> T.CD4 NaN 2.85
#> T.CD8 2.76 6.50
#> T.cell NaN NaN
#> MarkerBased_gsva_limma MarkerBased_gsva_scran
#> B.cell 6083.22 5415.68
#> CAF 5781.36 3963.63
#> Endo. 6252.74 3526.25
#> Macrophage 8589.70 5648.36
#> Mal 9018.31 5509.67
#> NK 7094.51 5884.23
#> T.CD4 NaN 8147.19
#> T.CD8 9071.36 8500.19
#> T.cell NaN NaN
NA values indicate that the associated deconvolution performance statistic is not available. This situation may occur when reference construction methods with default parameters fail to identify specific cell-type information, for example when too few genes pass the log fold change threshold. In such cases, users can explore alternative parameters for reference construction, employ different methods to generate the reference, or explicitly label the specific cell type as ‘not available’ for this particular method.
Detailed performance of each method can be queried by
head(performance$homo$deconvEvalu$RefBased_MuSiC_raw$M)
#> cell_type true_frac maxCorName estimate
#> 1 B.cell 0.29711195 B.cell 0.2722830
#> 2 B.cell 0.19282543 B.cell 0.1788448
#> 3 B.cell 0.18178891 B.cell 0.1762775
#> 4 B.cell 0.08583150 B.cell 0.0784363
#> 5 B.cell 0.09275774 B.cell 0.1084621
#> 6 B.cell 0.27349112 B.cell 0.2643750
Finally, we provide a gather_performance() function to summarize all the performance statistics in a long table, facilitating further comparisons between deconvolution methods:
summarized_performance = gather_performance(performance)
head(summarized_performance)
#> cell_type RMSE cor maxCorName method class
#> 1 B.cell 6.31 0.99 B.cell MarkerBased_firstPC_limma homo
#> 2 CAF 5.93 0.94 CAF MarkerBased_firstPC_limma homo
#> 3 Endo. 5.97 0.98 Endo. MarkerBased_firstPC_limma homo
#> 4 Macrophage 6.30 0.99 Macrophage MarkerBased_firstPC_limma homo
#> 5 Mal 6.72 1.00 Mal MarkerBased_firstPC_limma homo
#> 6 NK 5.84 0.59 NK MarkerBased_firstPC_limma homo
For example, for a direct comparison of regression-based methods under different simulation strategies:
# for simplicity, we control that the reference used by each method is the same
df = summarized_performance[summarized_performance$class %in% c('homo','semi','heter_sampleIDfree','heter'),]
df = df[df$method %in% c('RefBased_cibersort_scran','RefBased_nnls_scran','RefBased_MuSiC_scran','RefBased_RPC_scran','RefBased_wRLM_scran'),]
df$reg_method = str_extract(df$method, "(?<=_)[^_]+")
pd <- position_dodge(width = 0.1)
ggplot(df,aes(x = reorder(class,cor,median,decreasing = T),
y = cor, group = reg_method)) +
geom_line(position = pd,aes(color=reg_method)) +
geom_point(position = pd,aes(color=reg_method,shape = reg_method )) +
# scale_color_manual(values = c('#F46D43','#FDAE61',"#8DA0CB","#E78AC3","#A6D854"))+
facet_wrap(~cell_type,scales = 'free',nrow = 2)+
theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1,size = 7),
strip.text.x = element_text(size = 8),
axis.text.y = element_text(size = 7),
axis.title.x = element_text(size = 7),
axis.title.y = element_text(size = 7),
legend.text = element_text(size = 7),
legend.title = element_text(size = 7),
strip.background = element_rect(fill = 'white'))+
xlab('')+
ylab('pearson r')
In the above example, we demonstrated a global performance drop as the heterogeneity level increased in the simulated bulk expression, with MuSiC exhibiting the highest per-cell-type correlations across all simulation strategies.