Introduction

deconvBenchmarking is a versatile R package that facilitates comprehensive evaluation of deconvolution methods using bulk data simulated under various strategies. Built with a focus on realistic bulk simulation, this powerful tool enables unbiased benchmarking of deconvolution methods. It provides the following key features that allow for easy and efficient deconvolution benchmarking:

  • Bulk Simulation Collection: a curated collection of bulk simulation methods with varying underlying heterogeneity levels to closely mirror real biological variability, as well as simulation methods from previously published benchmarking works (Avila Cobos et al. 2020; Sturm et al. 2019) and a simulation method published in the SCDC package (Dong et al. 2021). Additionally, the package comes with a set of visualization tools to assess the realism of the bulk simulations in terms of biological variance, pairwise similarity and gene correlations.
  • Built-in Deconvolution Methods: The package incorporates a diverse selection of pre-implemented deconvolution methods, thoughtfully categorized into four distinct groups: reference-free, regression-based, marker-based, and Bayesian methods. Researchers can easily apply and compare these methods on the simulated bulk data.
  • Built-in Reference Construction: The package provides a collection of reference construction methods to create the necessary input for regression-based and marker-based methods, which allows for combined evaluation of both deconvolution methods and reference construction methods.
  • User-Friendly Framework: deconvBenchmarking offers an intuitive and user-friendly framework for conducting deconvolution benchmarking experiments. It also offers the flexibility to include user-provided references and deconvolution methods for evaluation, allowing researchers to gain valuable insights into the methods being tested.
  • TCGA Expression Cohorts Integration: The package also offers options to include TCGA expression cohorts as an additional dataset for deconvolution evaluation. Users can directly compare the pre-calculated purity estimates (Colaprico et al. 2016) with the results obtained from deconvolution analyses, allowing for assessment of deconvolution methods in the context of real-world data from large-scale cancer studies.

Getting started

Given the flexibility of the benchmarking framework, many packages are not included as hard dependencies. For full access to all the methods within the package, we suggest that users refer to the recommended package list and ensure that all the necessary packages are installed.
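
A minimal sketch (not taken from the package documentation) of one way to do this: gather the suggested packages reported by the list_* helper functions introduced later in this vignette and install any that are missing. BiocManager::install() is used here because several of the suggestions are Bioconductor packages.

suggested = c(deconvBenchmarking::list_bulkSimulator()$suggested_packages,
              deconvBenchmarking::list_deconv_regression()$suggested_packages,
              deconvBenchmarking::list_deconv_marker()$suggested_packages)
# some entries list several packages separated by '/', others are empty strings
suggested = unique(unlist(strsplit(suggested, split = '/', fixed = TRUE)))
missing_pkgs = setdiff(suggested[nzchar(suggested)], rownames(installed.packages()))
if (length(missing_pkgs) > 0) BiocManager::install(missing_pkgs)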

library(deconvBenchmarking)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: tidyr
#> Loading required package: tibble
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loading required package: pbapply
#> Loading required package: fitdistrplus
#> Loading required package: MASS
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
#> Loading required package: survival
#> Loading required package: gtools
#> Loading required package: stringr
#> Loading required package: caret
#> Loading required package: ggplot2
#> Loading required package: lattice
#> 
#> Attaching package: 'caret'
#> The following object is masked from 'package:survival':
#> 
#>     cluster
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:dplyr':
#> 
#>     combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: data.table
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> Loading required package: reshape2
#> 
#> Attaching package: 'reshape2'
#> The following objects are masked from 'package:data.table':
#> 
#>     dcast, melt
#> The following object is masked from 'package:tidyr':
#> 
#>     smiths

Load example dataset

Below we utilize a publicly available single-cell RNA-seq expression dataset to investigate both the bulk simulation strategies and the benchmarking framework offered in the deconvBenchmarking package. Specifically, this scRNA-seq data is from melanoma patients, with GEO accession number GSE115978 (Jerby-Arnon et al. 2018), the same dataset we used in our benchmarking work (Hu and Chikina 2023).

pdata = fread('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE115nnn/GSE115978/suppl/GSE115978_cell.annotations.csv.gz')
tpm = fread('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE115nnn/GSE115978/suppl/GSE115978_tpm.csv.gz')
tpm = tpm %>% column_to_rownames('V1')
tpm_unlog=2^tpm-1 # the expression was originally in log scale, unlog it

pdata = pdata %>% column_to_rownames('cells')

core_cells = rownames(pdata)[pdata$cell.types!='?'] # remove unknown cell-types
tpm_unlog = tpm_unlog[,core_cells]
pdata = pdata[core_cells,]

Explore the basic information of the single-cell data. Specifically, we are interested in:

  • whether the scRNA-seq data is on a non-log scale (non-log scale is required)
  • cell type annotations for each cell
  • sample-ID information for each cell (indicating which cells belong to the same biological samples, optional)
head(colSums(tpm_unlog)) # scRNA data already library-size normalized (column sums = 1e5)
#>                   cy78_CD45_neg_1_B04_S496_comb 
#>                                           1e+05 
#>        cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb 
#>                                           1e+05 
#>                            CY88_5_B10_S694_comb 
#>                                           1e+05 
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb 
#>                                           1e+05 
#>                   cy78_CD45_neg_3_H06_S762_comb 
#>                                           1e+05 
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb 
#>                                           1e+05
head(pdata) # cell.types and samples (patient ID) information available for each cell
#>                                                 samples cell.types
#> cy78_CD45_neg_1_B04_S496_comb                     Mel78        Mal
#> cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb          Mel79        Mal
#> CY88_5_B10_S694_comb                              Mel88        Mal
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb   Mel79        Mal
#> cy78_CD45_neg_3_H06_S762_comb                     Mel78        Mal
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb   Mel79        Mal
#>                                                 treatment.group Cohort
#> cy78_CD45_neg_1_B04_S496_comb                    post.treatment Tirosh
#> cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb        treatment.naive Tirosh
#> CY88_5_B10_S694_comb                             post.treatment Tirosh
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb treatment.naive Tirosh
#> cy78_CD45_neg_3_H06_S762_comb                    post.treatment Tirosh
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb treatment.naive Tirosh
#>                                                 no.of.genes no.of.reads
#> cy78_CD45_neg_1_B04_S496_comb                          8258      357919
#> cy79_p4_CD45_neg_PDL1_neg_E11_S1115_comb               2047        5727
#> CY88_5_B10_S694_comb                                   5375      139218
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_F07_S67_comb        5648       73996
#> cy78_CD45_neg_3_H06_S762_comb                          7409      380341
#> cy79_p1_CD45_neg_PDL1_pos_AS_C1_R1_G01_S73_comb        6988       92485
table(pdata$cell.types) # 9 cell types annotated
#> 
#>     B.cell        CAF      Endo. Macrophage        Mal         NK      T.CD4 
#>        818        106        104        420       2018         92        856 
#>      T.CD8     T.cell 
#>       1759        706

Bulk RNA simulation

Step 1. Create a cell type fraction matrix

To guide bulkSimulator() with predetermined cell-type fractions for aggregation, we offer the fracSimulator_Beta() function, which enables realistic cell-type fraction simulation using the beta distribution. Use help(fracSimulator_Beta) for more details.
In this tutorial we create a fraction weight table containing 100 samples. The diagnostic plot below illustrates that the fractions we simulated for the malignant cells (Mal) retain a distribution comparable to that of the input scRNA data.

simulated_frac = fracSimulator_Beta(scMeta = pdata,
                                    n = 100,
                                    colnames_of_sample = 'samples',
                                    colnames_of_cellType = 'cell.types',
                                    fixed_cell_type = 'Mal',
                                    min.frac = 0.01)
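
Below is a hedged sketch of such a diagnostic check (fracSimulator_Beta() may already produce a similar plot on its own): it compares the simulated Mal fractions with the per-sample Mal fractions observed in the scRNA annotations, and assumes that simulated_frac has one column per cell type, as shown for the simulated objects later in this vignette.

# per-sample cell-type proportions observed in the scRNA data
observed_mal = prop.table(table(pdata$samples, pdata$cell.types), margin = 1)[, 'Mal']
# density of the simulated Mal fractions vs. the observed per-sample Mal fractions
plot(density(simulated_frac[, 'Mal']), main = 'Mal fraction', xlab = 'fraction', lty = 1)
lines(density(observed_mal), lty = 2)
legend('topright', legend = c('simulated', 'observed in scRNA'), lty = c(1, 2))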

Step 2. Bulk simulation from single cell data

We provide the bulkSimulator() function for bulk simulation with various strategies. The methods argument enables the user to select from the 7 distinct simulation methods we provide.

Available bulk simulation methods can be queried with the list_bulkSimulator() function. Specifically, stepping through the homo, semi, and heter (or heter_sampleIDfree) simulations, the heterogeneity level within samples increases, with the final heter simulation being very similar to real biological bulk samples. These simulated bulk data can be effectively employed to assess the impact of heterogeneity on various deconvolution methods. The other three methods are sourced from previously published benchmarking works or resources that offer bulk simulation capabilities.

list_bulkSimulator() also returns the corresponding function names to execute each method separately, as well as packages required for each method.

list_bulkSimulator()
#>               method                 function_to_call suggested_packages
#> 1               homo               bulkSimulator_homo                   
#> 2               semi               bulkSimulator_semi                   
#> 3              heter              bulkSimulator_heter                   
#> 4 heter_sampleIDfree bulkSimulator_heter_sampleIDfree              scran
#> 5           favilaco           bulkSimulator_favilaco                   
#> 6       immunedeconv       bulkSimulator_immunedeconv       immunedeconv
#> 7               SCDC               bulkSimulator_SCDC               SCDC

In the following example, we build a list of bulk-simulation objects generated from all 7 methods provided. Note that the ‘favilaco’ method does not require a predetermined fraction weight table; instead, it generates the weights directly during simulation. Similarly, the ‘SCDC’ method only supports a predefined fraction weight table when all cell types are present in all biological samples; otherwise, it will raise an error. For both of these methods, we will only pass the nbulk parameter to determine the number of samples to be generated.

Use help(bulkSimulator) for more detailed information about the parameters and their usage.

simulated.res = bulkSimulator(methods = c('homo','semi','heter','heter_sampleIDfree','favilaco','immunedeconv','SCDC'),
                              scExpr = tpm_unlog,
                              scMeta = pdata,
                              colnames_of_cellType = 'cell.types',
                              colnames_of_sample = 'samples',
                              simulated_frac = simulated_frac,
                              # argument for 'semi' and 'heter_sampleIDfree' only
                              heter_cell_type = 'Mal',   
                              
                              # argument for 'heter_sampleIDfree' only
                              dirichlet_cs_par = 0.05, min.subcluster.size = 10,  
                              
                              # argument for 'favilaco' and 'SCDC' only
                              nbulk = 100,  
                              n.core = 16)

The simulated bulk expression profiles are stored in a list, named by simulation method.

names(simulated.res)
#> [1] "homo"               "semi"               "heter"             
#> [4] "heter_sampleIDfree" "favilaco"           "immunedeconv"      
#> [7] "SCDC"

Each simulation object contains both the simulated bulk expression and the corresponding cell-type fraction table.

names(simulated.res$homo)
#> [1] "simulated_bulk" "simulated_frac"
simulated.res$homo$simulated_bulk[1:5,1:5]
#>           simulated1  simulated2  simulated3  simulated4  simulated5
#> C9orf152  0.03225649  0.02270095  0.01231601  0.01023282  0.01867693
#> ELMO2    50.51284714 58.55703064 45.01181022 41.86111511 49.28709699
#> CREB3L1   1.08347499  1.67154708  1.19301488  0.31713822  1.86339398
#> PNMA1    35.83870873 37.36863668 33.80917117 45.02049255 33.71307638
#> MMP2     21.27101158 65.80600681 26.85940844 13.29984195 60.39544863
head(simulated.res$homo$simulated_frac)
#>              Mal     B.cell Macrophage          NK      T.CD4      T.CD8
#> [1,] 0.008170341 0.29711195 0.14014202 0.003288018 0.08627478 0.32420716
#> [2,] 0.176263659 0.19282543 0.01311334 0.005873293 0.18420626 0.32091902
#> [3,] 0.295717683 0.18178891 0.08257291 0.019307961 0.16306604 0.18260040
#> [4,] 0.726467826 0.08583150 0.01048725 0.001040942 0.04611261 0.07531340
#> [5,] 0.088333855 0.09275774 0.04901411 0.013796195 0.06896494 0.23462041
#> [6,] 0.192535295 0.27349112 0.01856578 0.042621953 0.18103902 0.09587791
#>          T.cell        Endo.        CAF
#> [1,] 0.10398426 0.0105950536 0.02622641
#> [2,] 0.03960296 0.0008726330 0.06632340
#> [3,] 0.04621689 0.0049599745 0.02376922
#> [4,] 0.04262230 0.0002124054 0.01191176
#> [5,] 0.24217009 0.1521129673 0.05822970
#> [6,] 0.04193278 0.0801674524 0.07376870

Step 3. Evaluation of simulated data

In our recent benchmarking work (Hu and Chikina 2023), we highlighted the importance of realistic bulk simulation and revealed how it can affect the deconvolution results of various methods. In this package, we provide a set of visualization tools to assess the realism of the bulk simulations in terms of biological variance, pairwise similarity and gene correlations.

To start with, we need a baseline bulk expression profile to compare against. For this purpose, we can employ cells aggregated by patient as an approximation of real biological data, as this encompasses the necessary sample-level variance. We provide the simulator_Statistics() function to calculate the summary statistics for later visual assessment.

simulator_Statistics_res = simulator_Statistics(bulkSimulator_res = simulated.res,
                                       baseline_expr = deconvBenchmarking:::build_ref_matrix(Expr = tpm_unlog,cell_type_labels = pdata$samples),
                                       n.core = 16)

The resulting simulator_Statistics_res object comprises a list of summary statistics for each simulated bulk expression and the baseline expression. By default, gene correlations and sample correlations are calculated using the top 300 most variable genes from the baseline expression. The simulator_Statistics() function also allows user-defined genes for comparison.
Use help(simulator_Statistics) for more details.

names(simulator_Statistics_res)
#> [1] "summary_stats"      "gene_correlation"   "sample_correlation"

Specifically, the summary_stats table contains gene-level summary statistics

head(simulator_Statistics_res$summary_stats)
#>         mean          var         cv        MAD frac_of_zero     gene class
#> 1 0.02366835 0.0002092568 0.61118379 0.01540419         0.03 C9orf152  homo
#> 2 5.55192595 0.0517970677 0.04099293 0.22218668         0.00    ELMO2  homo
#> 3 0.83144059 0.4493479952 0.80623228 0.73748906         0.07  CREB3L1  homo
#> 4 5.25374046 0.0723455290 0.05119614 0.23455803         0.00    PNMA1  homo
#> 5 4.93805293 0.8272436203 0.18418785 0.96870915         0.00     MMP2  homo
#> 6 5.26328424 0.0997377636 0.06000300 0.34329713         0.00  TMEM216  homo

For a direct comparison of gene-level variance:

plot_variance_comparison(simulator_Statistics_res,desired_order =  c('homo','favilaco','immunedeconv','semi','heter_sampleIDfree','heter','SCDC')) +
  viridis::scale_color_viridis(limits=c(0,12000))

Pairwise similarities between samples:

plot_PairwiseSampleCorrelation(simulator_Statistics_res)

Gene correlations:
By default, the top 300 most variable genes from the baseline expression are used to calculate gene correlations.

plot_PairwiseGeneCorrelation(simulator_Statistics_res,nrow_panel = 1,desired_order =  c('baseline','homo','favilaco','immunedeconv','semi','heter_sampleIDfree','heter','SCDC'))

In addition, we provide a simulator_averageCV() function that summarizes variance levels over biological pathways. This function can either take a gene list provided by the user (a sketch of this option follows the default call below) or directly retrieve the hallmark gene sets from the msigdbr package (note that the msigdbr package must be installed for this feature to work).

hallmark_variance = simulator_averageCV(simulator_Statistics_res)
#> genelist not provided, will retrieve gene sets from msigdbr collections (default category: hallmark gene sets)
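
As a hedged sketch of the user-defined option: the argument name genelist and the expected format (a named list of gene vectors) are inferred from the message above rather than verified against the documentation, and the toy gene sets below are for illustration only.

toy_genelist = list(toy_set1 = c('MMP2', 'CREB3L1', 'PNMA1'),
                    toy_set2 = c('ELMO2', 'TMEM216', 'C9orf152'))
toy_variance = simulator_averageCV(simulator_Statistics_res, genelist = toy_genelist)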

With this information, we can compare variance at the pathway level.

library(ComplexHeatmap)
library(circlize) 
avg_cv_wide=spread(hallmark_variance,class,avg_cv) %>% column_to_rownames('genelist_name')
h = Heatmap(avg_cv_wide %>% as.matrix(),
        col=colorRamp2(seq(0,0.65,0.05),viridis::viridis(14)),
        column_names_gp = grid::gpar(fontsize = 8),
        row_names_gp = grid::gpar(fontsize = 7),
        show_row_dend = F,
        heatmap_legend_param = list(title = "Average CV", title_gp = gpar(fontsize = 8), 
                                    labels_gp = gpar(fontsize = 8),
                                    legend_direction = "horizontal"))
draw(h,heatmap_legend_side = "bottom")

additional notes on bulkSimulator() parameters

So far we have shown the basic usage of the bulkSimulator() function, mostly with default parameters. In practice, some of the default parameters can be customized to align with specific needs and preferences. Below we provide an overview of the essential parameters for each simulation method. Note that all arguments for each method are integrated into the overarching bulkSimulator() function.

  • homo: Homogeneous bulk simulation; single cells are aggregated randomly within each cell type, regardless of which sample they come from. The ncells_perSample parameter determines the number of cells to pool together to generate a simulated bulk sample. Use help(bulkSimulator_homo) for details.
  • semi: Semi-heterogeneous bulk simulation; during single-cell aggregation, cells from the heter_cell_type are constrained to originate from the same biological sample, whereas for the other cell types, single cells are aggregated randomly without any consideration of their origin. use_chunk determines whether all cells or a random portion of cells will be aggregated for the heter_cell_type. Use help(bulkSimulator_semi) for details.
  • heter: Heterogeneous bulk simulation; each cell-type component in the simulated bulk sample is constrained to originate from the same biological sample. use_chunk determines whether all cells or a random portion of cells will be aggregated, and min_chunkSize indicates the minimum number of cells to aggregate for each cell type (see the sketch after this list). Use help(bulkSimulator_heter) for details.
  • heter_sampleIDfree: Sample-ID-independent heterogeneous bulk simulation; each cell-type component in the simulated bulk sample is constrained to originate from the same sub-cluster. Specifically, the simulated bulk sample comprises a single sub-cluster component for the heter_cell_type, while for the remaining cell-type components it includes a random number of sub-clusters. max.num.cs determines the maximum number of sub-clusters to include for a given cell type. If users have pre-defined sub-cluster information for each cell type in the metadata, they can specify the colnames_of_subcluster parameter to indicate this information. Otherwise, sub-clustering information for each cell type will be automatically generated with the scran::quickCluster() function, with the min.subcluster.size parameter being passed to scran::quickCluster() as the min.size argument. dirichlet_cs_par is another hyperparameter determining the dispersion level of the sub-cluster fractions. Use help(bulkSimulator_heter_sampleIDfree) for details.
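
As an illustration of the use_chunk and min_chunkSize options described for the heter method above, here is a hedged sketch calling bulkSimulator_heter() directly. The argument set mirrors the integrated bulkSimulator() call shown earlier; the logical type of use_chunk and the chunk size of 30 are assumptions rather than documented defaults.

heter_chunk_sim = bulkSimulator_heter(scExpr = tpm_unlog,
                                      scMeta = pdata,
                                      colnames_of_cellType = 'cell.types',
                                      colnames_of_sample = 'samples',
                                      simulated_frac = simulated_frac,
                                      use_chunk = TRUE,    # aggregate a random chunk of cells instead of all cells
                                      min_chunkSize = 30)  # assumed minimum of 30 cells per cell-type component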

Parameters associated with the other bulk simulation methods can be queried with help(), or by referring to the source packages of these methods.

advanced practice: explore bulkSimulator() parameters

In the section above we presented the essential parameters associated with each method. For an easy comparison and a visual understanding of how these parameters affect the simulation process, we offer the following examples.

Consider bulk simulation with the heter_sampleIDfree method where we input different min.subcluster.size and max.num.cs parameters.

heter_sampleIDfree_sim1 = bulkSimulator_heter_sampleIDfree(scExpr = tpm_unlog,
                              scMeta = pdata,
                              colnames_of_cellType = 'cell.types',
                              simulated_frac = simulated_frac,
                              heter_cell_type = 'Mal',  
                              dirichlet_cs_par = 0.05, 
                              min.subcluster.size = 20)

heter_sampleIDfree_sim2 = bulkSimulator_heter_sampleIDfree(scExpr = tpm_unlog,
                              scMeta = pdata,
                              colnames_of_cellType = 'cell.types',
                              simulated_frac = simulated_frac,
                              heter_cell_type = 'Mal',  
                              dirichlet_cs_par = 0.05, 
                              min.subcluster.size = 10,
                              max.num.cs = 3)

sim_obj = list(heter_sampleIDfree_sim1 = heter_sampleIDfree_sim1,
               heter_sampleIDfree_sim2 = heter_sampleIDfree_sim2)

Get summary statistics for each simulated bulk expression

sim_obj_Statistics = simulator_Statistics(bulkSimulator_res = sim_obj,
                                       baseline_expr = deconvBenchmarking:::build_ref_matrix(Expr = tpm_unlog,cell_type_labels = pdata$samples))
#> Calculating summary statistics: mean, variance, cv and MAD
#> Calculating gene correlations
#> Calculating sample correlations

Visualization of the summary statistics:
Pairwise similarities between samples:

plot_PairwiseSampleCorrelation(sim_obj_Statistics)

Comparison of gene-level variance with respect to baseline expression.

plot_variance_comparison(sim_obj_Statistics) +
  viridis::scale_color_viridis(limits=c(0,12000))

Gene correlations:

plot_PairwiseGeneCorrelation(sim_obj_Statistics,nrow_panel = 1,desired_order = c('baseline','heter_sampleIDfree_sim1','heter_sampleIDfree_sim2'))

In the above example, heter_sampleIDfree_sim2 exhibited relatively higher variance in terms of pairwise similarities between samples. This observation is reasonable, since heter_sampleIDfree_sim2 employs a smaller min.subcluster.size and max.num.cs, indicating that it aggregates over more distinct sub-clusters within each cell type. As a consequence, the simulated samples exhibit higher variance.

Benchmarking framework

The curated collection of bulk simulation methods presented above is seamlessly integrated into our benchmarking framework, enabling comprehensive evaluation of deconvolution methods under various simulation strategies.

Step 1. Initiate a benchmarking object

We provide the benchmarking_init() function, which takes a scRNA-seq profile as input and generates an object intended for subsequent deconvolution benchmarking. The object contains:

  1. a list of training/testing cells
  2. a list of simulated bulk objects and/or TCGA expression
  3. a list of cell-type specific markers (for marker-based deconvolution)
  4. a list of signature matrices (for regression-based deconvolution)
benchmarking_obj = benchmarking_init(scExpr = tpm_unlog,
                  scMeta = pdata,
                  colnames_of_cellType = 'cell.types',
                  colnames_of_sample = 'samples',

                  # argument for fracSimulator_Beta()
                  nbulk = 100,
                  fixed_cell_type = 'Mal',

                  # argument for training/testing splitting:
                  training_ratio = 0.5,
                  split_by = "cell",

                  # arguments for bulk simulation
                  bulkSimulator_methods = c('homo','semi','heter','heter_sampleIDfree','favilaco','immunedeconv','SCDC'),

                  # argument for semi/heter_sampleIDfree bulk simulation method
                  heter_cell_type = 'Mal',
                  dirichlet_cs_par = 0.05,
                  min.subcluster.size = 10,

                  # argument for marker constructions
                  refMarkers_methods = c('limma','scran'),

                  # arguments for signature matrices construction
                  refMatrix_methods = c('raw','limma','scran'),

                  # arguments to include tcga
                  include_tcga = T,
                  tcga_abbreviation = 'SKCM',
                  purity_methods =  c('ESTIMATE', 'ABSOLUTE', 'LUMP', 'IHC', 'CPE'))

Specifically, benchmarking_init() is an integrated function that incorporates the functionalities of the following individual functions, aggregating their respective parameters. For details about each parameter set linked to these functions, use help(benchmarking_init) or call help() on each individual function.

  • create_train_test_splitting(): Training/testing cells splitting
  • fracSimulator_Beta(): Simulate realistic cell-type fractions from beta distribution
  • bulkSimulator(): Bulk simulation function
  • refMarkers(): Obtain cell-type specific markers from scRNA reference
  • refMatrix(): Construct signature matrix from scRNA reference
  • pre_refMatrix_autogeneS(): Prepare input for autogeneS
  • pre_refMatrix_cibersortx(): Prepare input for cibersortx server
names(benchmarking_obj)
#> [1] "splitting"           "bulk_simulation_obj" "marker_list"        
#> [4] "sigMarix_list"

By setting include_tcga = T, the function automatically downloads the TCGA expression profile from the Xena browser and retrieves TCGA purity estimates from the TCGAbiolinks package and GDC. This information can later be compared with the deconvolution results as an additional evaluation of deconvolution performance.

names(benchmarking_obj$bulk_simulation_obj)
#> [1] "homo"               "semi"               "heter"             
#> [4] "heter_sampleIDfree" "favilaco"           "immunedeconv"      
#> [7] "SCDC"               "tcga_SKCM"
head(benchmarking_obj$bulk_simulation_obj$tcga_SKCM$simulated_frac)
#>                  ESTIMATE ABSOLUTE   LUMP  IHC    CPE ABSOLUTE_GDC
#> TCGA-EE-A2GJ-06A   0.7449       NA 0.7161 0.95 0.7012         0.70
#> TCGA-EE-A2GI-06A   0.7761       NA 0.5858 0.95 0.7065         0.54
#> TCGA-WE-A8ZM-06A   0.9227       NA 0.9652 0.80 0.9055         0.85
#> TCGA-DA-A1IA-06A   0.9457       NA 0.9153 0.95 0.9352         0.94
#> TCGA-D3-A51H-06A   0.4881       NA 0.1088 0.75 0.3154         0.10
#> TCGA-XV-A9VZ-01A   0.9383       NA 0.8616 0.75 0.8721         0.67

To list the supported reference construction methods:

list_refMarix(show_description = T)
#>       method     function_to_call suggested_packages
#> 1        raw        refMatrix_raw                   
#> 2      limma      refMatirx_limma              limma
#> 3      scran      refMatirx_scran         BayesPrism
#> 4 markerList refMatirx_markerList                   
#>                                                               description
#> 1   Construct signature matrix using average expression across cell types
#> 2 Construct signature matrix using markers derived from limma DE analysis
#> 3 Construct signature matrix using markers derived from scran DE analysis
#> 4            Construct signature matrix with a pre-calculated marker list
list_refMarkers(show_description = T)
#>          method         function_to_call suggested_packages
#> 1         limma         refMarkers_limma              limma
#> 2         scran         refMarkers_scran         BayesPrism
#> 3 sigMatrixList refMarkers_sigMatrixList                   
#>                                                           description
#> 1            Obtain cell-type specific markers with limma DE analysis
#> 2            Obtain cell-type specific markers with scran DE analysis
#> 3 Obtain cell-type specific markers from a list of signature matrices

Step 2. Perform deconvolution on the benchmarking_obj

We provide the benchmarking_deconv() function for deconvolution on the benchmarking_obj with user-selected methods. Specifically, benchmarking_deconv() is pre-implemented with deconvolution methods categorized into four distinct groups: reference-free, regression-based, marker-based, and Bayesian methods. Moreover, we provide the option to run deconvolution methods from the immunedeconv package as well.

deconvResults = benchmarking_deconv(benchmarking_obj,
                                    marker_based_methods = c('firstPC','gsva','debCAM','TOAST'),
                                    
                                    # arguments for regression-based methods
                                    regression_based_methods = c('nnls','cibersort','MuSiC','wRLM','RPC'),
                                    cibersort_path = './', # argument for cibersort method only
                                    scExpr = tpm_unlog,  # arguments for MuSiC method only
                                    scMeta = pdata,
                                    colnames_of_cellType = 'cell.types',
                                    colnames_of_sample = 'samples',
                                    
                                    refFree_methods = c('linseed','debCAM'),
                                    
                                    # arguments for Bayesian-based methods
                                    Bayesian_methods = 'InstaPrism', 
                                    key = 'malignant',  # this argument together with 'colnames_of_sample' is highly recommended to run Bayesian-based methods
                                    
                                    immunedeconv_methods = c('mcp_counter','epic','quantiseq','abis','consensus_tme','estimate'),
                                    tcga_abbreviation = 'skcm' 
                                    )

Parameters associated with individual categories of deconvolution methods can be queried using help() on the following functions:

  • deconv_Bayesian(): Bayesian-based deconvolution methods, available methods include BayesPrism and InstaPrism
  • deconv_marker(): Marker-based deconvolution methods, available methods include gsva, debCAM, TOAST, and firstPC (first principal component score)
  • deconv_regression(): Regression-based deconvolution, available methods include nnls, cibersort (need to provide path to CIBERSORT.R), MuSiC, wRLM and RPC
  • deconv_refFree(): Reference free deconvolution methods, available methods include linseed and debCAM

To list available deconvolution methods and the dependent packages required for their execution:

list_deconv_Bayesian(show_description = T)
#>       method           function_to_call suggested_packages
#> 1 InstaPrism deconv_Bayesian_InstaPrism         InstaPrism
#> 2 BayesPrism deconv_Bayesian_BayesPrism         BayesPrism
#>                                    description
#> 1 Bayesian-based deconvolution with InstaPrism
#> 2 Bayesian-based deconvolution with BayesPrism
list_deconv_marker(show_description = T)
#>    method      function_to_call suggested_packages
#> 1 firstPC deconv_marker_firstPC         FactoMineR
#> 2    gsva    deconv_marker_gsva               GSVA
#> 3  debCAM  deconv_marker_debCAM             debCAM
#> 4   TOAST   deconv_marker_TOAST              TOAST
#>                                                                       description
#> 1 Marker-based deconvolution with first principal score (abundance estimate only)
#> 2                  Marker-based deconvolution with GSVA (abundance estimate only)
#> 3                          Marker-based deconvolution with debCAM::AfromMarkers()
#> 4                                Marker-based deconvolution with TOAST::MDeconv()
list_deconv_regression(show_description = T)
#>      method            function_to_call            suggested_packages
#> 1      nnls      deconv_regression_nnls                          nnls
#> 2 cibersort deconv_regression_cibersort e1071/parallel/preprocessCore
#> 3     MuSiC     deconv_regression_MuSiC    MuSiC/SingleCellExperiment
#> 4      wRLM      deconv_regression_wRLM                   LinDeconSeq
#> 5       RPC                  deconv_RPC                       EpiDISH
#>                                                             description
#> 1        Regression-based deconvolution with Non-Negative Least Squares
#> 2                         Regression-based deconvolution with CIBERSORT
#> 3                             Regression-based deconvolution with MuSiC
#> 4 Regression-based deconvolution with weighted robust linear regression
#> 5       Regression-based deconvolution with robust partial correlations
list_deconv_refFree(show_description = T)
#>    method       function_to_call suggested_packages
#> 1 linseed deconv_refFree_linseed            linseed
#> 2  debCAM  deconv_refFree_debCAM             debCAM
#>                                 description
#> 1 Reference free deconvolution with linseed
#> 2  Reference free deconvolution with debCAM

The output from the benchmarking_deconv() function is a list containing the deconvolution results for each simulated bulk expression.

names(deconvResults)
#> [1] "homo"               "semi"               "heter"             
#> [4] "heter_sampleIDfree" "favilaco"           "immunedeconv"      
#> [7] "SCDC"               "tcga_skcm"

Specifically, within each bulk expression, the ground-truth fractions and deconvolution results are stored in a list.

names(deconvResults$homo)
#> [1] "deconvRes" "true_frac"

deconvRes stores the deconvolution results obtained from all the evaluated methods, named using the format “deconvolution category_deconvolution method”, followed by the reference construction method or update mode when applicable.

names(deconvResults$homo$deconvRes)
#>  [1] "MarkerBased_firstPC_limma"      "MarkerBased_firstPC_scran"     
#>  [3] "MarkerBased_gsva_limma"         "MarkerBased_gsva_scran"        
#>  [5] "MarkerBased_debCAM_limma"       "MarkerBased_debCAM_scran"      
#>  [7] "MarkerBased_TOAST_limma"        "MarkerBased_TOAST_scran"       
#>  [9] "RefBased_nnls_raw"              "RefBased_nnls_limma"           
#> [11] "RefBased_nnls_scran"            "RefBased_cibersort_limma"      
#> [13] "RefBased_cibersort_scran"       "RefBased_MuSiC_raw"            
#> [15] "RefBased_MuSiC_limma"           "RefBased_MuSiC_scran"          
#> [17] "RefBased_wRLM_limma"            "RefBased_wRLM_scran"           
#> [19] "RefBased_RPC_limma"             "RefBased_RPC_scran"            
#> [21] "RefFree_linseed"                "RefFree_debCAM"                
#> [23] "Bayesian_InstaPrism_initial"    "Bayesian_InstaPrism_updatedMal"
#> [25] "Bayesian_InstaPrism_updatedAll" "mcp_counter"                   
#> [27] "epic"                           "quantiseq"                     
#> [29] "abis"                           "consensus_tme"                 
#> [31] "estimate"

Note that MuSiC does not directly utilize the numeric values from the generated reference matrices; only the gene list from those reference matrices is used as the ‘marker’ input. RefBased_MuSiC_raw indicates that all genes will be used as the ‘marker’ input, which is the default setting for MuSiC.

head(deconvResults$homo$deconvRes$RefBased_MuSiC_raw)
#>                   Mal      T.CD8         CAF  Macrophage          NK      T.CD4
#> simulated1 0.01389339 0.37244705 0.016777565 0.152799672 0.011397918 0.09545100
#> simulated2 0.17758368 0.34149168 0.055507699 0.003250587 0.011796850 0.11873110
#> simulated3 0.28983812 0.19693217 0.017852358 0.078655927 0.011706062 0.15124552
#> simulated4 0.69256470 0.07831563 0.003761517 0.005851545 0.001105379 0.05058576
#> simulated5 0.10862561 0.25300086 0.060397608 0.041044451 0.014977472 0.14071364
#> simulated6 0.20108719 0.12363975 0.064310758 0.015741896 0.028311708 0.15503925
#>                T.cell    B.cell      Endo.
#> simulated1 0.05156056 0.2722830 0.01338986
#> simulated2 0.09948854 0.1788448 0.01330501
#> simulated3 0.06314639 0.1762775 0.01434597
#> simulated4 0.05632267 0.0784363 0.03305650
#> simulated5 0.14159196 0.1084621 0.13118633
#> simulated6 0.07064584 0.2643750 0.07684858
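
For the TCGA cohort there is no cell-type-level ground truth; instead, its true_frac slot is expected to hold the pre-calculated purity estimates shown in the benchmarking_obj output above. Below is a hedged sketch of manually correlating one method's malignant-cell estimate with the CPE purity, assuming the element names and sample-by-cell-type orientation mirror the simulated data.

tcga_deconv = deconvResults$tcga_skcm$deconvRes$Bayesian_InstaPrism_initial
tcga_purity = deconvResults$tcga_skcm$true_frac
shared = intersect(rownames(tcga_deconv), rownames(tcga_purity))
# Spearman correlation between the estimated malignant fraction and CPE purity
cor(tcga_deconv[shared, 'Mal'], tcga_purity[shared, 'CPE'],
    use = 'pairwise.complete.obs', method = 'spearman')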

Step 3. Evaluate deconvolution performance

With the deconvolution results obtained, we can finally evaluate performance with the benchmarking_evalu() function.

performance = benchmarking_evalu(deconvResults)
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for homo >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for semi >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for heter >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for heter_sampleIDfree >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for favilaco >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for immunedeconv >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for SCDC >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> finish performance evaluation for tcga_skcm >>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Again, the performance object is organized by simulated bulk expression.

names(performance)
#> [1] "homo"               "semi"               "heter"             
#> [4] "heter_sampleIDfree" "favilaco"           "immunedeconv"      
#> [7] "SCDC"               "tcga_skcm"

For a straightforward assessment of deconvolution performance in terms of per-cell-type correlation:

performance$homo$maxCor[,1:4]
#>            MarkerBased_firstPC_limma MarkerBased_firstPC_scran
#> B.cell                          0.99                      0.99
#> CAF                             0.94                      0.95
#> Endo.                           0.98                      0.99
#> Macrophage                      0.99                      0.99
#> Mal                             1.00                      1.00
#> NK                              0.59                      0.63
#> T.CD4                             NA                      0.85
#> T.CD8                           0.96                      0.96
#> T.cell                            NA                        NA
#>            MarkerBased_gsva_limma MarkerBased_gsva_scran
#> B.cell                       0.88                   0.89
#> CAF                          0.86                   0.88
#> Endo.                        0.87                   0.89
#> Macrophage                   0.77                   0.82
#> Mal                          0.85                   0.93
#> NK                           0.58                   0.59
#> T.CD4                          NA                   0.70
#> T.CD8                        0.79                   0.83
#> T.cell                         NA                     NA

And in terms of per-cell-type RMSE (root mean square error) values:

performance$homo$RMSE[,1:4]
#>            MarkerBased_firstPC_limma MarkerBased_firstPC_scran
#> B.cell                          6.31                      6.36
#> CAF                             5.93                      6.02
#> Endo.                           5.97                      6.47
#> Macrophage                      6.30                      6.75
#> Mal                             6.72                      6.66
#> NK                              5.84                      5.67
#> T.CD4                            NaN                      2.85
#> T.CD8                           2.76                      6.50
#> T.cell                           NaN                       NaN
#>            MarkerBased_gsva_limma MarkerBased_gsva_scran
#> B.cell                    6083.22                5415.68
#> CAF                       5781.36                3963.63
#> Endo.                     6252.74                3526.25
#> Macrophage                8589.70                5648.36
#> Mal                       9018.31                5509.67
#> NK                        7094.51                5884.23
#> T.CD4                         NaN                8147.19
#> T.CD8                     9071.36                8500.19
#> T.cell                        NaN                    NaN

NA values indicate that the associated deconvolution performance statistic is not available. This situation may occur when reference construction methods with default parameters fail to capture a specific cell type, for example when too few genes pass the log fold change threshold. In such cases, users can explore alternative parameters for reference construction, employ different methods to generate the reference, or explicitly label the specific cell type as ‘not available’ for this particular method.
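
One way to diagnose such cases is to check how many markers each reference construction method identified per cell type. The sketch below assumes that marker_list in the benchmarking object is a per-method list of per-cell-type marker gene vectors, which has not been verified against the documentation.

# number of markers found per cell type, for each marker construction method;
# cell types with few or no markers are likely to yield NA performance values downstream
lapply(benchmarking_obj$marker_list, function(m) sapply(m, length))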

Detailed performance of each method can be queried by

head(performance$homo$deconvEvalu$RefBased_MuSiC_raw$M)
#>   cell_type  true_frac maxCorName  estimate
#> 1    B.cell 0.29711195     B.cell 0.2722830
#> 2    B.cell 0.19282543     B.cell 0.1788448
#> 3    B.cell 0.18178891     B.cell 0.1762775
#> 4    B.cell 0.08583150     B.cell 0.0784363
#> 5    B.cell 0.09275774     B.cell 0.1084621
#> 6    B.cell 0.27349112     B.cell 0.2643750
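
Since this table holds the per-sample true fractions and estimates, it can also be re-summarized with other metrics. A brief sketch computing a per-cell-type Spearman correlation with dplyr (already loaded with the package):

performance$homo$deconvEvalu$RefBased_MuSiC_raw$M %>%
  group_by(cell_type) %>%
  summarise(spearman = cor(true_frac, estimate, method = 'spearman', use = 'pairwise.complete.obs'))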

Visualization of the deconvolution performance

Finally, we provide the gather_performance() function to summarize all the performance statistics in a long table, facilitating further comparisons between deconvolution methods.

summarized_performance = gather_performance(performance)
head(summarized_performance)
#>    cell_type RMSE  cor maxCorName                    method class
#> 1     B.cell 6.31 0.99     B.cell MarkerBased_firstPC_limma  homo
#> 2        CAF 5.93 0.94        CAF MarkerBased_firstPC_limma  homo
#> 3      Endo. 5.97 0.98      Endo. MarkerBased_firstPC_limma  homo
#> 4 Macrophage 6.30 0.99 Macrophage MarkerBased_firstPC_limma  homo
#> 5        Mal 6.72 1.00        Mal MarkerBased_firstPC_limma  homo
#> 6         NK 5.84 0.59         NK MarkerBased_firstPC_limma  homo

For example, for a direct comparison of regression-based methods under different simulation strategies:

# for simplicity, we control that the reference used by each method is the same 
df = summarized_performance[summarized_performance$class %in% c('homo','semi','heter_sampleIDfree','heter'),]
df = df[df$method %in% c('RefBased_cibersort_scran','RefBased_nnls_scran','RefBased_MuSiC_scran','RefBased_RPC_scran','RefBased_wRLM_scran'),]
df$reg_method = str_extract(df$method, "(?<=_)[^_]+")

pd <- position_dodge(width = 0.1)
ggplot(df,aes(x = reorder(class,cor,median,decreasing = T), 
                                     y = cor, group = reg_method)) +
  geom_line(position = pd,aes(color=reg_method)) +
  geom_point(position = pd,aes(color=reg_method,shape = reg_method )) +
  # scale_color_manual(values = c('#F46D43','#FDAE61',"#8DA0CB","#E78AC3","#A6D854"))+
  facet_wrap(~cell_type,scales = 'free',nrow = 2)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1,size = 7),
        strip.text.x = element_text(size = 8),
        axis.text.y = element_text(size = 7),
        axis.title.x = element_text(size = 7),
        axis.title.y = element_text(size = 7),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 7),
        strip.background = element_rect(fill = 'white'))+
  xlab('')+
  ylab('pearson r')

In the above example, we demonstrated a global performance drop as the heterogeneity level increases in the simulated bulk expression, with MuSiC exhibiting the highest per-cell-type prediction accuracy across all simulation strategies.

References

Avila Cobos, Francisco, José Alquicira-Hernandez, Joseph E. Powell, Pieter Mestdagh, and Katleen De Preter. 2020. “Benchmarking of cell type deconvolution pipelines for transcriptomics data.” Nature Communications 11 (1): 5650. https://doi.org/10.1038/s41467-020-19015-1.
Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. 2016. “TCGAbiolinks: An r/Bioconductor Package for Integrative Analysis of TCGA Data.” Nucleic Acids Research 44 (8): e71. https://doi.org/10.1093/nar/gkv1507.
Dong, Meichen, Aatish Thennavan, Eugene Urrutia, Yun Li, Charles M Perou, Fei Zou, and Yuchao Jiang. 2021. “SCDC: Bulk Gene Expression Deconvolution by Multiple Single-Cell RNA Sequencing References.” Briefings in Bioinformatics 22 (1): 416–27. https://doi.org/10.1093/bib/bbz166.
Hu, Mengying, and Maria Chikina. 2023. “Heterogeneous Pseudobulk Simulation Enables Realistic Benchmarking of Cell-Type Deconvolution Methods.” http://dx.doi.org/10.1101/2023.01.05.522919.
Jerby-Arnon, Livnat, Parin Shah, Michael S. Cuoco, Christopher Rodman, Mei-Ju Su, Johannes C. Melms, Rachel Leeson, et al. 2018. “A Cancer Cell Program Promotes T Cell Exclusion and Resistance to Checkpoint Blockade.” Cell 175 (4): 984–997.e24. https://doi.org/10.1016/j.cell.2018.09.006.
Sturm, Gregor, Francesca Finotello, Florent Petitprez, Jitao David Zhang, Jan Baumbach, Wolf H. Fridman, Markus List, and Tatsiana Aneichyk. 2019. “Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology.” Bioinformatics (Oxford, England) 35 (14): i436–45. https://doi.org/10.1093/bioinformatics/btz363.