Tutorial of InstaPrism

Introduction

InstaPrism is R package to deconvolute cellular proportion and gene expression in bulk RNA-Seq data. Based on the same conceptual framework and corresponding generative mode from BayesPrism, InstaPrism re-implements BayesPrism in a derandomized framework by replacing the time-consuming Gibbs sampling steps in BayesPrism with a fixed-point algorithm, which greatly accelerated the calculation speed while maintaining highly comparable performance.

This tutorial is associated the latest version of InstaPrism (InstaPrism v0.1.5), which contains the following key modifications:

introduces a 2-dimensional scaler matrix that can be used to reconstruct any cell.type specific expression of interest, as a memory-efficient alternative to the original 3-d Z array
enables cell-type-specific expression deconvolution with the updated reference
adds build-in reference tailored for a wide range of cancer types
introduces get_Z_array() function to retrieve 3-d cell type specific expression array (sample \(\times\) genes \(\times\) cell.type)
introduces get_loglikelihood_trace() function to visualize model’s log likelihood over iterations
introduces get_subcluster() function to identify subcluster structures within each cell type, can be used to specify cell-state information
InstaPrism() function associated with previous tutorial (InstaPrism v0.1.3) can be accessed using the InstaPrism_legacy() function

In this tutorial, we provide three examples of running InstaPrism and compare the results with BayesPrism.

Getting started

Load the InstaPrism package.

InstaPrism works as an independent R package and does not require the users to have BayesPrism installed.
We recommend the readers to read through the tutorial of BayePrism before running our examples.
extdata folder associated with this tutorial can be downloaded from zenodo repository.

library(InstaPrism)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: tidyr
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loading required package: caret
#> Loading required package: ggplot2
#> Loading required package: lattice
#> Loading required package: NMF
#> Loading required package: pkgmaker
#> Loading required package: registry
#> Loading required package: rngtools
#> Loading required package: cluster
#> NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 15/16
#>   To enable shared memory capabilities, try: install.extras('
#> NMF
#> ')
#> Loading required package: pheatmap
#> Loading required package: abind
#> Loading required package: pbapply

Example 1: deconvolution on a small simulated dataset

Step 1. Create simulated single cell and bulk expression data

The following lines simulate bulk and single-cell expression, as well as marker genes and true proportions that can be used as an example of deconvolution using the simulation model the BisqueRNA package.

library(Biobase)
library(BisqueRNA)
cell.types <- c("Neurons", "Astrocytes", "Oligodendrocytes", "Microglia", "Endothelial Cells","others")
avg.props <- c(.5, .2, .15, .07, .03,.05)
sim.data <- SimulateData(n.ind=20, n.genes=1000, n.cells=500, cell.types=cell.types, avg.props=avg.props)

sim.data contains the following simulated objects:

A 1000 \(\times\) 10000 single-cell expression object for 20 individuals, with cells annotated with inidivual ID and cell type labels.
A 1000 \(\times\) 20 bulk expression object for 20 individuals.
A 6 \(\times\) 20 matrix indicating cellular proportions for 20 individuals.
A dataframe indicating marker genes for each cell type

Alternatively, readers can load the example simulated data directly from the InstaPrism package.

data("sim.data")

Step 2. Prepare input for InstaPrism and BayesPrism

Both InstaPrism and BayesPrism take single cell profiles as prior information and return deconvolution results for a given bulk expression matrix. To prepare input for InstaPrism and BayesPrism, we need:

scExpr: a single-cell expression data as prior information (raw count or library size normalized expression, without log transformation and scaling)
bulk_Expr: a bulk expression to run deconvolution (raw count or library size normalized expression, without log transformation and scaling)
cell.type.labels: a character vector indicating cell types of each cell from the scRNA data
cell.state.labels: a character vector indicating cell states of each cell from the scRNA data

In real practice, cell.state.labels usually denote different cell states from a same given cell type. For example, malignant cells can be subclustered by different patients to denote different malignant states. With the sim.data we created above, we will generate some artificial cell state labels for Neurons given individual IDs.

library(Biobase)
sc.eset <- sim.data$sc.eset
bulk.eset <- sim.data$bulk.eset

sc_Expr = exprs(sc.eset)
sc_Expr = apply(sc_Expr,2,function(x)((x/sum(x))*1e+05)) 

bulk_Expr = exprs(bulk.eset)
bulk_Expr = apply(bulk_Expr,2,function(x)((x/sum(x))*1e+06)) 

cell_type_labels=sc.eset@phenoData@data[["cellType"]]

# create artifical cell-state labels for Neuron cells: Neuron_A, Neuron_B, Neuron_C
Neurons_states=names(table(sc.eset@phenoData@data$SubjectName))
names(Neurons_states)=c(rep("A",8),rep("B",6),rep("C",6))

cell_state_labels=ifelse(sc.eset@phenoData@data[["cellType"]]=='Neurons',paste0('Neurons_sub',names(Neurons_states)[match(sc.eset@phenoData@data$SubjectName,Neurons_states)]),sc.eset@phenoData@data[["cellType"]])

table(cell_state_labels)
#> cell_state_labels
#>        Astrocytes Endothelial Cells         Microglia      Neurons_subA 
#>              1958               306               727              2014 
#>      Neurons_subB      Neurons_subC  Oligodendrocytes            others 
#>              1497              1477              1445               576

Step 3. Run deconvolution with InstaPrism and BayesPrism

InstaPrism() is the main deconvolution function. Depending the format of single cell prior information to use, users need to specify the input_type argument passed to the function. The InstaPrism() function currently support four input_types.

Note that in the following examples we will consider a simple implementation where we use prior information from the scRNA Seq directly, without the need to update reference.

1. InstaPrism with input_type = 'raw'

Under this setting, InstaPrism takes the same input format as BayesPrism:

scExpr: a single-cell expression data as prior information (raw count or CPM normalized expression)
bulk_Expr: a bulk expression to run deconvolution (raw count or TPM normalized expression)
cell.type.labels: a character vector indicating cell types of each cell from the scRNA data
cell.state.labels: a character vector indicating cell states of each cell from the scRNA data

library(tictoc) # load package to measure execution time
tic()
InstaPrism.res = InstaPrism(input_type = 'raw',sc_Expr = sc_Expr,bulk_Expr = bulk_Expr,
                    cell.type.labels = cell_type_labels,cell.state.labels = cell_state_labels)

elapsed1 = toc()
#> 0.161 sec elapsed

2. InstaPrism with input_type = 'prism'

In this mode, users need to first construct a prism object using the BayesPrism package, and then pass the prismObj to the InstaPrism() function.

library(BayesPrism)
#> Loading required package: snowfall
#> Loading required package: snow
#> Warning: replacing previous import 'gplots::lowess' by 'stats::lowess' when
#> loading 'BayesPrism'
#> Warning: replacing previous import 'BiocParallel::register' by 'NMF::register'
#> when loading 'BayesPrism'
tic()
bp.obj = new.prism(reference = t(sc_Expr),input.type = 'count.matrix',
                   cell.type.labels = cell_type_labels, cell.state.labels = cell_state_labels,
                   mixture = t(bulk_Expr),key = NULL)
InstaPrism.res2 = InstaPrism(input_type = 'prism',prismObj = bp.obj)

elapsed2 = toc()
#> 0.72 sec elapsed

3. InstaPrism with input_type = 'refPhi' (recommended)

Under this setting, users need to pass a pre-defined refPhi object containing single cell prior information. To build a refPhi object, users can use the refPrepare() function with the following arguments:

scExpr: a single-cell expression data as prior information (raw count or CPM normalized expression)
cell.type.labels: a character vector indicating cell types of each cell from the scRNA data
cell.state.labels: a character vector indicating cell states of each cell from the scRNA data

The resulting refPhi object contains the following information:

phi.cs: cell state specific reference
phi.ct: cell type specific reference (this information is only required for BayesPrism based reference update)
map: a list with mapping information from cell.states to cell.types

The advantage of using refPhi as input is that one doesn’t need to load the large single cell expression matrix every time, and the refPhi object can be easily re-implemented once built.

refPhi_obj = refPrepare(sc_Expr, cell_type_labels, cell_state_labels)
tic()
InstaPrism.res3 = InstaPrism(input_type = 'refPhi',bulk_Expr = bulk_Expr,refPhi = refPhi_obj)

elapsed3 = toc()
#> 0.049 sec elapsed

4. InstaPrism with input_type = 'refPhi_cs' (recommended)

Under this setting, users need to specify a pre-defined refPhi_cs object which can be built by

refPhi_cs_obj = refPrepare(sc_Expr, cell_type_labels, cell_state_labels,ref_type = 'refPhi_cs')
InstaPrism.res = InstaPrism(input_type = 'refPhi_cs',bulk_Expr = bulk_Expr,refPhi = refPhi_cs_obj)

Alternatively, users can use the built-in reference provided by InstaPrism, which are all in refPhi_cs format.

Using refPhi_cs provides a more versatile setting where users can selectively choose which cell-states to include for deconvolution. For example, given the refPhi object we built in the previous section, if we only want to include Neurons_subA and Neurons_subB as cell states of interest, and remove Neurons_subC from Neurons, we can:

refPhi_cs_map = refPhi_obj@map
refPhi_cs_map$Neurons = c('Neurons_subA', 'Neurons_subB' )
to_move_id = which(colnames(refPhi_obj@phi.cs) == 'Neurons_subC')
refPhi_cs_obj = new('refPhi_cs',phi.cs = refPhi_obj@phi.cs[,-to_move_id],map = refPhi_cs_map)

InstaPrism.res4 = InstaPrism(input_type = 'refPhi_cs',bulk_Expr = bulk_Expr,refPhi_cs = refPhi_cs_obj)

InstaPrism() function returns an S4 object containing the following information:

Post.ini.cs: posterior information for cell.states
- theta: cell.state fraction estimates for each individual
Post.ini.ct: posterior information for cell.types
- theta: cell.types fraction estimates for each individual
map: a list with mapping information from cell.states to cell.types
initial.reference: an S4 object with scRNA derived reference matrix at cell.type and cell.state levels
initial.scaler: a gene \(\times\) sample matrix that can be used to reconstruct cell.state/cell.type specific expression

Run BayesPrism for comparison

In the following example, we are setting update.gibbs = F for consistency with our previous examples where we only consider single-cell based reference, without updating the reference. Deconvolution with BayesPrism can be called by a single function run.prism() as follows:

start.time = Sys.time()
bp.res = run.prism(bp.obj,update.gibbs = F)
end.time=Sys.time()
bp_running_time = difftime(end.time, start.time,units = 'secs') %>% as.numeric()
save(bp.res,bp_running_time,file='extdata/tutorial_example1/bp.res.initial.RData')

Step 4. Compare deconvolution results: InstaPrism vs BayesPrism

1. deconvolution performance of InstaPrism

We provide a useful function deconv_performance_plot() that visualize correlations at per-cell.type level. According to the plot, InstaPrism accurately predicts cellular proportions from sim.data

deconv_performance_plot(est = InstaPrism.res@Post.ini.ct@theta,true = sim.data$props,title = 'InstaPrism performance on sim.data',nrow=1)
#> Loading required package: ggpmisc
#> Loading required package: ggpp
#> 
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate

2. compare cell type fraction estimates from InstaPrism and BayesPrism

We can visualize the correlation between two fraction estimates with a heatmap. As shown by the heatmap, cell.type estimates from both methods are highly correlated.

corr=cor(t(InstaPrism.res@Post.ini.ct@theta),bp.res@posterior.initial.cellType@theta)
diag(corr)
#>           Neurons        Astrocytes  Oligodendrocytes         Microglia 
#>         0.9999964         0.9999965         0.9999892         0.9999696 
#> Endothelial Cells            others 
#>         0.9999649         0.9999805

ComplexHeatmap::Heatmap(corr,show_row_dend = F,show_column_dend = F,column_title = 'BayesPrism',row_title = 'InstaPrism',name='correlation')

3. running time comparison InstaPrism significantly accelerates the deconvolution speed in either modes.

InstaPrism_mode1_running_time = as.numeric(elapsed1$toc-elapsed1$tic)
InstaPrism_mode2_running_time = as.numeric(elapsed2$toc-elapsed2$tic)
InstaPrism_mode3_running_time = as.numeric(elapsed3$toc-elapsed3$tic)

rt=data.frame(method=c('InstaPrism.raw.mode','InstaPrism.prism.mode','InstaPrism.refPhi.mode','BayesPrism'),
              time=c(InstaPrism_mode1_running_time,InstaPrism_mode2_running_time,
                     InstaPrism_mode3_running_time,bp_running_time))
rt$time=round(rt$time,2)
ggplot(rt,aes(method,time))+
  geom_bar(stat="identity",fill='grey',width = 0.5)+
  theme_bw()+
  xlab('')+
  ylab('time (secs)')+
  geom_text(aes(label = time), vjust = -0.2)+
  ylim(0,max(rt$time)*1.05)+
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Step 5. Reconstruct cell.type specific expression

The 2-dimensional initial.scaler matrix from the InstaPrism output can be used to reconstruct any cell.states/cell.types specific expression of interest, which serves as an extra memory-efficient alternative to the 3-dimensional Z array in the BayesPrism output.

To reconstruct Neurons specific expression from the above example:

Neuron_expr = reconstruct_Z_ct_initial(InstaPrism_obj = InstaPrism.res,
                                       cell.type.of.interest = 'Neurons')

And BayesPrism derived Neurons specific expression can be obtained by:

Neuron_expr_bp = t(bp.res@posterior.initial.cellType@Z[,,'Neurons'])

Two Neurons specific expression matrices are near identical.

corr = cor(Neuron_expr,Neuron_expr_bp)
ComplexHeatmap::Heatmap(corr,show_row_dend = F,show_column_dend = F,column_title = 'BayesPrism',row_title = 'InstaPrism',name='correlation',cluster_rows = F,cluster_columns = F,show_row_names = F,show_column_names = F)

InstaPrism also supports full 3d Z tensor reconstruction (sample \(\times\) genes \(\times\) cell.type)

Z = get_Z_array(InstaPrism.res,resolution = 'ct',n.core = 16)

The dimension of Z is equivalent to BayesPrism output

dim(Z)
#> [1]  20 988   6

dim(bp.res@posterior.initial.cellType@Z)
#> [1]  20 988   6

Two Z tensors are nearly identical

all.equal(Z,bp.res@posterior.initial.cellType@Z)
#> [1] "Mean relative difference: 0.001901318"

Step 6 (optional). Run deconvolution with InstaPrism and BayesPrism using the updated reference

We now consider a more advanced deconvolution problem where we want to leverage from the information shared by bulk samples and update the reference matrix accordingly.

In InstaPrism, we present a more versatile InstaPrism_update() function for a similar purpose, with the following arguments included:

cell.types.to.update: a user-defined vector indicating which (non-malignant) cell.types to update. By default, this parameter is set to ‘all,’ indicating the updating of all cell type references, following the same approach as utilized in BayesPrism.
key name of the malignant cell type. Upon setting the key parameter, the updated malignant reference will be unique for each individual. Set to NA if there is no malignant cells in the problem, and the updated reference will be the same for all the individuals
keep.phi: either ‘phi.ct’ or ‘phi.cs’, denoting whether using phi.ct or phi.cs for cell.types that are not updated.

InstaPrism_update() returns an S4 object containing the following objects:

theta: cell state/type abundance matrix
psi_mal: a gene \(\times\) sample reference matrix for malignant cells, where each column represents the malignant reference for a corresponding sample; the psi_mal will be a NA matrix if no malignant cells present in the problem
psi_env: reference matrix for the non-malignant cells
scaler: a gene \(\times\) sample matrix to reconstruct any cell.type specific expression of interest
map: a list with mapping information from cell.states to cell.types
updated.cell.types a vector indicating cell.types with updated reference; equals to NULL if no environmental (non-malignant) cell.types being updated
key: the key argument passed to InstaPrism_update() function
keep.phi: the keep.phi argument passed to InstaPrism_update() function

Among these objects, the ‘theta’ matrix exclusively holds the deconvolution results of interest, while the remaining objects are only required for Z reconstruction.

1. Reference update using InstaPrism_update()

To update the reference profile for the above simulated dataset, we will set key = NA since no malignant cells are present in this problem. By setting cell.types.to.update = 'all', we specify that all the cell.type reference will be updated.

tic()
InstaPrism.res.updated = InstaPrism_update(InstaPrism.res,
                                           bulk_Expr,
                                           cell.types.to.update = 'all',
                                           key = NA)
#> the "key" parameter is set to NA, the updated reference will be the same for all the individuals

elp = toc()
#> 0.058 sec elapsed

2. Reference update with BayesPrism

To enable reference update using BayesPrism, one can set update.gibbs = TRUE in run.prism() function (the default setting), or by calling the update.theta() function on a BayesPrism object returned from run.prism() function with update.gibbs = FALSE.

For a direct running time comparing between reference update steps, we will first run run.prism() function with update.gibbs = FALSE and then use the update.theta() function.

bp.res = run.prism(bp.obj,update.gibbs = F)

# we will only consider running time for reference updates
start.time = Sys.time() 
bp.res.updated = update.theta(bp=bp.res)

# the bp.res.updated obtained by executing run.prism() + update.theta() steps mentioned above is equivalent to running the following command in a single line 
# bp.res.updated = run.prism(bp.obj,update.gibbs = T) 

end.time=Sys.time()
bp_updated_running_time = difftime(end.time, start.time,units = 'secs') %>% as.numeric()
save(bp.res.updated,bp_updated_running_time,file = 'extdata/tutorial_example1/bp.res.updated.RData')

3. deconvolution results comparison

When we set cell.types.to.update equals to all the unique cell.types present in the reference profile, two deconvolution is near identical.

corr=cor(t(InstaPrism.res.updated@theta),bp.res.updated@posterior.theta_f@theta)
diag(corr)
#>           Neurons        Astrocytes  Oligodendrocytes         Microglia 
#>         0.9999949         0.9999941         0.9999872         0.9999680 
#> Endothelial Cells            others 
#>         0.9999705         0.9999546

4. running time comparison

InstaPrism_update() demonstrates a substantial acceleration in reference update speed compared to BayesPrism::update.theta()

InstaPrism_updated_running_time = as.numeric(elp$toc - elp$tic)
rt=data.frame(method=c('InstaPrism','BayesPrism'),
              time=c(InstaPrism_updated_running_time,bp_updated_running_time))
rt$time=round(rt$time,2)
ggplot(rt,aes(method,time))+
  geom_bar(stat="identity",fill='grey',width = 0.4)+
  theme_bw()+
  ylab('time (secs)')+
  geom_text(aes(label = time), vjust = -0.2)+
  ylim(0,max(rt$time)*1.05)

5. reconstruct cell.type specific expression with the updated reference

We provide a useful function reconstruct_Z_ct_updated() that enables the output of cell-type-specific expression using the updated reference.
For example, if we are interested in Neurons specific expression with the updated reference, we can

Neuron_expr_updated = reconstruct_Z_ct_updated(InstaPrism_updated_obj = InstaPrism.res.updated,
                                               cell.type.of.interest = 'Neurons')

Alternatively, one can use get_Z_array() function to get the 3d Z array directly. The dimension of this Z array is sample \(\times\) genes \(\times\) cell.type.

Z = get_Z_array(InstaPrism.res.updated,resolution = 'ct',n.core = 16)

6. additional notes on InstaPrism_update function

In the above examples we only considered a simple scenario where we update all the cell.types present. In real practice, we may want to update only specific cell.types, such as those exhibiting high cross-sample heterogeneity, while retaining the scRNA based reference for the remaining cell types. In such cases, users need to specify the keep.phi argument, indicating whether to use phi.ct or phi.cs for cell.types that are not updated.

Since either phi.ct or phi.cs can be used for the cell.types that are not being updated, the theta matrix can be a mixture of abundance estimates for both cell.types and cell.states. To merge any cell.state information to cell.type levels, users can utilize the convenient merge_updated_theta() function for this purpose. Note that is function is only applicable when keep.phi = 'phi.cs'

updated_ct = merge_updated_theta(InstaPrism.res.updated)

Example 2: deconvolution on the tutorial data from BayesPrism

In this example, we will use the tutorial data provided in BayesPrism and compare the deconvolution results between InstaPrism and BayesPrism.

Step 1. Run BayesPrism following the tutorial in BayesPrism.

Note that it takes more than 6 hours to run the following code from our end (using n.core=16), users can skip the BayesPrism running process by loading our processed results directly.

library(BayesPrism)
load('extdata/tutorial_example2/tutorial.gbm.rdata')
sc.stat <- plot.scRNA.outlier(
  input=sc.dat, #make sure the colnames are gene symbol or ENSMEBL ID 
  cell.type.labels=cell.type.labels,
  species="hs", #currently only human(hs) and mouse(mm) annotations are supported
  return.raw=TRUE #return the data used for plotting. 
  #pdf.prefix="gbm.sc.stat" specify pdf.prefix if need to output to pdf
)
bk.stat <- plot.bulk.outlier(
  bulk.input=bk.dat,#make sure the colnames are gene symbol or ENSMEBL ID 
    sc.input=sc.dat, #make sure the colnames are gene symbol or ENSMEBL ID 
  cell.type.labels=cell.type.labels,
  species="hs", #currently only human(hs) and mouse(mm) annotations are supported
  return.raw=TRUE
  #pdf.prefix="gbm.bk.stat" specify pdf.prefix if need to output to pdf
)
# Filter outlier genes from scRNA-seq data
sc.dat.filtered <- cleanup.genes (input=sc.dat,
                                  input.type="count.matrix",
                                    species="hs", 
                                    gene.group=c( "Rb","Mrp","other_Rb","chrM","MALAT1","chrX","chrY") ,
                                    exp.cells=5)
# Subset protein coding genes
sc.dat.filtered.pc <-  select.gene.type (sc.dat.filtered,
                                        gene.type = "protein_coding")
# construct a prism object
myPrism <- new.prism(
  reference=sc.dat.filtered.pc, 
  mixture=bk.dat,
  input.type="count.matrix", 
  cell.type.labels = cell.type.labels, 
  cell.state.labels = cell.state.labels,
  key="tumor",
  outlier.cut=0.01,
    outlier.fraction=0.1,
)
# run BayesPrism
start.time = Sys.time()
bp.res <- run.prism(prism = myPrism, n.cores=16,update.gibbs=T) # set update.gibbs=T for full comparison between two methods
end.time=Sys.time()
bp_running_time = difftime(end.time, start.time,units = 'mins') %>% as.numeric()
save(bp.res,bp_running_time,file = 'extdata/tutorial_example2/bp.res.RData')

Alternatively, users can download the processed results directly from zenodo repository

load('extdata/tutorial_example2/bp.res.RData')

Step 2. Deconvolution with InstaPrism

We present a step-by-step InstaPrism deconvolution framework using the following commands:

tic()
# 1. get initial deconvolution results using scRNA based reference
InstaPrism.res.initial = InstaPrism(input_type = 'prism',prismObj = bp.res@prism, n.iter = 100, n.core = 16)

# 2. reconstruct cell.type specific expression with the initial reference (optional)
Z_ct_initial = get_Z_array(InstaPrism.res.initial,resolution = 'ct',n.core = 16)

# 3. update reference & get deconovlution results from the updated reference
updated_obj = InstaPrism_update(InstaPrism.res.initial, 
                                bulk_Expr = t(bp.res@prism@mixture), 
                                n.iter = 100, 
                                n.core = 16, 
                                cell.types.to.update = 'all', 
                                key = 'tumor')

# 4. reconstruct cell.type specific expression with the updated reference (optional)
Z_ct_updated = get_Z_array(updated_obj, resolution = 'ct', n.core = 16)

elp = toc()
#> 76.866 sec elapsed

Step 3. Compare deconvolution results

1. cell.type fraction comparison (with initial reference)

cell.type fraction estimates from two methods are nearly identical

ct.corr=cor(t(InstaPrism.res.initial@Post.ini.ct@theta),bp.res@posterior.initial.cellType@theta) %>% round(1)

ComplexHeatmap::Heatmap(ct.corr,show_row_dend = F,show_column_dend = F,column_title = 'BayesPrism',row_title = 'InstaPrism',name='correlation', cell_fun = function(j, i, x, y, w, h, col) {grid::grid.text(ct.corr[i, j], x, y)})

2. cell type specific expression comparison

Two tumor-specific expression matrix are near identical

# tumor specific expression from BayesPrism
Z_tumor_ct_initial_bp = bp.res@posterior.initial.cellType@Z[,,'tumor']
all.equal(Z_ct_initial[,,'tumor'],Z_tumor_ct_initial_bp)
#> [1] "Mean relative difference: 0.005304617"

3. cell.type fraction comparison (with updated reference)

cell.type fraction estimates from two methods are nearly identical

ct.corr.updated=cor(t(updated_obj@theta),bp.res@posterior.theta_f@theta)
diag(ct.corr.updated)
#>       tumor     myeloid    pericyte endothelial       tcell       oligo 
#>   0.9999576   0.9999839   0.9999233   0.9998641   0.9950594   0.9999056

4. running time comparison InstaPrism significantly reduced the running time from hours to mere minutes on this tutorial data.

InstaPrism_running_time = as.numeric(elp$toc - elp$tic) / 60 # tic and toc are the elapsed process times in seconds, divide it by 60 to convert to minutes
rt=data.frame(method=c('InstaPrism \n (n.core=16)','BayesPrism \n (n.core=16)'),
              time=c(InstaPrism_running_time,bp_running_time))
rt$time=round(rt$time,2)
ggplot(rt,aes(method,time))+
  geom_bar(stat="identity",fill='grey',width = 0.5)+
  theme_bw()+
  ylab('time (mins)')+
  geom_text(aes(label = time), vjust = -0.2)+
  ylim(0,max(rt$time)*1.08)

5. Memory comparison With 2-d compression of the Z matrix, InstaPrism significantly reduce the memory required to store the deconvolution object

InstaPrism_memory = object.size(c(InstaPrism.res.initial,updated_obj))
bp_memory = object.size(bp.res)

mp=data.frame(method=c('InstaPrism','BayesPrism'),
              memory=c(object.size(c(InstaPrism.res.initial,updated_obj)),object.size(bp.res)),
              m = c('85MB','1.8GB'))

ggplot(mp,aes(method,memory))+
  geom_bar(stat="identity",fill='grey',width = 0.5)+
  theme_bw()+
  ylab('memory (bytes)')+
  geom_text(aes(label = m), vjust = -0.4)+
  ylim(0,2e+09)

Example 3: deconvolution on heterogeneouly simulated bulk data

We have recently proposed a heterogeneous bulk simulation pipeline that simulate bulk samples with realistic biological variance (for more details, check here). In the following example, we will run deconvolution on a heterogeneouly simulated bulk dataset of HNSC tumors and evaluate the deconvolution performance by comparing the fraction estimates with real fractions.

Step 1. Load example data

For details about how the bulk samples are simulated, check here.

load('extdata/tutorial_example3/example3.RData')

The example3.RData contains the following objects:

simulated_bulk: simulated bulk expression of HNSC tumors to run deconvolution
simulated_frac: simulated cell.type fractions for the bulk samples as ground truth
scExpr_train: sc-RNA data as prior information
cell.type.labels: a character vector indicating cell types of each cell from the scRNA data
cell.state.labels: a character vector indicating cell states of each cell from the scRNA data

where we assign the malignant cells with different cell.states by their patient IDs.

Step 2. Run InstaPrism

InstaPrism.res = InstaPrism(input_type = 'raw',sc_Expr = scExpr_train,bulk_Expr = simulated_bulk,
                            cell.type.labels = cell.type.labels, cell.state.labels = cell.state.labels, 
                            n.iter = 100,n.core = 16)

Step 3. Compare InstaPrism results with real cell.type fractions

deconv_performance_plot(est = InstaPrism.res@Post.ini.ct@theta,true = t(simulated_frac),title = 'InstaPrism performance on heterogenously simulated bulk data',nrow=2)

InstaPrism achieves reasonable cell.type estimates for the simulated bulk samples.

A complete InstaPrism deconvolution framework

We present the following framework to run deconvolution with InstaPrism

1. get initial deconvolution results using scRNA based reference

Depending on input types, initial deconvolution results can be obtained through:

InstaPrism.res.initial = InstaPrism(input_type = 'raw',sc_Expr,bulk_Expr,cell.type.labels,cell.state.labels,n.core = 16)
InstaPrism.res.initial = InstaPrism(input_type = 'prism',prismObj,n.core = 16)
InstaPrism.res.initial = InstaPrism(input_type = 'refPhi',bulk_Expr, refPhi,n.core = 16) 
InstaPrism.res.initial = InstaPrism(input_type = 'refPhi_cs',bulk_Expr, refPhi_cs,n.core = 16)

InstaPrism() function returns an S4 object containing the following information:

Post.ini.cs: posterior information for cell.states
- theta: cell.state fraction estimates for each individual
Post.ini.ct: posterior information for cell.types
- theta: cell.types fraction estimates for each individual
map: a list with mapping information from cell.states to cell.types
initial.reference: an S4 object with scRNA derived reference matrix at cell.type and cell.state levels
initial.scaler: a gene \(\times\) sample matrix that can be used to reconstruct cell.state/cell.type specific expression

2. reconstruct cell.state/cell.type specific expression with the initial reference (optional)

Z_cs_initial = reconstruct_Z_cs_initial(InstaPrism.res.initial,cell.state.of.interest)
Z_ct_initial = reconstruct_Z_ct_initial(InstaPrism.res.initial,cell.type.of.interest)

reconstruct_Z_cs_initial() and reconstruct_Z_ct_initial() returns a gene \(\times\) sample matrix containing cell-state or cell-type specific expression information.

Alternatively, use get_Z_array() to get the 3-d cell.state/cell.type specific expression:

Z_cs_initial = get_Z_array(InstaPrism.res.initial,resolution = 'cs')
Z_ct_initial = get_Z_array(InstaPrism.res.initial,resolution = 'ct')

3. update reference & get deconovlution results from the updated reference

updated_obj = InstaPrism_update(InstaPrism.res.initial, 
                                bulk_Expr,
                                cell.types.to.update = 'all', 
                                key = 'tumor',
                                n.core = 16)

updated_obj@theta contains cell state/type abundance estimates, while the remaining objects returned by InstaPrism_update() is for Z reconstruction only.

4. merge cell.state level theta to cell.type level (optional)

When updated_obj@theta contains cell.state abundance estimates, one can use merge_updated_theta() function to merge any cell.state information to cell.type levels.

updated_ct = merge_updated_theta(updated_obj)

5. reconstruct cell.type specific expression with the updated reference (optional)

Z_ct_updated = get_Z_array(updated_obj,n.core = 16)

Additional notes on InstaPrism parameters

So far we have demonstrated the primary usage of InstaPrism, mostly with the default parameters. In real practice, some of the default parameters can be tailored to meet the users’ specific requirements.

n.iter: number of iterations in InstaPrism algorithm.
n.iter is an important parameter for InstaPrism algorithm. InstaPrism updates the fraction estimates over iterations and n.iter determines how many iterations are performed. Under default setting, this value is set to max(100, number of cell.states \(\times\) 2). However, users can set it to larger values to improve convergence. Specifically, we included two additional parameters that help to verify the convergence status: verbose and convergence.plot.
verbose: a logical variable to determine whether to display convergence status of the model.
By setting verbose = T, InstaPrsim() will print the convergence status summary for both cell.states and cell.types.
Specifically, the absolute difference in fraction estimates between the last two iterations is utilized as an indicator of convergence (abs_diff), with smaller values indicating convergence and higher confidence of the deconvolution results. Using example3.RData as a demonstration, by setting verbose = T, we get a summarized table showing the minimum, median and maximum abs_diff for all cell.states/types across the samples. Usually we consider abs_diff < 0.01 as convergence.

load('extdata/tutorial_example3/example3.RData')
InstaPrism.res = InstaPrism(input_type = 'raw',sc_Expr = scExpr_train,bulk_Expr = simulated_bulk,
                            cell.type.labels = cell.type.labels, cell.state.labels = cell.state.labels, 
                            n.iter = 100,n.core = 16,verbose = T)
#> deconvolution with scRNA reference phi 
#> 
#>  
#>  ********************** convergence status summary ********************** 
#>  instructions: 
#>  the absolute difference in fraction estimates between the last two iterations is utilized as an indicator of convergence, 
#>  with smaller values indicating convergence (usually consider abs_diff < 0.01 as convergence), 
#>  below is a summarized convergence status for cell.states/cell.types across all the samples 
#>   
#>  *************** convergence status summary for cell states *************** 
#>     cell.state          min       median          max
#> 1       B cell 6.572045e-14 2.042905e-06 1.335874e-05
#> 2    Dendritic 1.257843e-07 4.954830e-06 2.266095e-05
#> 3  Endothelial 1.554095e-07 8.878092e-06 4.762081e-05
#> 4   Fibroblast 3.301680e-08 1.627585e-05 6.743852e-05
#> 5        HNSCC 1.278888e-07 2.426794e-05 1.275864e-04
#> 6      HNSCC10 1.091520e-08 1.591293e-06 1.144036e-05
#> 7      HNSCC12 2.139345e-09 5.206182e-07 2.872120e-05
#> 8      HNSCC13 5.232302e-07 2.301956e-05 8.867985e-05
#> 9      HNSCC16 9.318927e-08 2.984146e-05 2.215908e-04
#> 10     HNSCC17 1.417992e-07 2.709896e-05 3.314047e-04
#> 11     HNSCC18 5.775770e-10 3.036487e-07 6.254012e-05
#> 12     HNSCC20 5.802662e-07 1.003421e-05 1.443692e-04
#> 13     HNSCC22 1.080305e-08 2.707936e-05 4.096297e-04
#> 14     HNSCC24 3.619047e-09 1.656969e-06 2.719350e-05
#> 15     HNSCC25 1.389515e-08 2.383273e-05 2.361662e-04
#> 16     HNSCC26 1.013015e-07 2.077936e-05 7.958531e-05
#> 17     HNSCC28 2.987275e-08 3.044526e-05 2.738516e-04
#> 18      HNSCC5 2.497902e-08 6.200813e-06 1.489406e-04
#> 19      HNSCC6 2.433803e-07 1.457819e-05 6.872945e-05
#> 20      HNSCC7 1.243468e-09 3.923414e-06 3.197536e-05
#> 21      HNSCC8 2.316761e-13 3.827862e-07 6.602595e-06
#> 22  Macrophage 1.493925e-07 3.978596e-06 1.398273e-05
#> 23        Mast 5.757529e-11 1.759897e-07 2.972196e-06
#> 24     myocyte 9.239945e-10 2.153556e-07 1.784543e-06
#> 25      T cell 9.902686e-08 1.059614e-05 4.179099e-05
#> 
#>  *************** convergence status summary for cell types *************** 
#>     cell.type          min       median          max
#> 1      B cell 6.572045e-14 2.042905e-06 1.335874e-05
#> 2   Dendritic 1.257843e-07 4.954830e-06 2.266095e-05
#> 3 Endothelial 1.554095e-07 8.878092e-06 4.762081e-05
#> 4  Fibroblast 3.301680e-08 1.627585e-05 6.743852e-05
#> 5  Macrophage 1.493925e-07 3.978596e-06 1.398273e-05
#> 6   malignant 1.568093e-05 3.891631e-04 9.749845e-04
#> 7        Mast 5.757529e-11 1.759897e-07 2.972196e-06
#> 8     myocyte 9.239945e-10 2.153556e-07 1.784543e-06
#> 9      T cell 9.902686e-08 1.059614e-05 4.179099e-05
#> 
#>  ******************* end of convergence status summary ******************** 
#>  
#> scaler calculation

convergence.plot: a logical variable determining whether to visualize convergence status for cell types.
Again, using example3.RData as a demonstration, first let’s consider the scenario whenn.iter is set too low. According to the convergence plot, fraction estimates for malignant cells did not converge. This means that we should set higher value of n.iter.

InstaPrism.res = InstaPrism(input_type = 'raw',sc_Expr = scExpr_train,bulk_Expr = simulated_bulk,
                            cell.type.labels = cell.type.labels, cell.state.labels = cell.state.labels, 
                            n.iter = 20,n.core = 16,convergence.plot=T,max_n_per_plot = 100)
#> Warning in InstaPrism(input_type = "raw", sc_Expr = scExpr_train, bulk_Expr =
#> simulated_bulk, : fraction estimation didn't converge for some samples, enable
#> convergence.plot for details and try larger n.iter

Now, let’s try using a higher value of n.iter. This time we get fraction estimation for all the cell types converge (with all the abs_diff less than 0.01).

InstaPrism.res = InstaPrism(input_type = 'raw',sc_Expr = scExpr_train,bulk_Expr = simulated_bulk,
                            cell.type.labels = cell.type.labels, cell.state.labels = cell.state.labels, 
                            n.iter = 100,n.core = 16,convergence.plot=T,max_n_per_plot = 100)

n.core: number of cores to use for parallel programming. We recommend setting this parameter to higher value when multiple cores are available. InstaPrism functions that support parallel programming include: InstaPrism(), InstaPrism_update(), reconstruct_Z_ct_initial(), reconstruct_Z_ct_updated() and InstaPrism_legacy().

Table of Contents