PREDE - R package for partial reference-based deconvolution

Deconvolution of heterogeneous bulk tumor samples into distinct cell populations is an important yet challenging problem, particularly when only partial references are available. Here we developed PREDE, a partial reference deconvolution method based on iterative non-negative Matrix Factorization.

In [18]:
set.seed(123)

How to install?

1. Install the devtools package if needed

In [ ]:
install.packages("devtools")

2. Load the devtools package

In [19]:
library(devtools)

3. Install PREDE from GitHub

In [20]:
install_github("Xiaoqizheng/PREDE")
Skipping install of 'PREDE' from a github remote, the SHA1 (6d0083c8) has not changed since last install.
  Use `force = TRUE` to force installation

How to use?

In [21]:
library("PREDE")
In [22]:
## load data of lung cancer cell lines and select a number of cell lines as references
data(lung_exp)
W <- lung_exp[,1:6]
head(W)
A matrix: 6 × 6 of type dbl
A549_LUNGCAL12T_LUNGCALU1_LUNGCALU3_LUNGCALU6_LUNGCORL105_LUNG
X.13.405440 3.420208 3.345345 3.3113643.5365053.390355
HIF3A4.556388 4.406636 4.453265 4.5610664.6071094.140054
LOC1008599308.619415 8.099757 8.857715 9.0248487.8911628.416764
RNF173.754861 4.137410 3.743343 3.8373313.9032013.732548
RNF107.847993 7.400762 7.962722 7.3498977.0822058.037870
RNF119.49133810.06663011.26884010.5759609.3222829.839499

1. Generate the mixed samples

In [23]:
## generate bulk data using cell lines as reference
bulk <- generate_bulk(W,nSample =100,csd = 0.1)

2. Select the feature

In [24]:
## select top features in terms of coefficient of variation (cv) 
feat <- select_feature(mat = bulk$Y,method = "cv",nmarker = 1000,startn = 0)
In [25]:
head(feat)
  1. 'FDCSP'
  2. 'KRT23'
  3. 'CXCL5'
  4. 'GAGE3'
  5. 'KRT6A'
  6. 'BCHE'

3. Get optimal number of total cell types

In [26]:
## determine the total number of cell types by AIC, by specifying only partial reference W1
OptimalK <- GetCelltypeNum(bulk$Y[feat,],W=NULL,W1=W[feat,1:4],maxK = 10)
In [28]:
plot(5:10,OptimalK$AIC, col="red",xlab="Number of total cell types",
     ylab = "AIC",lwd = 1,type = 'b',main = "AIC")
abline(v = 6,lwd = 2,lty = 2,col = "gray")
In [29]:
## the optimal value of K with the lowest AIC 
OptimalK$K
6

4. Partial reference deconvolution (PREDE)

In [30]:
## Run PREDE with the optimal value of K
pred <- PREDE(bulk$Y[feat,],W1=W[feat,1:4],type = "GE",K=OptimalK$K,iters = 100,rssDiffStop=1e-5)
In [31]:
## Correlation in expression profiles between true and predicted cell types 
cor(W[feat,],pred$W)
A matrix: 6 × 6 of type dbl
A549_LUNGCAL12T_LUNGCALU1_LUNGCALU3_LUNG12
A549_LUNG1.00000000.46253150.49242600.36849120.42372520.4009461
CAL12T_LUNG0.46253151.00000000.40996290.38589530.35147260.3124635
CALU1_LUNG0.49242600.40996291.00000000.28846190.48608750.3620164
CALU3_LUNG0.36849120.38589530.28846191.00000000.28650350.4166320
CALU6_LUNG0.40845490.30778220.42974510.27040100.97779790.2454628
CORL105_LUNG0.43348710.38147350.44067330.43223410.34376570.9721933

The first four columns are four input known cell types. The last two columns are the predicted new cell types 1 and 2, which should be corresponding to cell lines 'CALU6_LUNG' and 'CORL105_LUNG' respectively.

In [32]:
## plot the accuracies for profile and proportion estimation
par(mar = c(3.5, 3, 1.6, 1.1), mgp = c(1.9, 0.5, 0),mfrow = c(2,2))
plot(W[feat,'CALU6_LUNG'],pred$W[,"1"],xlab = "True expression profile",pch = 19,col="#00000050",ylab = "Predicted expression profile",main = "CALU6")
plot(bulk$H[5,],pred$H["1",],xlab = "True proportion",pch = 3,col="red",ylab = "Predicted proportion",main = "CALU6")
plot(W[feat,'CORL105_LUNG'],pred$W[,"2"],xlab = "True expression profile",pch = 19,col="#00000050",ylab = "Predicted expression profile",main = "CORL105")
plot(bulk$H[6,],pred$H["2",],xlab = "True proportion",pch = 3,col="red",ylab = "Predicted proportion",main = "CORL105")
In [ ]: