PREDE - R package for partial reference-based deconvolution

Deconvolution of heterogeneous bulk tumor samples into distinct cell populations is an important yet challenging problem, particularly when reference profiles are available for only some of the cell types. Here we present PREDE, a partial reference-based deconvolution method built on iterative non-negative matrix factorization (NMF).

set.seed(123)

How to install?

1. Install the devtools package if needed

install.packages("devtools")

2. Load the devtools package

library(devtools)

3. Install PREDE from GitHub

install_github("Xiaoqizheng/PREDE")

Skipping install of 'PREDE' from a github remote, the SHA1 (6d0083c8) has not changed since last install.
  Use `force = TRUE` to force installation
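
The three installation steps can also be wrapped in a guard so that nothing is reinstalled when the packages are already present. The following is a minimal sketch using only `requireNamespace()`, `install.packages()` and `devtools::install_github()`; the helper names `is_installed` and `install_prede_if_missing` are hypothetical and not part of PREDE.

```r
## TRUE if the package can be loaded, FALSE otherwise (hypothetical helper)
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

## install devtools and PREDE only when they are missing (hypothetical helper)
install_prede_if_missing <- function() {
  if (!is_installed("devtools")) install.packages("devtools")
  if (!is_installed("PREDE")) devtools::install_github("Xiaoqizheng/PREDE")
  invisible(is_installed("PREDE"))
}

## uncomment to run; requires network access
# install_prede_if_missing()
```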

How to use?

library("PREDE")

## load data of lung cancer cell lines and select a number of cell lines as references
data(lung_exp)
W <- lung_exp[,1:6]
head(W)

1. Generate the mixed samples

## generate bulk data using cell lines as reference
bulk <- generate_bulk(W,nSample =100,csd = 0.1)
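
To make the simulated data easier to picture, here is a base-R sketch of one common way to mix reference profiles into bulk samples: draw random proportions that sum to one per sample, multiply them by the profile matrix, and add noise. This is an illustration only and is not the implementation of `generate_bulk`; all `*_toy` objects, the noise level and the matrix sizes are assumptions. It follows the layout used in this tutorial, with features in rows of `Y` and cell types in rows of `H`.

```r
set.seed(1)
n_gene <- 200; n_type <- 6; n_sample <- 100

## toy cell-type profiles: features in rows, cell types in columns
W_toy <- matrix(rexp(n_gene * n_type), n_gene, n_type)

## random proportions: cell types in rows, samples in columns, columns sum to 1
raw <- matrix(rexp(n_type * n_sample), n_type, n_sample)
H_toy <- sweep(raw, 2, colSums(raw), "/")

## bulk = profiles x proportions + Gaussian noise, truncated at zero
noise <- matrix(rnorm(n_gene * n_sample, sd = 0.1), n_gene, n_sample)
Y_toy <- pmax(W_toy %*% H_toy + noise, 0)

dim(Y_toy)
```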

2. Select features

## select top features in terms of coefficient of variation (cv) 
feat <- select_feature(mat = bulk$Y,method = "cv",nmarker = 1000,startn = 0)

head(feat)
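
For intuition, ranking features by coefficient of variation can be sketched in base R as the per-row standard deviation divided by the per-row mean, keeping the highest-ranked rows. This is an assumption about what CV-based selection generally looks like, not a transcription of `select_feature`; in particular the `startn` argument is not modelled, and `Y_toy`, `cv` and `top_feat` are hypothetical names.

```r
set.seed(2)
## toy bulk matrix: 500 features x 20 samples, strictly positive values
Y_toy <- matrix(rgamma(500 * 20, shape = 2), nrow = 500,
                dimnames = list(paste0("gene", 1:500), NULL))

## coefficient of variation per feature
cv <- apply(Y_toy, 1, sd) / rowMeans(Y_toy)

## names of the 50 most variable features
top_feat <- names(sort(cv, decreasing = TRUE))[1:50]
head(top_feat)
```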

3. Get optimal number of total cell types

## determine the total number of cell types by AIC, by specifying only partial reference W1
OptimalK <- GetCelltypeNum(bulk$Y[feat,],W=NULL,W1=W[feat,1:4],maxK = 10)

plot(5:10,OptimalK$AIC, col="red",xlab="Number of total cell types",
     ylab = "AIC",lwd = 1,type = 'b',main = "AIC")
abline(v = 6,lwd = 2,lty = 2,col = "gray")

## the optimal value of K with the lowest AIC 
OptimalK$K
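
The x-axis `5:10` in the plot reflects that four cell types are already known and `maxK = 10`, so the candidate totals run from 5 to 10. Picking the candidate with the lowest AIC can be sketched as follows; the AIC values here are made up purely for illustration and are not output from `GetCelltypeNum`.

```r
k_known <- 4    ## number of columns in the partial reference W1
max_k <- 10     ## same value as maxK above
candidates <- (k_known + 1):max_k

## made-up AIC values, one per candidate K (illustration only)
aic_toy <- c(120.4, 98.7, 101.2, 105.9, 111.3, 118.0)

## candidate K with the smallest AIC
best_k <- candidates[which.min(aic_toy)]
best_k
```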

4. Partial reference deconvolution (PREDE)

## Run PREDE with the optimal value of K
pred <- PREDE(bulk$Y[feat,],W1=W[feat,1:4],type = "GE",K=OptimalK$K,iters = 100,rssDiffStop=1e-5)
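
The idea of partial-reference factorization can be illustrated with a generic NMF loop in which the known reference columns are held fixed and only the unknown columns and the proportions are updated. The sketch below uses standard multiplicative updates on toy data; it is not PREDE's algorithm, it imposes no sum-to-one constraint on the proportions, and every object name in it is hypothetical.

```r
set.seed(3)
n_gene <- 100; n_sample <- 40; k_known <- 2; k_total <- 3

## toy ground truth: Y = W_true %*% H_true
W_true <- matrix(runif(n_gene * k_total, 1, 10), n_gene, k_total)
H_raw <- matrix(rexp(k_total * n_sample), k_total, n_sample)
H_true <- sweep(H_raw, 2, colSums(H_raw), "/")
Y <- W_true %*% H_true

## known reference columns plus a random start for the unknown column
W1 <- W_true[, 1:k_known, drop = FALSE]
W <- cbind(W1, matrix(runif(n_gene * (k_total - k_known), 1, 10), n_gene))
H <- matrix(1 / k_total, k_total, n_sample)

eps <- 1e-9
unknown <- (k_known + 1):k_total
rss <- numeric(200)
for (i in 1:200) {
  ## multiplicative update of all proportions
  H <- H * (t(W) %*% Y) / (t(W) %*% W %*% H + eps)
  ## multiplicative update of W, applied to the unknown columns only
  W_new <- W * (Y %*% t(H)) / (W %*% H %*% t(H) + eps)
  W[, unknown] <- W_new[, unknown]
  ## residual sum of squares after this iteration
  rss[i] <- sum((Y - W %*% H)^2)
}
```

After the loop, `cor(W_true[, 3], W[, 3])` gives a quick check of how well the unknown profile was recovered in this toy setting.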

## Correlation in expression profiles between true and predicted cell types 
cor(W[feat,],pred$W)

In the resulting correlation matrix, the first four columns are the four known input cell types. The last two columns are the newly predicted cell types 1 and 2, which should correspond to the cell lines 'CALU6_LUNG' and 'CORL105_LUNG', respectively.
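
Reading the best match off a correlation matrix can be automated by taking, for each predicted column, the true cell type with the highest correlation. This is a minimal sketch on toy data; `truth`, `pred_toy` and `best_match` are hypothetical names, and the same `apply(..., 2, which.max)` pattern could be applied to the matrix returned by `cor(W[feat,],pred$W)`.

```r
set.seed(4)
## toy true profiles for three cell types
truth <- matrix(rnorm(300 * 3), 300, 3, dimnames = list(NULL, c("A", "B", "C")))

## toy predictions: noisy copies of C and A, labelled "1" and "2"
pred_toy <- truth[, c("C", "A")] + matrix(rnorm(300 * 2, sd = 0.2), 300, 2)
colnames(pred_toy) <- c("1", "2")

## rows: true cell types, columns: predicted cell types
cc <- cor(truth, pred_toy)

## for each predicted column, the true cell type with the highest correlation
best_match <- rownames(cc)[apply(cc, 2, which.max)]
names(best_match) <- colnames(cc)
best_match
```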

## plot the accuracies for profile and proportion estimation
par(mar = c(3.5, 3, 1.6, 1.1), mgp = c(1.9, 0.5, 0),mfrow = c(2,2))
plot(W[feat,'CALU6_LUNG'],pred$W[,"1"],xlab = "True expression profile",pch = 19,col="#00000050",ylab = "Predicted expression profile",main = "CALU6")
plot(bulk$H[5,],pred$H["1",],xlab = "True proportion",pch = 3,col="red",ylab = "Predicted proportion",main = "CALU6")
plot(W[feat,'CORL105_LUNG'],pred$W[,"2"],xlab = "True expression profile",pch = 19,col="#00000050",ylab = "Predicted expression profile",main = "CORL105")
plot(bulk$H[6,],pred$H["2",],xlab = "True proportion",pch = 3,col="red",ylab = "Predicted proportion",main = "CORL105")
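
The scatter plots can be complemented by numeric summaries such as the Pearson correlation and the root-mean-square error between true and predicted proportions. The sketch below uses toy vectors so that it runs on its own; `true_prop` and `pred_prop` are hypothetical stand-ins for a row of `bulk$H` and the matching row of `pred$H`.

```r
set.seed(5)
## toy true proportions and noisy predictions clipped to [0, 1]
true_prop <- runif(100)
pred_prop <- pmin(pmax(true_prop + rnorm(100, sd = 0.05), 0), 1)

## agreement between truth and prediction
accuracy <- c(pearson = cor(true_prop, pred_prop),
              rmse = sqrt(mean((true_prop - pred_prop)^2)))
round(accuracy, 3)
```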

Output of head(W) from the data-loading step:

	A549_LUNG	CAL12T_LUNG	CALU1_LUNG	CALU3_LUNG	CALU6_LUNG	CORL105_LUNG
X.1	3.405440	3.420208	3.345345	3.311364	3.536505	3.390355
HIF3A	4.556388	4.406636	4.453265	4.561066	4.607109	4.140054
LOC100859930	8.619415	8.099757	8.857715	9.024848	7.891162	8.416764
RNF17	3.754861	4.137410	3.743343	3.837331	3.903201	3.732548
RNF10	7.847993	7.400762	7.962722	7.349897	7.082205	8.037870
RNF11	9.491338	10.066630	11.268840	10.575960	9.322282	9.839499

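Output of cor(W[feat,],pred$W) (rows: true cell lines; columns: the four known references followed by predicted cell types 1 and 2):

	A549_LUNG	CAL12T_LUNG	CALU1_LUNG	CALU3_LUNG	1	2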
A549_LUNG	1.0000000	0.4625315	0.4924260	0.3684912	0.4237252	0.4009461
CAL12T_LUNG	0.4625315	1.0000000	0.4099629	0.3858953	0.3514726	0.3124635
CALU1_LUNG	0.4924260	0.4099629	1.0000000	0.2884619	0.4860875	0.3620164
CALU3_LUNG	0.3684912	0.3858953	0.2884619	1.0000000	0.2865035	0.4166320
CALU6_LUNG	0.4084549	0.3077822	0.4297451	0.2704010	0.9777979	0.2454628
CORL105_LUNG	0.4334871	0.3814735	0.4406733	0.4322341	0.3437657	0.9721933