This file describes the different steps to perform the first part of the single cell RNAseq data analysis training course for the EBAII n1 2023, covering these steps :
For this training session, all the computation will be done in-memory. We won’t write any result on disk, so we won’t define any output directory.
Set your working directory in your TD output directory (golf is mine!)
## Loading unfiltered 10X matrix
scmat <- Seurat::Read10X(data.dir = data_10x_dir)
## Cleaning symbols (Seurat hates '_' in feature names!)
base::rownames(scmat) <- base::gsub(
pattern = '_',
replacement = '-',
x = base::rownames(scmat))
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:8520816] 15890 31763 10736 33658 17290 31763 895 2170 3957 5611 ...
..@ p : int [1:737281] 0 1 1 2 2 3 4 4 4 4 ...
..@ Dim : int [1:2] 33694 737280
..@ Dimnames:List of 2
.. ..$ : chr [1:33694] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7" ...
.. ..$ : chr [1:737280] "AAACCTGAGAAACCAT-1" "AAACCTGAGAAACCGC-1" "AAACCTGAGAAACCTA-1" "AAACCTGAGAAACGAG-1" ...
..@ x : num [1:8520816] 1 1 1 1 1 1 1 1 1 2 ...
..@ factors : list()
Question : Do you understand the str() output here ?
## . This is absolutely normal ! The object we observe here is
## not a basic R matrix, but a 'dgCMatrix'. This is a special format
## created to store sparse interger matrices efficiently (taking)
## less CPU and time resources).
## . You can still get some information on the features and barcodes
## in @Dimnames, as well as very few count examples in @x.
[1] 33694 737280
## Getting the number of barcodes
n_bc <- ncol(scmat)
## Setting seed for RNG
base::set.seed(my_seed)
## Slicing (1:5)
scmat <- scmat[, base::sample(
x = 1:n_bc,
size = base::round(n_bc/5),
replace = FALSE)]
## New dimensions of the 1/5th slice :
base::dim(scmat)
[1] 33694 147456
To detect the margins of “true” cells versus empty droplets, we will rely on the “kneeplot” (UMIs per barcode in function of their increasingly ordered rank)
To perform the kneeplot, we need to rank the barcodes thanks to their total counts :
## Generate the kneeplot and rank statistics
bc_rank <- DropletUtils::barcodeRanks(scmat)
## Show a summary of the ranks (ordered for display convenience)
print(bc_rank[order(bc_rank$rank),])
DataFrame with 147456 rows and 3 columns
rank total fitted
<numeric> <integer> <numeric>
ATCATGGCAGCCAATT-1 1 21195 NA
GGGCATCGTAGCTTGT-1 2 18193 NA
AGTGGGATCTTAACCT-1 3 16966 NA
GTCGGGTGTGCAGTAG-1 4 16383 NA
TTAGGCAAGCCGCCTA-1 5 15406 NA
... ... ... ...
TTGAACGTCAGAGGTG-1 100998 0 NA
ACGATGTAGTTGTCGT-1 100998 0 NA
GTCAAGTTCAGTTGAC-1 100998 0 NA
TCGCGAGGTTGCGCAC-1 100998 0 NA
TATCTCAGTTACCAGT-1 100998 0 NA
Now, we can get our kneeplot :
## Plot
graphics::plot(bc_rank$rank, bc_rank$total+1, log = "xy",
xlab = "Barcode rank", ylab = "Total UMIs",
col = "black",
pch = '.', cex = 5,
main = 'Kneeplot')
Question : Looking at the kneeplot, can you roughly predict how many barcodes will be kept as cells (ie, as non-empty droplets) ?
Then, we can use these ranks to identify “true” cells :
## Identify empty droplets
base::set.seed(my_seed)
bc_rank2 <- DropletUtils::emptyDrops(scmat)
## Show a summary of the qualified ranks (ordered for display convenience)
print(bc_rank2[order(bc_rank2$Total, decreasing = TRUE),])
DataFrame with 147456 rows and 5 columns
Total LogProb PValue Limited FDR
<integer> <numeric> <numeric> <logical> <numeric>
ATCATGGCAGCCAATT-1 21195 -13101.3 9.999e-05 TRUE 0
GGGCATCGTAGCTTGT-1 18193 -12055.5 9.999e-05 TRUE 0
AGTGGGATCTTAACCT-1 16966 -11594.8 9.999e-05 TRUE 0
GTCGGGTGTGCAGTAG-1 16383 -11994.6 9.999e-05 TRUE 0
TTAGGCAAGCCGCCTA-1 15406 -10687.9 9.999e-05 TRUE 0
... ... ... ... ... ...
TTGAACGTCAGAGGTG-1 0 NA NA NA NA
ACGATGTAGTTGTCGT-1 0 NA NA NA NA
GTCAAGTTCAGTTGAC-1 0 NA NA NA NA
TCGCGAGGTTGCGCAC-1 0 NA NA NA NA
TATCTCAGTTACCAGT-1 0 NA NA NA NA
How many barcodes were identified as “true” cells ?
FALSE TRUE
1059 146397
How many “true” cells have a high confidence ?
## Using FDR p-val threshold of 0.001
keep.bc <- bc_rank2$FDR < 1E-03
base::rm(bc_rank2)
## Check how many were detected
base::table(keep.bc, useNA = "always")
keep.bc
FALSE TRUE <NA>
184 875 146397
We need to set those NAs to FALSE
## Set undetermined to FALSE
keep.bc[is.na(keep.bc)] <- FALSE
base::table(keep.bc, useNA = "always")
keep.bc
FALSE TRUE <NA>
146581 875 0
Now, we can display our true cells on the kneeplot
## Plot
graphics::plot(bc_rank$rank, bc_rank$total+1, log = "xy",
xlab = "Barcode rank", ylab = "Total UMIs",
col = ifelse(keep.bc, "red", "black"),
pch = '.', cex = ifelse(keep.bc, 7, 3),
main = paste0(
'Kneeplot (',
length(which(keep.bc)),
' barcodes kept)'))
Question : Is there something unexpected for the retained “true” cells ?
## . You may observe that the selection of "true" cells does
## not strictly correspond to a "cut" in the kneeplot
## descending curve.
## . This is because emptyDrops does not only "follow" the curve
## to determine a threshold corresponding to a "minimal count"
## value to consider a barcode as a cell, but also performs a
## statistical analysis for each barcode, considering how much
## its expression profile ressembles the one of other cells,
## even with a very different global level of expression.
SoupX can estimate the rate and composition of ambient RNA, in several ways :
Soup-contributing features (Top 20) :
| | est| counts|
|:------|---------:|------:|
|MALAT1 | 0.0321703| 17473|
|B2M | 0.0194002| 10537|
|TMSB4X | 0.0165850| 9008|
|EEF1A1 | 0.0129267| 7021|
|RPL21 | 0.0103251| 5608|
|RPS27 | 0.0100269| 5446|
|RPL13 | 0.0091542| 4972|
|RPL13A | 0.0087252| 4739|
|RPL10 | 0.0082852| 4500|
|RPLP1 | 0.0079225| 4303|
|RPL34 | 0.0078801| 4280|
|RPS12 | 0.0076481| 4154|
|RPS6 | 0.0075211| 4085|
|RPS18 | 0.0074530| 4048|
|RPS27A | 0.0073093| 3970|
|RPS2 | 0.0072965| 3963|
|RPS3A | 0.0072909| 3960|
|RPL32 | 0.0072744| 3951|
|RPL41 | 0.0067865| 3686|
|RPL11 | 0.0063446| 3446|
What’s the estimated fraction of soup in our counts ?
Soup fraction : 0.054
SoupX can remove the soup (subtract counts for identified soup genes to all barcodes/cells) using different methods :
## Run SoupX in "removal" mode
scmat_unsoup <- EBAII.n1.SC.helper::SoupX_auto(
scmat_filt = scmat_cells,
scmat_raw = scmat,
return_object = TRUE)
Counts BEFORE SoupX : 3766216
Counts AFTER SoupX : 3563040
Now, we want to characterize the effect of the removal of the ambient RNA, by comparing the pre/post matrices. In this purpose, we will construct a Seurat object from each of these, that we will describe, then plot.
But how to create a Seurat object from a count matrix ?
To describe its content, we will use a function written just for you :
Now, let’s create and characterize both versions :
## Create the Seurat object for unsouped cells
sobj <- Seurat::CreateSeuratObject(
counts = scmat_cells,
project = 'PBMC10K_UNSOUPED')
Describing
OBJECT VERSION : 5.0.0
PROJECT : [PBMC10K_UNSOUPED]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[33694 x 875] Range:[0.00-724.00]
Counts : 3766216
Sparsity : 96.14275%
SLOT 2 : [data] Dims:[33694 x 875] Range:[0.00-724.00]
Sparsity : 96.14275%
SLOT 3 : [scale.data] Dims:[0 x 0]
[DIMREDS]
[BARCODES METADATA]
orig.ident Freq
----------------- -----
PBMC10K_UNSOUPED 875
NA 0
nCount_RNA
Min. 1st Qu. Median Mean 3rd Qu. Max.
116 3069 3957 4304 5014 21195
nFeature_RNA
Min. 1st Qu. Median Mean 3rd Qu. Max.
89 1052 1242 1300 1484 3752
Creating the Seurat object
## Create the Seurat object for unsouped cells
sobj_unsoup <- Seurat::CreateSeuratObject(
counts = scmat_unsoup,
project = 'PBMC10K_SOUPED')
Describing
OBJECT VERSION : 5.0.0
PROJECT : [PBMC10K_SOUPED]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[33694 x 875] Range:[0.00-699.00]
Counts : 3563040
Sparsity : 96.35223%
SLOT 2 : [data] Dims:[33694 x 875] Range:[0.00-699.00]
Sparsity : 96.35223%
SLOT 3 : [scale.data] Dims:[0 x 0]
[DIMREDS]
[BARCODES METADATA]
orig.ident Freq
--------------- -----
PBMC10K_SOUPED 875
NA 0
nCount_RNA
Min. 1st Qu. Median Mean 3rd Qu. Max.
113 2912 3735 4072 4754 19797
nFeature_RNA
Min. 1st Qu. Median Mean 3rd Qu. Max.
88 1002 1175 1229 1400 3435
Question : Can you describe the difference ? Is it as expected ?
## . Total counts, max value, counts per cell and number of
## expressed features per cell, all have decreased.
## . The sparsity level has slightly increased (removing the soup
## obliterated all counts for some features in some cells).
## . All of this is completely expected.
What are the features most affected by the soup removal ?
## Display a soup gene
### Compute the fraction matrix (PRE / POST)
### NOTE : we have 0 counts in the divider,
### so we increment both matrix by +1 !
cont_frac <- (scmat_cells + 1) / (scmat_unsoup + 1)
### Fraction of counts decrease by soup removal, per feature
feat_frac <- sparseMatrixStats::rowMeans2(cont_frac)
## Removing matrices to free some RAM
rm(scmat_cells, scmat_unsoup, cont_frac)
### Top 5 features
utils::head(base::sort(feat_frac, decreasing = TRUE))
LYZ S100A9 S100A8 HLA-DRA CD74 CST3
1.478310 1.453317 1.384649 1.343360 1.279586 1.204137
Plotting the top1 feature
Question : Can you compare the two plots ?
## . The level expression measured in cells with smaller expression
## before SoupX has gone to almost none after : the ambient expression
## of this gene has efficiently been removed.
## . Hopefully, the cluster(s) with higher expression remain(s) high.
## . The expression scale upper bound after SoupX is higher than before ?!
## But we REMOVED some counts ?! While being counterintuitive, this is
## a positive consequence of the ambient removal on the scaling of
## barcodes expression (see later).
## . The global topology of clusters remains close (but not identical).
## . However, the compacity of clusters has increased !
Rsession
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biobase_2.62.0 BiocGenerics_0.48.0 S4Vectors_0.40.1
loaded via a namespace (and not attached):
[1] IRanges_2.36.0 R.methodsS3_1.8.2
[3] GSEABase_1.64.0 progress_1.2.2
[5] nnet_7.3-19 goftest_1.2-3
[7] DT_0.30 Biostrings_2.70.1
[9] HDF5Array_1.30.0 vctrs_0.6.4
[11] spatstat.random_3.2-1 digest_0.6.33
[13] png_0.1-8 shape_1.4.6
[15] shinyBS_0.61.1 registry_0.5-1
[17] ggrepel_0.9.4 deldir_1.0-9
[19] parallelly_1.36.0 MASS_7.3-60
[21] reshape2_1.4.4 withr_2.5.2
[23] httpuv_1.6.12 foreach_1.5.2
[25] xfun_0.41 ellipsis_0.3.2
[27] survival_3.5-7 memoise_2.0.1
[29] zoo_1.8-12 GlobalOptions_0.1.2
[31] pbapply_1.7-2 R.oo_1.25.0
[33] Formula_1.2-5 prettyunits_1.2.0
[35] KEGGREST_1.42.0 promises_1.2.1
[37] httr_1.4.7 globals_0.16.2
[39] fitdistrplus_1.1-11 rhdf5filters_1.14.0
[41] rhdf5_2.46.0 rstudioapi_0.15.0
[43] shinyAce_0.4.2 miniUI_0.1.1.1
[45] generics_0.1.3 base64enc_0.1-3
[47] curl_5.1.0 zlibbioc_1.48.0
[49] ScaledMatrix_1.10.0 polyclip_1.10-6
[51] ca_0.71.1 ExperimentHub_2.10.0
[53] GenomeInfoDbData_1.2.11 SparseArray_1.2.0
[55] RBGL_1.78.0 threejs_0.3.3
[57] interactiveDisplayBase_1.40.0 xtable_1.8-4
[59] stringr_1.5.0 doParallel_1.0.17
[61] evaluate_0.23 S4Arrays_1.2.0
[63] BiocFileCache_2.10.1 hms_1.1.3
[65] GenomicRanges_1.54.1 irlba_2.3.5.1
[67] colorspace_2.1-0 filelock_1.0.2
[69] ROCR_1.0-11 reticulate_1.34.0
[71] pcaExplorer_2.28.0 spatstat.data_3.0-3
[73] magrittr_2.0.3 lmtest_0.9-40
[75] Rgraphviz_2.46.0 later_1.3.1
[77] viridis_0.6.4 lattice_0.22-5
[79] spatstat.geom_3.2-7 NMF_0.26
[81] future.apply_1.11.0 genefilter_1.84.0
[83] SparseM_1.81 scattermore_1.2
[85] XML_3.99-0.14 scuttle_1.12.0
[87] cowplot_1.1.1 matrixStats_1.0.0
[89] RcppAnnoy_0.0.21 Hmisc_5.1-1
[91] pillar_1.9.0 nlme_3.1-163
[93] iterators_1.0.14 gridBase_0.4-7
[95] compiler_4.3.1 beachmat_2.18.0
[97] stringi_1.7.12 Category_2.68.0
[99] TSP_1.2-4 EBAII.n1.SC.helper_0.0.2
[101] tensor_1.5 SummarizedExperiment_1.30.2
[103] dendextend_1.17.1 plyr_1.8.9
[105] crayon_1.5.2 abind_1.4-5
[107] SoupX_1.6.2 locfit_1.5-9.8
[109] sp_2.1-1 bit_4.0.5
[111] dplyr_1.1.3 codetools_0.2-19
[113] BiocSingular_1.18.0 crosstalk_1.2.0
[115] bslib_0.5.1 GetoptLong_1.0.5
[117] plotly_4.10.3 mime_0.12
[119] splines_4.3.1 circlize_0.4.15
[121] Rcpp_1.0.11 dbplyr_2.4.0
[123] sparseMatrixStats_1.14.0 knitr_1.45
[125] blob_1.2.4 utf8_1.2.4
[127] clue_0.3-65 BiocVersion_3.18.0
[129] listenv_0.9.0 checkmate_2.2.0
[131] DelayedMatrixStats_1.24.0 tibble_3.2.1
[133] Matrix_1.6-1.1 statmod_1.5.0
[135] pkgconfig_2.0.3 pheatmap_1.0.12
[137] tools_4.3.1 cachem_1.0.8
[139] RSQLite_2.3.2 viridisLite_0.4.2
[141] DBI_1.1.3 celldex_1.12.0
[143] fastmap_1.1.1 rmarkdown_2.25
[145] scales_1.2.1 grid_4.3.1
[147] ica_1.0-3 Seurat_4.4.0
[149] shinydashboard_0.7.2 AnnotationHub_3.10.0
[151] sass_0.4.7 patchwork_1.1.3
[153] BiocManager_1.30.22 dotCall64_1.1-0
[155] graph_1.80.0 SingleR_2.4.0
[157] RANN_2.6.1 rpart_4.1.21
[159] farver_2.1.1 yaml_2.3.7
[161] AnnotationForge_1.44.0 MatrixGenerics_1.14.0
[163] foreign_0.8-85 cli_3.6.1
[165] purrr_1.0.2 stats4_4.3.1
[167] webshot_0.5.5 leiden_0.4.3
[169] lifecycle_1.0.3 uwot_0.1.16
[171] bluster_1.12.0 backports_1.4.1
[173] DropletUtils_1.20.0 BiocParallel_1.36.0
[175] annotate_1.80.0 gtable_0.3.4
[177] rjson_0.2.21 ggridges_0.5.4
[179] progressr_0.14.0 parallel_4.3.1
[181] limma_3.58.1 jsonlite_1.8.7
[183] edgeR_4.0.1 seriation_1.5.1
[185] bitops_1.0-7 ggplot2_3.4.4
[187] bit64_4.0.5 assertthat_0.2.1
[189] Rtsne_0.16 spatstat.utils_3.0-4
[191] BiocNeighbors_1.20.0 SeuratObject_5.0.0
[193] heatmaply_1.5.0 jquerylib_0.1.4
[195] highr_0.10 metapod_1.10.0
[197] dqrng_0.3.1 R.utils_2.12.2
[199] lazyeval_0.2.2 shiny_1.7.5.1
[201] htmltools_0.5.6.1 GO.db_3.18.0
[203] sctransform_0.4.1 rappdirs_0.3.3
[205] glue_1.6.2 spam_2.10-0
[207] XVector_0.42.0 RCurl_1.98-1.12
[209] scran_1.30.0 gridExtra_2.3
[211] igraph_1.5.1 R6_2.5.1
[213] tidyr_1.3.0 DESeq2_1.42.0
[215] SingleCellExperiment_1.22.0 labeling_0.4.3
[217] cluster_2.1.4 rngtools_1.5.2
[219] Rhdf5lib_1.24.0 GenomeInfoDb_1.38.0
[221] DelayedArray_0.28.0 tidyselect_1.2.0
[223] htmlTable_2.4.2 GOstats_2.68.0
[225] xml2_1.3.5 AnnotationDbi_1.64.0
[227] future_1.33.0 rsvd_1.0.5
[229] munsell_0.5.0 KernSmooth_2.23-22
[231] topGO_2.54.0 data.table_1.14.8
[233] htmlwidgets_1.6.2 ComplexHeatmap_2.18.0
[235] RColorBrewer_1.1-3 biomaRt_2.58.0
[237] rlang_1.1.1 spatstat.sparse_3.0-3
[239] spatstat.explore_3.2-5 fansi_1.0.5