EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (I)
-
Load a count matrix, empty droplets & ambient RNA filtering
EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (I)
-
Load a count matrix, empty droplets & ambient RNA filtering
1 PREAMBLE
1.1 Purpose of this session
This file describes the different steps to perform the first part of the single cell RNAseq data analysis training course for the EB3I n1 2025, covering these steps :
Load a single cell raw counts matrix
Remove barcodes from empty droplets
Estimate and remove ambient RNA contamination (“soup”)
1.2 An important notice
These early (but mandatory) steps of the analysis are not covered by the renowned, higher-level analysis frameworks like Seurat nor scran/scater.
Thus, we will write/use a certain amount of lines of code, some of which of a higher complexity for R beginners.
Newcomer, do not fear !
While you are welcome to try writing some by yourself, depending on your R knowledge and self-confidence in it …
… you do not have to know how these code work …
… thus you should not feel ashamed copy-pasting the pre-written code hereafter …
… but in both cases, please try your best understanding what these code do, and even more, why !
In this purpose, you may use whatever help you can by searching.
1.3 What to run ?
This training presentation, written in Rmarkdown, contains many chunks ( = code blocks)
You do not have to run them all !
Chunks follow a color code :
1.3.1 Chunks you ARE expected to run :
Show output
[1] "/shared/projects/2538_eb3i_n1_2025/atelier_scrnaseq/RMD/Preproc.1"
1.3.2 Chunks you ARE NOT expected to run
Hopefully, we will demo them. You may run them by yourself, young padawan, but if you get late or worst, break something, you’ll be the one to blame !
Show output
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] digest_0.6.37 R6_2.6.1 bookdown_0.39 fastmap_1.2.0
[5] xfun_0.52 cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1
[9] rmarkdown_2.29 lifecycle_1.0.4 cli_3.6.5 sass_0.4.10
[13] jquerylib_0.1.4 compiler_4.4.1 rstudioapi_0.17.1 tools_4.4.1
[17] evaluate_1.0.3 bslib_0.9.0 rmdformats_1.0.4 yaml_2.3.10
[21] jsonlite_2.0.0 rlang_1.1.6
1.3.3 Questions
Some questions we would like you to answer during the training course. Please be fair and do not take a look at the answer before we tell you so :)
1.3.4 Answers
They usually follow questions…
# chunk_answer
cat("The Answer to the Great Question... is ...
Forty-two. Six by nine : forty-two. That's it.
That's all there is. (I always thought something was
fundamentally wrong with the universe.)")Show output
The Answer to the Great Question... is ...
Forty-two. Six by nine : forty-two. That's it.
That's all there is. (I always thought something was
fundamentally wrong with the universe.)
2 Start Rstudio
- Using the OpenOnDemand/Rstudio cheat sheet, connect to the OpenOnDemand portal and create a Rstudio session with the right resource requirements.
3 Warm-up
- We now set common parameters as new variables, once and for all for this session :
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))Show output
PATH CHECK : TRUE
4 Prepare the data structure
During the current and remaining training sessions, we will work on different data objects, some retrieved from external resources, some we will create.
In order to reuse these for further steps, or other sessions, we need to store them, thus create a directory structure on the cluster.
We will create this structure in your project folder (the one you registered at the IFB Core, in which you stored your own data for the tutoring).
To ensure that you will always use the same structure for your work, the code hereafter has been writen so that you just have to define a single variable that corresponds to your project name.
- In MY case, as a SC training teacher, it is
ebaii_sc_teachers.
- In MY case, as a SC training teacher, it is
4.1 Main directory
- Create the training course directory
# maindir
## Prepare the path in a "TD_dir" variable
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Create the root directory
dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD"
4.2 Current session
# sessiondir
## Create the session (Preproc.1) directory
session_dir <- paste0(TD_dir, "/01_Preproc.1")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/01_Preproc.1"
4.3 Input data directory
# indir
## Create the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/01_Preproc.1/DATA"
4.4 Output results directory
# outdir
## Create the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)
## Print the output directory on-screen
print(output_dir)[1] "/shared/projects/ebaii_sc_teachers/SC_TD/01_Preproc.1/RESULTS"
Note : Actually, for this session we won’t export any
result, so creating this RESULTS folder was just for practice
purpose.
Now, the we have everything needed to :
load data (
input_dir)save the results we will generate (
output_dir).
5 Additional resources
However, some upcoming steps require a bit complicated methods that are not fun to write, nor copy-paste multiple times. So, we will now prepare code functions that will ease these steps. After executing these functions code, we may call (invoke) them when needed.
5.1 Functions
5.1.1 SoupX helper function
This function was written to ease the use of SoupX on a count matrix
It was tested on SoupX=1.6.2 with igraph=1.5.1, Seurat=4.4.0,5.1.0
Parameters :
scmat_filt: (sparse)matrix corresponding to an empty droplets -filtered count matrixscmat_raw: (sparse)matrix corresponding to an NON- empty droplets -filtered count matrixsoupQuantile,contaminationRange: see?SoupX::autoEstContcontaminationRange: range of the expected soup proportionsoupRange: estimate soup fraction from features with total reads comprised in this range of counts (only used whenscmat_raw != NULL)return_object: ifTRUE, return the “unsouped” count matrix ; ifFALSE, only perform soup estimation and return the estimated proportion value (rho)doPlot: Perform the SoupX estimation plot (rho distribution)
# soupx_func
SoupX_auto <- function(scmat_filt = NULL, scmat_raw = NULL, soupQuantile = 0.9,
contaminationRange = c(.01, .8), soupRange = c(0,100),
return_object = FALSE, doPlot = FALSE) {
## Checks
if(is.null(scmat_filt)) stop('A filtered count matrix is required !')
if(is.null(scmat_raw)) message('No unfiltered raw counts matrix provided. Estimation will be based on filtered matrix only.')
## If no raw matrix
if (is.null(scmat_raw)) {
spChanRaw <- SoupX::SoupChannel(
tod = scmat_filt,
toc = scmat_filt,
calcSoupProfile = FALSE)
sc_rowsum <- sparseMatrixStats::rowSums2(scmat_filt)
spProf <- data.frame(
row.names = rownames(scmat_filt),
est = sc_rowsum/sum(scmat_filt),
counts = sc_rowsum)
spChan <- SoupX::setSoupProfile(spChanRaw, spProf)
} else {
spChan <- SoupX::SoupChannel(
tod = scmat_raw,
toc = scmat_filt,
calcSoupProfile = FALSE)
if (min(spChan$nDropUMIs) > max(soupRange)) stop(
'Minimum found counts per barcode is : ',
min(spChan$nDropUMIs),
', which is smaller than the upper bound of soupRange ! Please increase soupRange max !')
spChan <- SoupX::estimateSoup(sc = spChan, soupRange = soupRange)
}
## Display Top 20 contributing genes
if (!return_object) {
cat('\nSoup-contributing features (Top 20) :\n')
print(knitr::kable(head(
spChan$soupProfile[order(spChan$soupProfile$est, decreasing = TRUE), ],
n = 20)))
}
## Quick clustering needed
spClust <- scran::quickCluster(scmat_filt, method = "igraph")
## Adding clusters to the SoupChannel object
spChan <- SoupX::setClusters(sc = spChan, clusters = spClust)
## Estimating soup
sX <- SoupX::autoEstCont(sc = spChan, doPlot = doPlot, tfidfMin = 1,
soupQuantile = soupQuantile, maxMarkers = 100,
contaminationRange = contaminationRange,
rhoMaxFDR = .2, priorRho = .05, priorRhoStdDev = .1,
forceAccept = FALSE)
## Removing soup (adjusting counts)
if(return_object) {
cat('Counts BEFORE SoupX : ', sum(scmat_filt), '\n')
scmat_soupx <- SoupX::adjustCounts(
sX, method = 'subtraction', roundToInt = TRUE,
tol = .001, pCut = .01)
cat('Counts AFTER SoupX : ', sum(scmat_soupx), '\n')
rm(scmat_filt)
return(scmat_soupx)
} else return(sX$fit$rhoEst)
}6 Load a 10X Cell Ranger raw count matrix
6.1 The dataset
The data we will work on is hosted on Zenodo. To ease its retrieval, we can define its ID as a variable
Here will we train ourselves to load into R a single cell RNAseq data produced by 10X Genomics’ software
Cell Ranger.We will work with a public dataset provided by the manufacturer, that consists in ~ 10,000 PBMC (peripheral bone marrow cells) from a human donor (downsampled to 1/10th, for smaller computation time).
The experiment was performed with the 3’ capture kit v3.
The analysis was performed with
Cell Ranger v3, mapping on theGRCh38-2020-Amanufacturer reference.The data consists into the typical 10x 3-files structure, hosted in a Zenodo respository (Id : 14034221)
barcodes.tsv.gz: the list of putative droplet barcodesfeatures.tsv.gz: the list of features (genes)matrix.mtx.gz: the feature x barcode expression count matrix
6.2 Download data
The relatively complex code chunk hereafter has been written to :
let you download the data files from :
Zenodo by default
if
backup = TRUE, a local backup instead (in case we loose the internet connection)
if the requested data are already available locally :
load this from the local source
if
force = TRUEa new download is forced
# dlzen10x
### Named files (will be used later on !)
mtx_file <- "matrix.mtx.gz"
features_file <- "features.tsv.gz"
barcodes_file <- "barcodes.tsv.gz"
summary_file <- "pbmc_10k_v3_summary.html"
## Filename(s) to retrieve
toget_files <- c(mtx_file,
features_file,
barcodes_file,
summary_file)
## Folder to store retrieved files
local_folder <- input_dir
## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")
## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")
### Define remote folder
remote_folder <- if (backup) paste("/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/10X/") else paste0("https://zenodo.org/records/", zen_id, "/files/")
### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)
### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)
## Retrieve files (if they don't exist), in loop
for (tg in seq_along(toget_files)) {
## If the file does not locally exist
if (!file.exists(local_path[tg]) | force) {
## Retrieve data
if(backup) {
file.copy(from = remote_path[tg],
to = local_path[tg])
} else {
download.file(url = remote_path[tg],
destfile = local_path[tg])
}
## Check if downloaded files exist locally
if(file.exists(local_path[tg])) message("\tOK")
} else message(paste0(toget_files[tg], " already downloaded !"))
}matrix.mtx.gz already downloaded !
features.tsv.gz already downloaded !
barcodes.tsv.gz already downloaded !
pbmc_10k_v3_summary.html already downloaded !
6.3 Load into R
Load the 10X data files into R :
- A good starter
Show output
https://lmddgtfy.net/?q=load%2010x%20data%20in%20R- Back to your local help ressource
We can now apply it to our data :
However, we don’t know what the loaded object actually looks like in R…
Questions :
- Do you know a way to …
Show output
[1] "dgCMatrix" "CsparseMatrix" "dsparseMatrix" "generalMatrix" [5] "AnyMatrix" "V3Matrix" "dMatrix" "sparseMatrix" [9] "Matrix"Show output
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots ..@ i : int [1:1718404] 7922 18295 23126 24791 31764 32866 13690 17058 2511 33306 ... ..@ p : int [1:147457] 0 0 0 0 0 0 6 7 8 8 ... ..@ Dim : int [1:2] 33694 147456 ..@ Dimnames:List of 2 .. ..$ : chr [1:33694] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7" ... .. ..$ : chr [1:147456] "TCCCGATAGCCTTGAT-1" "AGATCTGAGACTGTAA-1" "GCTTGAAGTGCGAAAC-1" "ATCATGGGTATTCGTG-1" ... ..@ x : num [1:1718404] 1 1 1 1 1 1 1 1 1 1 ... ..@ factors : list()Not that meaningful, though… At least we know we have a matrix, so one can get its dimensions !
Show output
[1] 33694 147456
6.4 Features cleanup
Seurat does not like at all underscores (_) in feature names …
We will have to :
control if our dataset contains some ? Here, we will use the
base::grep()function that looks in character stringsxif they contain a provided set of characters as apattern, and returns their position (whenvalue = FALSE, default) or value itself (value = TRUE)# us_check1 ## Check feature names with underscore(s) base::grep(pattern = '_', x = rownames(scmat), value = TRUE)Show output
[1] "RP11-442N24__B.1" "RP11-99J16__A.2" "RP11-59D5__B.2" "RP11-445L13__B.3" [5] "RP11-544L8__B.4" "XXyac-YX65C7_A.2" "XXyac-YX65C7_A.3" "RP11-524D16__A.3" [9] "XX-DJ76P10__A.2" "RP11-1157N2__B.2" "RP1-213J1P__B.1" "RP1-213J1P__B.2" [13] "RP4-633O19__A.1" "RP4-754E20__A.5" "CTA-280A3__B.2"if so, clean it ! Here we will use the
base::gsub()function that looks for apatternin a character stringxand substitutes it with areplacementcharacter string.then, control the efficiency of our cleaning (with a new use of
base::grep)Show output
character(0)
7 Empty droplets
While a large part of our barcodes contain no count at all, barcodes with at least one total count may not mandatorily correspond to true cells, due to ambient RNA.
Here we want to identify empty droplets, then remove them from the matrix.
7.1 Detection
- To detect the margins of “true” cells versus empty droplets, we will
rely on the so-called
double-kneeplot: a plot of UMI counts per barcode, in function of their increasingly ordered rank.
- To draw such plot, we need to rank the barcodes thanks to their total counts.
7.1.1 Ranking barcodes
We will use the DropletUtils package
Check the ranking of barcodes, according to their counts
# bcrank_top ## Show a summary of the ranks (ordered for display convenience) print(bc_rank[order(bc_rank$rank),])DataFrame with 147456 rows and 3 columns rank total fitted <numeric> <integer> <numeric> ATCATGGCAGCCAATT-1 1 21195 NA GGGCATCGTAGCTTGT-1 2 18193 NA AGTGGGATCTTAACCT-1 3 16966 NA GTCGGGTGTGCAGTAG-1 4 16383 NA TTAGGCAAGCCGCCTA-1 5 15406 NA ... ... ... ... TTGAACGTCAGAGGTG-1 100998 0 NA ACGATGTAGTTGTCGT-1 100998 0 NA GTCAAGTTCAGTTGAC-1 100998 0 NA TCGCGAGGTTGCGCAC-1 100998 0 NA TATCTCAGTTACCAGT-1 100998 0 NA
7.1.2 Kneeplot
Now, we can draw our kneeplot :
# kneeplot1
## Plot
graphics::plot(x = bc_rank$rank,
y = bc_rank$total + 1,
log = "xy",
xlab = "Barcode rank",
ylab = "Total UMIs",
col = "black",
pch = ".",
cex = 5,
main = "Kneeplot")Question :
# q_knee Looking at the kneeplot, can you roughly predict how many barcodes will be kept as cells (ie, as non-empty droplets) ?
Then, we can use these ranks to identify “true” cells :
The function we will use relies on random number generation, so we will fix the RNG seed, for reproducibility
# emptydrops1
## Set the RNG seed
set.seed(my_seed)
## Identify empty droplets
bc_rank2 <- DropletUtils::emptyDrops(scmat)
## Whats is the structure of the returned output ?
str(bc_rank2)Show output
Formal class 'DFrame' [package "S4Vectors"] with 6 slots
..@ rownames : chr [1:147456] "TCCCGATAGCCTTGAT-1" "AGATCTGAGACTGTAA-1" "GCTTGAAGTGCGAAAC-1" "ATCATGGGTATTCGTG-1" ...
..@ nrows : int 147456
..@ elementType : chr "ANY"
..@ elementMetadata: NULL
..@ metadata :List of 5
.. ..$ lower : num 100
.. ..$ niters : num 10000
.. ..$ ambient: num [1:16738, 1] 8.43e-07 3.38e-05 9.91e-06 8.43e-07 8.43e-07 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:16738] "RP11-34P13.7" "FO538757.2" "AP006222.2" "RP4-669L17.10" ...
.. .. .. ..$ : NULL
.. ..$ alpha : num 4171
.. ..$ retain : num 2611
..@ listData :List of 5
.. ..$ Total : int [1:147456] 0 0 0 0 0 6 1 1 0 0 ...
.. ..$ LogProb: num [1:147456] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ PValue : num [1:147456] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ Limited: logi [1:147456] NA NA NA NA NA NA ...
.. ..$ FDR : num [1:147456] NA NA NA NA NA NA NA NA NA NA ...
We can show a summary of the ordered barcodes :
# emptysmry
## Show a summary of the qualified ranks (ordered for display convenience)
print(bc_rank2[order(bc_rank2$Total, decreasing = TRUE),])Show output
DataFrame with 147456 rows and 5 columns
Total LogProb PValue Limited FDR
<integer> <numeric> <numeric> <logical> <numeric>
ATCATGGCAGCCAATT-1 21195 -13101.3 9.999e-05 TRUE 0
GGGCATCGTAGCTTGT-1 18193 -12055.5 9.999e-05 TRUE 0
AGTGGGATCTTAACCT-1 16966 -11594.8 9.999e-05 TRUE 0
GTCGGGTGTGCAGTAG-1 16383 -11994.6 9.999e-05 TRUE 0
TTAGGCAAGCCGCCTA-1 15406 -10687.9 9.999e-05 TRUE 0
... ... ... ... ... ...
TTGAACGTCAGAGGTG-1 0 NA NA NA NA
ACGATGTAGTTGTCGT-1 0 NA NA NA NA
GTCAAGTTCAGTTGAC-1 0 NA NA NA NA
TCGCGAGGTTGCGCAC-1 0 NA NA NA NA
TATCTCAGTTACCAGT-1 0 NA NA NA NA
7.1.3 Selection
Barcodes considered as empty droplets have no p-value (FDR = NA)
# bcr2_na
## Creating a validation column (empty droplets have NA as p-value)
bc_rank2$VALID <- !is.na(bc_rank2$FDR)
## Quantify "real cells"
table(bc_rank2$VALID)Show output
FALSE TRUE
146397 1059
We can control the stringency by setting a FDR-adjusted p-value threshold
Show output
[1] 0.001
## We've set the max p-value to 1E-03
bc_rank2$VALID[bc_rank2$FDR >= max_p] <- FALSE
## Control
table(bc_rank2$VALID)Show output
FALSE TRUE
146581 875
Now, we can display our true cells on the kneeplot :
# kneeplot2
## Plot with highlight of the selected barcodes (red)
graphics::plot(x = bc_rank$rank,
y = bc_rank$total+1,
log = "xy",
xlab = "Barcode rank",
ylab = "Total UMIs",
col = ifelse(bc_rank2$VALID, "red", "black"),
pch = ".",
cex = ifelse(bc_rank2$VALID, 7, 3),
main = paste0("Kneeplot (",
length(which(bc_rank2$VALID)),
" barcodes kept)"))Show plot
Question :
# a_kneered ## . You may observe that the selection of "true" cells (red) ## does not strictly correspond to a "cut" in the kneeplot ## descending curve. ## ## . This is because emptyDrops does not only "follow" the curve ## to determine a threshold corresponding to a "minimal count" ## value to consider a barcode as a cell, but also performs a ## statistical analysis for each barcode, considering how much ## its expression profile ressembles the one of other cells, ## even with a very different (lower) global level of expression.
7.2 Removal
We just need to restrict our count matrix to “true” cells.
How many barcodes do we have at the moment ?
Show output
[1] 33694 147456Filter empty droplets out :
# edfilter ## Restrict to valid barcodes scmat_cells <- scmat[, bc_rank2$VALID] ## Control dim(scmat_cells)Show output
[1] 33694 875
8 Ambient RNA
The counts measured in this matrix of (now considered “true”) single cells do not reflect the measurement of these cells expression only
Due to other over-lysed / dying cells in the emulsion, we also measured fraction of ambient RNA that adds up to the “real” expression
We may estimate it by observing counts existing at a minimal level across cells that greatly differ in their expression profile
Just in case, due to the fun tool name we will use,
SoupX, I’ll write down “soup” as an alias for “ambient RNA”, but both terms have the same meaning.
8.1 Estimate the “soup”
SoupX can estimate the rate and composition of ambient RNA, in several ways :
Manual mode : providing a list of features expected to reflect the soup (genes expressed at high levels in a majority of the expected cell types)
Automatic modes, using the empty droplets-filtered raw count matrix …
… along with the unfiltered matrix (ie, containing all quantified droplets). This is the most efficient mode.
… solely : this is less efficient, but useful when the unfiltered matrix is not available.
Let’s run SoupX, just asking for an estimation of the “soup” proportion :
# soupxrhoE
## TIP : In case of error :
# install.packages("irlba", type="source", force=TRUE)
## SoupX for ambient fraction estimation
soup_frac <- SoupX_auto(
scmat_filt = scmat_cells,
scmat_raw = scmat)Show output
Soup-contributing features (Top 20) :
| | est| counts|
|:------|---------:|------:|
|MALAT1 | 0.0321703| 17473|
|B2M | 0.0194002| 10537|
|TMSB4X | 0.0165850| 9008|
|EEF1A1 | 0.0129267| 7021|
|RPL21 | 0.0103251| 5608|
|RPS27 | 0.0100269| 5446|
|RPL13 | 0.0091542| 4972|
|RPL13A | 0.0087252| 4739|
|RPL10 | 0.0082852| 4500|
|RPLP1 | 0.0079225| 4303|
|RPL34 | 0.0078801| 4280|
|RPS12 | 0.0076481| 4154|
|RPS6 | 0.0075211| 4085|
|RPS18 | 0.0074530| 4048|
|RPS27A | 0.0073093| 3970|
|RPS2 | 0.0072965| 3963|
|RPS3A | 0.0072909| 3960|
|RPL32 | 0.0072744| 3951|
|RPL41 | 0.0067865| 3686|
|RPL11 | 0.0063446| 3446|
NOTE : We can observe MALAT1 as the top first gene (as well as multiple riboprotein-coding genes).
What’s the estimated fraction of soup in our counts ?
Show output
Soup fraction : 0.054
Question :
8.2 Remove the “soup”
SoupX can remove the soup (subtract counts for identified soup genes to all barcodes/cells) using different methods :
Automatically (safer)
Specifying a fraction of counts to remove.
Pros :
By removing X % of reads :
if soup was effectively present (at a rate <= X), this will mostly affect soup genes first.
If not, it will decrease counts globally, thus have negligible effect in differences between cells / cell types.
Cons :
Counts are already rare in droplet-based single cell technologies… Removing them “by default” makes them even more rare…
If actual soup rate is higher than the defined rate X to remove, this mode will remain uneffective…
# soupxrm
## Run SoupX in "removal" mode
scmat_unsoup <- SoupX_auto(
scmat_filt = scmat_cells,
scmat_raw = scmat,
return_object = TRUE)Show output
Counts BEFORE SoupX : 3766216
Show output
Counts AFTER SoupX : 3563040
8.3 Effect of the “soup” removal
Now, we want to characterize the effect of the removal of the ambient RNA, by comparing the pre- and post- soupX matrices.
An easy way is to describe, then visualize such an effect.
8.3.1 Describe and visualize
8.3.1.1 Before SoupX (DEMO)
We can describe our count matrix thanks to some useful metrics :
- Total counts
Show output
[1] 3766216
- Sparsity (fraction of
0s)
# descBS_sparsity
## Compute sparsity (number of matrix values at 0 / cross-product of matrix dim)
base::sum(sparseMatrixStats::colCounts(
x = scmat_cells, value = 0)
) / base::prod(dim(scmat_cells))Show output
[1] 0.9614275
- Distribution of counts in cells
# descBS_nCount
## Sums up counts for each cell
base::summary(sparseMatrixStats::colSums2(x = scmat_cells, na.rm = TRUE))Show output
Min. 1st Qu. Median Mean 3rd Qu. Max.
116 3069 3957 4304 5014 21195
- Distribution of expressed genes in cells
# descBS_nFeature
## All genes minus 0-count genes
base::summary(nrow(scmat_cells) - sparseMatrixStats::colCounts(
x = scmat_cells,
value = 0,
na.rm = TRUE))Show output
Min. 1st Qu. Median Mean 3rd Qu. Max.
89 1052 1242 1300 1484 3752
8.3.1.2 After SoupX
We can describe our count matrix thanks to some useful metrics :
- Total counts
Show output
[1] 3563040
- Sparsity (fraction of
0s)
# descAS_sparsity
## Compute sparsity (number of matrix values at 0 / cross-product of matrix dim)
base::sum(sparseMatrixStats::colCounts(
x = scmat_unsoup, value = 0)
) / base::prod(dim(scmat_unsoup))Show output
[1] 0.9635223
- Distribution of counts in cells
# descAS_nCount
## Sums up counts for each cell
base::summary(sparseMatrixStats::colSums2(x = scmat_unsoup, na.rm = TRUE))Show output
Min. 1st Qu. Median Mean 3rd Qu. Max.
113 2912 3735 4072 4754 19797
- Distribution of expressed genes in cells
# descAS_nFeature
## Compute sparsity (number of matrix values at 0 / cross-product of matrix dim)
base::summary(nrow(scmat_cells) - sparseMatrixStats::colCounts(
x = scmat_unsoup,
value = 0,
na.rm = TRUE))Show output
Min. 1st Qu. Median Mean 3rd Qu. Max.
88 1002 1175 1229 1400 3435
Questions :
# q_descdiff Can you describe the difference (especially for counts and sparsity) ? Was it expected ?
What are the features most affected by the soup removal ?
# soupfeatures
## Compute the fraction matrix (PRE / POST)
## NOTE : we may have 0 counts in the divider,
## so we increment both matrix by +1 !
cont_frac <- (scmat_cells + 1) / (scmat_unsoup + 1)
## Fraction of counts decreased by soup removal, per feature (ordered)
feat_frac <- sort(sparseMatrixStats::rowMeans2(cont_frac), decreasing = TRUE)
### Display Top 5 features
utils::head(feat_frac, decreasing = TRUE)Show output
LYZ S100A9 S100A8 HLA-DRA CD74 CST3
1.478310 1.453317 1.384649 1.343360 1.279586 1.204137
Hint : fraction or difference ?
# soupfeaturesdiff
## Compute the DIFFERENCE matrix (PRE / POST)
## NOTE : we may have 0 counts in the divider,
## so we increment both matrix by +1 !
cont_dif <- scmat_cells - scmat_unsoup
## Fraction of counts decreased by soup removal, per feature (ordered)
feat_dif <- sort(sparseMatrixStats::rowMeans2(cont_dif), decreasing = TRUE)
### Display Top 5 features
utils::head(feat_dif, decreasing = TRUE)Show output
MALAT1 B2M TMSB4X EEF1A1 RPL21 RPS27
7.533714 4.528000 3.885714 3.060571 2.435429 2.372571
8.3.2 Plot the “top 1” feature
We want to visualize the expression of the “top” soup gene, LYZ, in our dataset.
To do so, we will generate a projection of the data into a reduced (2-D) mathematical space. This is done by using a series of R function from the Seurat package, that we won’t describe here and now : understanding these commands below is not the scope of our current session, but it will be during the next few days.
Actually, these commands perform all the steps we’ll see in the next few days, in a row, but with default values (thus, almost certainly inadequate to our current data).
In consequence, please just consider the output as a way to quickly and dirtily visualize the data : the depicted topology should be considered as very rough and mostly imperfect.
8.3.2.1 Before SoupX
# bsoupx
## Roll process, using a temporary "sobj_presoupx" Seurat object
sobj_presoupx <- Seurat::CreateSeuratObject(counts = scmat_cells, project = "PBMC10K_BeforeSoupX")
sobj_presoupx <- Seurat::NormalizeData(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::ScaleData(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::FindVariableFeatures(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::RunPCA(object = sobj_presoupx, npcs = 21, verbose = FALSE)
sobj_presoupx <- Seurat::RunUMAP(object = sobj_presoupx, dims = c(1:20), verbose = FALSE)
## Top1 feature plot
fp_pre <- Seurat::FeaturePlot(object = sobj_presoupx, features = 'LYZ') + Seurat::DarkTheme()
print(fp_pre)Show plot
8.3.2.2 After SoupX
# asoupx
## Roll process, using a temporary "sobj_postsoupx" Seurat object
sobj_postsoupx <- Seurat::CreateSeuratObject(counts = scmat_unsoup, project = "PBMC10K_AfterSoupX")
sobj_postsoupx <- Seurat::NormalizeData(object = sobj_postsoupx, verbose = FALSE)
sobj_postsoupx <- Seurat::ScaleData(object = sobj_postsoupx, verbose = FALSE)
sobj_postsoupx <- Seurat::FindVariableFeatures(object = sobj_postsoupx, verbose = FALSE)
sobj_postsoupx <- Seurat::RunPCA(object = sobj_postsoupx, npcs = 21, verbose = FALSE)
sobj_postsoupx <- Seurat::RunUMAP(object = sobj_postsoupx, dims = c(1:20), verbose = FALSE)
## Top1 feature plot
fp_post <- Seurat::FeaturePlot(object = sobj_postsoupx, features = 'LYZ') + Seurat::DarkTheme()
print(fp_post)Show plot
Merging plots for ease of use :
# soupx_umaps
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
list(
fp_pre & ggplot2::ggtitle(label = "LYZ before SoupX"),
fp_post & ggplot2::ggtitle(label = "LYZ after SoupX")),
nrow = 1)Show plot
Question
# q_umapsoupx Can you compare the two plots : - for the LYZ expression across the 2-D space ? - about the space topologies ? - for the LYZ expression globally ?# a_umapsoupx ## . The level of expression measured in cells with smaller expression ## before SoupX has gone to almost none after : the ambient expression ## of this gene has efficiently been removed. ## ## . Hopefully, the cluster(s) with higher expression remain(s) high. ## ## . The expression scale upper bound after SoupX is higher than before ?! ## But we REMOVED some counts ?! While being counter-intuitive, this is ## a positive consequence of the ambient removal on the scaling of ## barcodes expression (see later). ## ## . The range of both X and Y axes have increased, depicting a better ## separation of / distance between some of the clusters. ## ## . The global topology of clusters remains close (but not identical). ## ## . Clusters compacity has increased.
8.4 Conclusion
Removing ambient RNA has a positive effect :
on the quality of the expression level measurements
on the observed topology
… despite an estimation of only ~5% !
Actually, ambient RNA level and composition is (one of) the major source(s) of the batch effect bias that may alter the integration of different samples.
Beyond :
One can check if the “unsouped” (ie, post-SoupX) matrix retain some soup ?. How would you do it ?
Save the
sobj_presoupxR object on disk.Save both the
sobj_presoupxandsobj_postsoupxobjects in a single archive on disk.Same as 1. but use the
bzip2compression algorithm
9 Rsession
For reproducibility and context, it is recommended to include in your RMarkdown the list of loaded packages and their version.
Show output
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.4.1
[3] later_1.4.2 tibble_3.2.1
[5] R.oo_1.27.0 polyclip_1.10-7
[7] fastDummies_1.7.5 lifecycle_1.0.4
[9] edgeR_4.2.2 globals_0.18.0
[11] lattice_0.22-6 MASS_7.3-65
[13] magrittr_2.0.3 limma_3.60.6
[15] plotly_4.10.4 sass_0.4.10
[17] rmarkdown_2.29 jquerylib_0.1.4
[19] yaml_2.3.10 metapod_1.12.0
[21] httpuv_1.6.15 Seurat_5.3.0
[23] sctransform_0.4.2 spam_2.11-1
[25] sp_2.2-0 spatstat.sparse_3.1-0
[27] reticulate_1.42.0 cowplot_1.1.3
[29] pbapply_1.7-2 RColorBrewer_1.1-3
[31] abind_1.4-8 zlibbioc_1.50.0
[33] Rtsne_0.17 GenomicRanges_1.56.2
[35] purrr_1.0.4 R.utils_2.13.0
[37] BiocGenerics_0.50.0 GenomeInfoDbData_1.2.12
[39] IRanges_2.38.1 S4Vectors_0.42.1
[41] ggrepel_0.9.6 irlba_2.3.5.1
[43] listenv_0.9.1 spatstat.utils_3.1-4
[45] goftest_1.2-3 RSpectra_0.16-2
[47] spatstat.random_3.4-1 dqrng_0.4.1
[49] fitdistrplus_1.2-2 parallelly_1.45.0
[51] DelayedMatrixStats_1.26.0 codetools_0.2-20
[53] DropletUtils_1.24.0 DelayedArray_0.30.1
[55] scuttle_1.14.0 tidyselect_1.2.1
[57] UCSC.utils_1.0.0 farver_2.1.2
[59] ScaledMatrix_1.12.0 SoupX_1.6.2
[61] matrixStats_1.5.0 stats4_4.4.1
[63] rmdformats_1.0.4 spatstat.explore_3.4-3
[65] jsonlite_2.0.0 BiocNeighbors_1.22.0
[67] progressr_0.15.1 ggridges_0.5.6
[69] survival_3.7-0 tools_4.4.1
[71] ica_1.0-3 Rcpp_1.0.14
[73] glue_1.8.0 gridExtra_2.3
[75] SparseArray_1.4.8 xfun_0.52
[77] MatrixGenerics_1.16.0 GenomeInfoDb_1.40.1
[79] dplyr_1.1.4 HDF5Array_1.32.0
[81] withr_3.0.2 fastmap_1.2.0
[83] bluster_1.14.0 rhdf5filters_1.16.0
[85] rsvd_1.0.5 digest_0.6.37
[87] R6_2.6.1 mime_0.13
[89] scattermore_1.2 tensor_1.5
[91] dichromat_2.0-0.1 spatstat.data_3.1-6
[93] R.methodsS3_1.8.2 tidyr_1.3.1
[95] generics_0.1.4 data.table_1.17.4
[97] httr_1.4.7 htmlwidgets_1.6.4
[99] S4Arrays_1.4.1 uwot_0.2.3
[101] pkgconfig_2.0.3 gtable_0.3.6
[103] lmtest_0.9-40 SingleCellExperiment_1.26.0
[105] XVector_0.44.0 htmltools_0.5.8.1
[107] dotCall64_1.2 bookdown_0.39
[109] SeuratObject_5.1.0 scales_1.4.0
[111] Biobase_2.64.0 png_0.1-8
[113] spatstat.univar_3.1-3 scran_1.32.0
[115] knitr_1.50 rstudioapi_0.17.1
[117] reshape2_1.4.4 nlme_3.1-165
[119] cachem_1.1.0 zoo_1.8-14
[121] rhdf5_2.48.0 stringr_1.5.1
[123] KernSmooth_2.23-24 parallel_4.4.1
[125] miniUI_0.1.2 pillar_1.10.2
[127] grid_4.4.1 vctrs_0.6.5
[129] RANN_2.6.2 promises_1.3.2
[131] BiocSingular_1.20.0 beachmat_2.20.0
[133] xtable_1.8-4 cluster_2.1.6
[135] evaluate_1.0.3 cli_3.6.5
[137] locfit_1.5-9.9 compiler_4.4.1
[139] rlang_1.1.6 crayon_1.5.3
[141] future.apply_1.11.3 labeling_0.4.3
[143] plyr_1.8.9 stringi_1.8.7
[145] viridisLite_0.4.2 deldir_2.0-4
[147] BiocParallel_1.38.0 lazyeval_0.2.2
[149] spatstat.geom_3.4-1 Matrix_1.7-3
[151] RcppHNSW_0.6.0 patchwork_1.3.0
[153] sparseMatrixStats_1.16.0 future_1.49.0
[155] ggplot2_3.5.2 Rhdf5lib_1.26.0
[157] statmod_1.5.0 shiny_1.10.0
[159] SummarizedExperiment_1.34.0 ROCR_1.0-11
[161] igraph_2.1.4 bslib_0.9.0