---
title: "
EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (I)
-
Load a count matrix, empty droplets & ambient RNA filtering"
date: "2025-16-21.22"
author:
  - name: "EB3I n1 scRNAseq Team"
  - name: "Bastien JOB"
    email: "bastien.job@gustaveroussy.fr"
  - name: "William JARASSIER"
    email: "w.jarassier@gmail.com"
output:
  rmdformats::readthedown:
    fig_width: 8
    fig_height: 6
    highlight: tango ## Theme for the code chunks
    embed_fonts: TRUE
    number_sections: true ## Adds number to headers (sections)
    theme: flatly ## CSS theme for the HTML page
    collapsed: true ## By default, the TOC is folded
    toc_depth: 3
    smooth_scroll: true ## Smooth scroll of the HTML page
    self_contained: true ## Includes all plots/images within the HTML
    code_download: true ## Adds a button to download the Rmd
    code_folding: show
    thumbnails: false
    lightbox: true
    fig_caption: false
    gallery: true
    use_bookdown: true
always_allow_html: true ## Allow plain HTML code in the Rmd
editor_options:
  markdown:
    wrap: 72
---
```{r knit_setup, echo = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,        # Show the code in the report
  eval = TRUE,        # Run the code chunks
  message = FALSE,    # Do not print messages
  prompt = FALSE,     # Do not display the R prompt
  comment = NA,       # No prefix character before output lines
  warning = FALSE,    # Do not display warnings
  tidy = FALSE,
  fig.align = "center",
  # results = 'hide',
  width = 100         # Number of characters per line
)
```
```{css, echo=FALSE}
.notrun {
background-color: lightgrey !important;
border: 3px solid black !important;
}
.notruno {
background-color: lightgrey !important;
color : black !important;
}
.question {
background-color: aquamarine !important;
color : black !important;
border: 3px solid limegreen !important;
}
.questiono {
background-color: aquamarine !important;
color : black !important;
}
.answer {
background-color: navajowhite !important;
border: 3px solid brown !important;
}
.answero {
background-color: navajowhite !important;
color : black !important;
}
.beyond {
background-color: violet !important;
border: 3px solid purple !important;
}
.beyondo {
background-color: violet !important;
color : black !important;
}
```
```{r knit_hook, echo = FALSE}
hooks = knitr::knit_hooks$get()
hook_foldable = function(type) {
  force(type)
  function(x, options) {
    res = hooks[[type]](x, options)
    if (isFALSE(options[[paste0("fold.", type)]])) return(res)
    ## Wrap the output/plot in a foldable HTML <details> block
    paste0(
      "<details><summary>Show ", type, "</summary>\n\n",
      res,
      "\n\n</details>"
    )
  }
}
knitr::knit_hooks$set(
  output = hook_foldable("output"),
  plot = hook_foldable("plot")
)
```
------------------------------------------------------------------------
------------------------------------------------------------------------

------------------------------------------------------------------------
------------------------------------------------------------------------
# PREAMBLE
## Purpose of this session
This file describes the different steps to perform the **first part** of
the **single cell RNAseq** data analysis training course for the **EB3I
n1 2025**, covering these steps :
- **Load** a single cell raw counts **matrix**
- **Remove** barcodes from **empty droplets**
- **Estimate** and remove **ambient RNA** contamination *("soup")*
## An important notice

- These early (but mandatory) steps of the analysis are not covered by
  the renowned, higher-level analysis frameworks such as **Seurat** or
  **scran/scater**.
- Thus, we will write/use a certain amount of **lines of code**, some
  of which may be complex for R beginners.
- **Newcomers, do not fear !**
  - You are **welcome to try** writing some of it by yourself,
    depending on your R knowledge and your self-confidence in it ...
  - ... but you do not have to know [**how**]{.underline} this code
    works ...
  - ... so you **should not feel ashamed** to copy-paste the
    pre-written code hereafter ...
  - ... in both cases, please try **your best to understand**
    [**what**]{.underline} this code does, and even more,
    [**why**]{.underline} !
  - To this end, you may use whatever **help** you can find **by
    searching**.
## What to run ?
- This training presentation, written in Rmarkdown, contains **many**
chunks ( = code blocks)
- You **do not have to** run them all !
- Chunks follow a **color code** :
### Chunks you **ARE** expected to run :
```{r chunk_run}
# chunk_run
## Example of chunk to run
getwd()
```
### Chunks you **ARE NOT** expected to run
We will demo them. You may run them by yourself, young padawan, but if
you fall behind or, worse, break something, *you*'ll be the one to
blame !
```{r chunk_demo, class.source="notrun", class.output="notruno"}
# chunk_demo
## Example of chunk to NOT run
utils::sessionInfo()
```
### Questions
Here are some questions we would like you to answer during the training
course. Please play fair and do not look at the answers before we tell
you to :)
```{r chunk_question, class.source="question", eval = FALSE}
# chunk_question
## What's the Answer to Life ? The Universe ? Everything ??
```
### Answers
They usually follow questions...
```{r chunk_answer, class.source = c("fold-hide", "answer"), class.output="answero"}
# chunk_answer
cat("The Answer to the Great Question... is ...
Forty-two. Six by nine : forty-two. That's it.
That's all there is. (I always thought something was
fundamentally wrong with the universe.)")
```
### Additional exercises / questions
These are beyond the scope of the training course, but you may play it
*if you feel bored*
```{r chunk_beyond, class.source="beyond", class.output="beyondo"}
# chunk_beyond
## Example of chunk to play with but beyond the scope
getwd()
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Start RStudio
- Using the [OpenOnDemand/Rstudio cheat
sheet](https://moodle.france-bioinformatique.fr/pluginfile.php/1475/mod_folder/content/0/OoD_R_Rstudio.html){target="_blank"},
connect to the [OpenOnDemand
portal](https://ondemand.cluster.france-bioinformatique.fr){target="_blank"}
and **create an RStudio session** with the right resource
requirements.
------------------------------------------------------------------------
------------------------------------------------------------------------
# Warm-up
- We now set **common parameters** as new variables, once and for all
for this session :
```{r setparam}
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))
## Seed for the RNG
my_seed <- 1337L
## Empty droplets max p-value
max_p <- 1E-03
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Prepare the data structure
- During the current and remaining training sessions, we will work on
different **data** **objects**, some retrieved from **external**
resources, some we will **create**.
- In order to reuse these for further steps, or other sessions, we
need to **store** them, thus create a **directory structure** on the
cluster.
- We will create this structure in **your project** folder (the one
you registered at the IFB Core, in which you stored your own data
for the tutoring).
- To ensure that *you* will always use the same structure for *your*
  work, the code hereafter has been written so that you just have to
  define a single **variable** that corresponds to your project name.
- In *MY* case, as a SC training teacher, it is
**`ebaii_sc_teachers`**.
## Main directory
- Create the training course directory
```{r maindir, fold.output = FALSE}
# maindir
## Prepare the path in a "TD_dir" variable
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Create the root directory
dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)
```
## Current session
```{r sessiondir, fold.output = FALSE}
# sessiondir
## Create the session (Preproc.1) directory
session_dir <- paste0(TD_dir, "/01_Preproc.1")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)
```
## Input data directory
```{r indir, fold.output=FALSE}
# indir
## Create the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)
```
## Output results directory
```{r outdir, fold.output=FALSE}
# outdir
## Create the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)
## Print the output directory on-screen
print(output_dir)
```
[**Note :**]{.underline} *Actually, for this session we won't export any
result, so creating this **`RESULTS`** folder was just for practice
purposes.*
Now, we have everything needed to :
- **load** data (`input_dir`)
- **save** the results we will generate (`output_dir`).
------------------------------------------------------------------------
------------------------------------------------------------------------
# Additional resources
**However**, some upcoming steps require methods that are a bit
complicated, and not fun to write nor to copy-paste multiple times. So,
we will now prepare **functions** that will ease these steps. Once the
code defining these functions has been executed, we may call (invoke)
them whenever needed.
## Functions
### SoupX helper function
This function was written to ease the use of SoupX on a count matrix.
It was tested on `SoupX`=1.6.2 with `igraph`=1.5.1, `Seurat`=4.4.0/5.1.0
**Parameters** :
- `scmat_filt` : (sparse) matrix corresponding to an empty-droplet-filtered count matrix
- `scmat_raw` : (sparse) matrix corresponding to a NON-empty-droplet-filtered (raw) count matrix
- `soupQuantile` : see `?SoupX::autoEstCont`
- `contaminationRange` : range of the *expected* soup proportion (see `?SoupX::autoEstCont`)
- `soupRange` : estimate the soup fraction from features whose total counts fall
  within this range (only used when `scmat_raw` is not `NULL`)
- `return_object` : if `TRUE`, return the "unsouped" count matrix ; if
  `FALSE`, only perform the soup estimation and return the estimated
  proportion value (rho)
- `doPlot` : draw the SoupX estimation plot (rho distribution)
```{r soupx_func}
# soupx_func
SoupX_auto <- function(scmat_filt = NULL, scmat_raw = NULL, soupQuantile = 0.9,
contaminationRange = c(.01, .8), soupRange = c(0,100),
return_object = FALSE, doPlot = FALSE) {
## Checks
if(is.null(scmat_filt)) stop('A filtered count matrix is required !')
if(is.null(scmat_raw)) message('No unfiltered raw counts matrix provided. Estimation will be based on filtered matrix only.')
## If no raw matrix
if (is.null(scmat_raw)) {
spChanRaw <- SoupX::SoupChannel(
tod = scmat_filt,
toc = scmat_filt,
calcSoupProfile = FALSE)
sc_rowsum <- sparseMatrixStats::rowSums2(scmat_filt)
spProf <- data.frame(
row.names = rownames(scmat_filt),
est = sc_rowsum/sum(scmat_filt),
counts = sc_rowsum)
spChan <- SoupX::setSoupProfile(spChanRaw, spProf)
} else {
spChan <- SoupX::SoupChannel(
tod = scmat_raw,
toc = scmat_filt,
calcSoupProfile = FALSE)
if (min(spChan$nDropUMIs) > max(soupRange)) stop(
'Minimum found counts per barcode is : ',
min(spChan$nDropUMIs),
', which is larger than the upper bound of soupRange ! Please increase soupRange max !')
spChan <- SoupX::estimateSoup(sc = spChan, soupRange = soupRange)
}
## Display Top 20 contributing genes
if (!return_object) {
cat('\nSoup-contributing features (Top 20) :\n')
print(knitr::kable(head(
spChan$soupProfile[order(spChan$soupProfile$est, decreasing = TRUE), ],
n = 20)))
}
## Quick clustering needed
spClust <- scran::quickCluster(scmat_filt, method = "igraph")
## Adding clusters to the SoupChannel object
spChan <- SoupX::setClusters(sc = spChan, clusters = spClust)
## Estimating soup
sX <- SoupX::autoEstCont(sc = spChan, doPlot = doPlot, tfidfMin = 1,
soupQuantile = soupQuantile, maxMarkers = 100,
contaminationRange = contaminationRange,
rhoMaxFDR = .2, priorRho = .05, priorRhoStdDev = .1,
forceAccept = FALSE)
## Removing soup (adjusting counts)
if(return_object) {
cat('Counts BEFORE SoupX : ', sum(scmat_filt), '\n')
scmat_soupx <- SoupX::adjustCounts(
sX, method = 'subtraction', roundToInt = TRUE,
tol = .001, pCut = .01)
cat('Counts AFTER SoupX : ', sum(scmat_soupx), '\n')
rm(scmat_filt)
return(scmat_soupx)
} else return(sX$fit$rhoEst)
}
```
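As a quick illustration (a sketch only, not meant to be run here), this is how the helper can be invoked once the count matrices exist later in this session ; `scmat_cells` (the empty-droplet-filtered matrix) and `scmat` (the raw matrix) are the objects we will create below :
```{r soupx_func_usage, class.source="notrun", class.output="notruno", eval = FALSE}
# soupx_func_usage
## Estimation-only mode : returns the estimated soup proportion (rho)
rho_sketch <- SoupX_auto(scmat_filt = scmat_cells, scmat_raw = scmat)
## Removal mode : returns the soup-corrected ("unsouped") count matrix
scmat_clean_sketch <- SoupX_auto(scmat_filt = scmat_cells,
                                 scmat_raw = scmat,
                                 return_object = TRUE)
```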
------------------------------------------------------------------------
------------------------------------------------------------------------
# Load a 10X Cell Ranger raw count matrix
## The dataset
- The data we will work on is hosted on
[Zenodo](https://zenodo.org){target="_blank"}. To ease its
retrieval, we can define its ID as a variable
```{r zid}
# zid
## This is actually to help me update this training without hassle
zen_id <- '14034221'
## This is the path to the current EB3I backup
sessionid <- '2538_eb3i_n1_2025'
```
- Here, we will train ourselves to load into R single cell RNAseq
  data produced by 10X Genomics' software **`Cell Ranger`**.
- We will work with a public dataset provided by the manufacturer,
  consisting of \~ **10,000 PBMC** (peripheral blood mononuclear cells)
  from a human donor (downsampled to 1/10th, for smaller computation time).
- The experiment was performed with the **3' capture kit v3**.
- The analysis was performed with **`Cell Ranger v3`**, mapping on the
  **`GRCh38-2020-A`** manufacturer reference.
- The data consists of the typical 10X 3-file structure, hosted in
  a Zenodo repository (Id :
  [`r zen_id`](https://zenodo.org/records/`r zen_id` "Zenodo Preproc.1"){target="_blank"})
- `barcodes.tsv.gz` : the list of putative droplet **barcodes**
- `features.tsv.gz` : the list of **features** (genes)
- `matrix.mtx.gz` : the **feature** x **barcode** expression count
**matrix**
## Download data
- The relatively complex code chunk hereafter has been written to :
  - let you download the data files from :
    - Zenodo, by default
    - a local backup instead, if `backup = TRUE` (in case we lose the
      internet connection)
  - if the requested data are already available locally :
    - load them from the local source
    - unless `force = TRUE`, in which case a new download is forced
```{r dlzen10x, message = TRUE, echo = TRUE}
# dlzen10x
### Named files (will be used later on !)
mtx_file <- "matrix.mtx.gz"
features_file <- "features.tsv.gz"
barcodes_file <- "barcodes.tsv.gz"
summary_file <- "pbmc_10k_v3_summary.html"
## Filename(s) to retrieve
toget_files <- c(mtx_file,
features_file,
barcodes_file,
summary_file)
## Folder to store retrieved files
local_folder <- input_dir
## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")
## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")
### Define remote folder
remote_folder <- if (backup) paste0("/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/10X") else paste0("https://zenodo.org/records/", zen_id, "/files")
### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)
### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)
## Retrieve files (if they don't exist), in loop
for (tg in seq_along(toget_files)) {
## If the file does not locally exist
if (!file.exists(local_path[tg]) | force) {
## Retrieve data
if(backup) {
file.copy(from = remote_path[tg],
to = local_path[tg])
} else {
download.file(url = remote_path[tg],
destfile = local_path[tg])
}
## Check if downloaded files exist locally
if(file.exists(local_path[tg])) message("\tOK")
} else message(paste0(toget_files[tg], " already downloaded !"))
}
```
## Load into R
- **Load** the **10X** **data** files into R :
```{r q_r10X, class.source="question", eval = FALSE}
# q_r10X
How ?
```
- A good starter
```{r a_r10X1, class.source = c("fold-hide", "answer"), echo = FALSE}
# a_r10X1
cat('https://lmddgtfy.net/?q=load%2010x%20data%20in%20R')
```
- Back to your local help resource
```{r a_r10X2, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_r10X2
## Reading the function help page
?Seurat::Read10X
```
We can now apply it to our data :
```{r load10x, class = "fold-hide"}
# load10x
## Loading unfiltered 10X matrix
scmat <- Seurat::Read10X(data.dir = input_dir)
```
- However, we don't know what the loaded object actually looks like in
R...
- **Questions** :
- Do you know a way to ...
```{r q_whatis, class.source="question", eval = FALSE}
# q_whatis
... know what IS the type of object that was created ?
```
```{r a_whatis, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_whatis
## To know the type of an R object : methods::is()
?methods::is
```
```{r gois, class.source = "fold-hide"}
# gois
## Get the type
methods::is(scmat)
```
```{r q_struc, class.source="question", eval = FALSE}
# q_struc
... observe the STRucture of the resulting object ?
```
```{r a_struc, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_struc
## To get a basic structure description : utils::str()
?utils::str
```
```{r go_str, class.source = "fold-hide"}
# go_str
## Get the basic structure
utils::str(scmat)
```
Not that meaningful, though... At least we know we have a matrix, so we
can get its dimensions !
```{r scmat_dim}
# scmat_dim
## Dimensions of a >1D object
dim(scmat)
```
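To get a feel for the sparse storage (a `dgCMatrix`), we can also peek at a small corner of the matrix ; zero counts are displayed as dots. This is just a convenience check, feel free to skip it :
```{r scmat_peek, class.source="notrun", class.output="notruno"}
# scmat_peek
## Display the first 5 features x first 3 barcodes (zeros appear as ".")
scmat[1:5, 1:3]
```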
## Features cleanup
`Seurat` does not like **underscores** (`_`) in feature names at all ...
We will have to :
- **control** whether our dataset contains some. Here, we will use the `base::grep()` function, which looks in character strings `x` for a provided `pattern`, and returns the matching positions (when `value = FALSE`, the default) or the matching values themselves (`value = TRUE`)
```{r us_check1}
# us_check1
## Check feature names with underscore(s)
base::grep(pattern = '_',
x = rownames(scmat),
value = TRUE)
```
- if so, **clean** it ! Here we will use the `base::gsub()` function that looks for a `pattern` in a character string `x` and substitutes it with a `replacement` character string.
```{r us_clean}
# us_clean
## Clean features : replacing underscores by dashes
rownames(scmat) <- base::gsub(
pattern = "_",
replacement = "-",
x = rownames(scmat))
```
- then, **control** the efficiency of our cleaning (with a new use of `base::grep`)
```{r us_check2, class.source="notrun", class.output="notruno"}
# us_check2
## Post-cleanup control
base::grep(pattern = "_",
x = rownames(scmat),
value = TRUE)
```
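For R beginners, here is a throwaway toy example (unrelated to our data, with made-up feature names) of what `base::gsub()` does :
```{r gsub_toy, class.source="notrun", class.output="notruno", eval = FALSE}
# gsub_toy
## Toy example : gsub() replaces EVERY occurrence of the pattern
base::gsub(pattern = "_", replacement = "-",
           x = c("GENE_A", "GENE_B_LONG", "NOCHANGE"))
## Expected result : "GENE-A" "GENE-B-LONG" "NOCHANGE"
```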
------------------------------------------------------------------------
------------------------------------------------------------------------
# Empty droplets
- While a large fraction of our barcodes contain **no count** at all,
  barcodes with at least one count do **not** necessarily correspond to
  **true cells**, due to ambient RNA.
- Here we want to **identify** empty droplets, then **remove** them
  from the matrix.
## Detection
- To detect the boundary between "true" cells and empty droplets, we will
  rely on the so-called **`double-kneeplot`** : a plot of **UMI counts per
  barcode**, as a function of their **rank** (barcodes ordered by
  decreasing total counts).

- To draw such a plot, we need to **rank the barcodes** by their
  total counts.
### Ranking barcodes
- We will use the
[**DropletUtils**](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html){target="_blank"} package
```{r h_barcodeRanks, eval = FALSE}
# h_barcodeRanks
## Read the function help page
?DropletUtils::barcodeRanks()
```
```{r bcranks}
# bcranks
## Generate the rank statistics
bc_rank <- DropletUtils::barcodeRanks(scmat)
```
- Check the ranking of barcodes, according to their counts
```{r bcrank_top, class.source="notrun", class.output="notruno", fold.output = FALSE}
# bcrank_top
## Show a summary of the ranks (ordered for display convenience)
print(bc_rank[order(bc_rank$rank),])
```
### Kneeplot
Now, we can **draw** our kneeplot :
```{r kneeplot1, class.source="notrun", class.output="notruno", fold.plot = FALSE}
# kneeplot1
## Plot
graphics::plot(x = bc_rank$rank,
y = bc_rank$total + 1,
log = "xy",
xlab = "Barcode rank",
ylab = "Total UMIs",
col = "black",
pch = ".",
cex = 5,
main = "Kneeplot")
```
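As an optional aside (a sketch, to be run right after the plot above), `barcodeRanks()` also stores its estimated **knee** and **inflection** points in the metadata of its result ; they can be overlaid on the kneeplot :
```{r kneeplot_lines, class.source="notrun", class.output="notruno", eval = FALSE}
# kneeplot_lines
## Overlay the knee and inflection points estimated by barcodeRanks()
graphics::abline(h = S4Vectors::metadata(bc_rank)$knee,
                 col = "dodgerblue", lty = 2)
graphics::abline(h = S4Vectors::metadata(bc_rank)$inflection,
                 col = "forestgreen", lty = 2)
graphics::legend("bottomleft", legend = c("knee", "inflection"),
                 col = c("dodgerblue", "forestgreen"), lty = 2)
```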
- **Question** :
```{r q_knee, class.source="question", eval = FALSE}
# q_knee
Looking at the kneeplot, can you roughly predict how many barcodes will be
kept as cells (ie, as non-empty droplets) ?
```
```{r a_knee, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_knee
## . The "cliff" in the kneeplot is at a
## rank value a bit below ~ 1,000.
```
Then, we can use these ranks to **identify "true" cells** :
```{r h_emptyDrops, class.source="notrun", class.output="notruno", eval = FALSE}
# h_emptyDrops
## Read the function help page
?DropletUtils::emptyDrops
```
The function we will use relies on random number generation, so we fix the RNG seed for reproducibility :
```{r emptydrops1}
# emptydrops1
## Set the RNG seed
set.seed(my_seed)
## Identify empty droplets
bc_rank2 <- DropletUtils::emptyDrops(scmat)
## What is the structure of the returned output ?
str(bc_rank2)
```
We can show a summary of the ordered barcodes :
```{r emptysmry, class.source="notrun", class.output="notruno"}
# emptysmry
## Show a summary of the qualified ranks (ordered for display convenience)
print(bc_rank2[order(bc_rank2$Total, decreasing = TRUE),])
```
### Selection
Barcodes considered as empty droplets have **no p-value** (`FDR` = NA)
```{r bcr2_na}
# bcr2_na
## Creating a validation column (empty droplets have NA as p-value)
bc_rank2$VALID <- !is.na(bc_rank2$FDR)
## Quantify "real cells"
table(bc_rank2$VALID)
```
We can **control the stringency** by setting a FDR-adjusted p-value
**threshold**
```{r valid_fdr}
# valid_fdr
## Recall the max p-value we set as a variable early on
max_p
## We've set the max p-value to 1E-03
bc_rank2$VALID[bc_rank2$FDR >= max_p] <- FALSE
## Control
table(bc_rank2$VALID)
```
Now, we can **display** our **true cells** on the kneeplot :
```{r kneeplot2, fig.align="center"}
# kneeplot2
## Plot with highlight of the selected barcodes (red)
graphics::plot(x = bc_rank$rank,
y = bc_rank$total+1,
log = "xy",
xlab = "Barcode rank",
ylab = "Total UMIs",
col = ifelse(bc_rank2$VALID, "red", "black"),
pch = ".",
cex = ifelse(bc_rank2$VALID, 7, 3),
main = paste0("Kneeplot (",
length(which(bc_rank2$VALID)),
" barcodes kept)"))
```
- **Question** :
```{r q_kneered, class.source="question", eval = FALSE}
# q_kneered
Is there something unexpected for the retained "TRUE" cells (red) in this kneeplot ?
```
```{r a_kneered, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_kneered
## . You may observe that the selection of "true" cells (red)
## does not strictly correspond to a "cut" in the kneeplot
## descending curve.
##
## . This is because emptyDrops does not only "follow" the curve
## to determine a threshold corresponding to a "minimal count"
## value to consider a barcode as a cell, but also performs a
## statistical analysis for each barcode, considering how much
## its expression profile resembles the one of other cells,
## even with a very different (lower) global level of expression.
```
## Removal
We just need to restrict our count matrix to "true" cells.
- How many barcodes do we have at the moment ?
```{r scdim3, class.source="notrun", class.output="notruno"}
# scdim3
dim(scmat)
```
- Filter empty droplets out :
```{r edfilter}
# edfilter
## Restrict to valid barcodes
scmat_cells <- scmat[, bc_rank2$VALID]
## Control
dim(scmat_cells)
```
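As an optional aside, we can also check which fraction of the total UMIs is retained in the kept barcodes : most counts usually belong to true cells, even though most barcodes are empty droplets.
```{r edfrac, class.source="notrun", class.output="notruno"}
# edfrac
## Number of barcodes removed as empty droplets
ncol(scmat) - ncol(scmat_cells)
## Fraction of the total UMIs retained in the kept barcodes
base::sum(scmat_cells) / base::sum(scmat)
```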
------------------------------------------------------------------------
------------------------------------------------------------------------
# Ambient RNA
- The counts measured in this matrix of (now considered "true") single
  cells do **not** reflect the expression of these cells **only**.
- Due to other **over-lysed / dying cells** in the emulsion, we also
  measured a fraction of **ambient RNA** that adds up to the "real"
  expression.
- We may **estimate** it by observing counts present at a **minimal
  level** across cells that greatly differ in their expression profile.
- Just in case, because of the fun name of the tool we will use, `SoupX`,
  I'll write down "soup" as an alias for "ambient RNA" ; both terms have
  the same meaning.
## Estimate the "soup"
[**SoupX**](https://cran.r-project.org/web/packages/SoupX/readme/README.html){target="_blank"}
can estimate the rate and composition of ambient RNA, in several ways :
- **Manual mode** : providing a list of features expected to reflect
  the soup (genes expressed at high levels in a majority of the
  expected cell types) ; see the sketch after this list
- **Automatic modes**, using the empty droplets-**filtered** raw count
matrix ...
- ... along **with the unfiltered** matrix (ie, containing all
quantified droplets). This is the most efficient mode.
- ... **solely** : this is less efficient, but useful when the
unfiltered matrix is not available.
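For the record, here is a rough, hedged sketch of what the **manual mode** mentioned above could look like with the `SoupX` API directly. The gene list is purely hypothetical (adapt it to your sample), and this is not meant to be run here :
```{r soupx_manual_sketch, class.source="beyond", class.output="beyondo", eval = FALSE}
# soupx_manual_sketch
## Build a SoupChannel from the raw (tod) and filtered (toc) matrices
sc_chan <- SoupX::SoupChannel(tod = scmat, toc = scmat_cells)
## Hypothetical list of soup-tracing genes (to be adapted to your sample !)
soup_genes <- list(IG = c("IGKC", "IGHM"))
## Cells NOT expected to express these genes are used to estimate the soup
use_to_est <- SoupX::estimateNonExpressingCells(sc_chan,
                                                nonExpressedGeneList = soup_genes)
sc_chan <- SoupX::calculateContaminationFraction(sc_chan, soup_genes,
                                                 useToEst = use_to_est)
## Subtract the estimated soup
scmat_manual <- SoupX::adjustCounts(sc_chan)
```
Note that, ideally, clusters should be set beforehand (via `SoupX::setClusters()`, as done in our helper above) for a more robust estimation and correction.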
```{r h_SoupX_auto, eval = FALSE, class.source="notrun", class.output="notruno"}
# h_SoupX_auto
## Reading the SoupX package main help page
?SoupX::SoupX
```
Let's run SoupX, just asking for an estimation of the "soup" proportion
:
```{r soupxrhoE}
# soupxrhoE
## TIP : In case of error :
# install.packages("irlba", type="source", force=TRUE)
## SoupX for ambient fraction estimation
soup_frac <- SoupX_auto(
scmat_filt = scmat_cells,
scmat_raw = scmat)
```
***NOTE*** : We can observe
[**MALAT1**](https://www.biorxiv.org/content/10.1101/2024.07.14.603469v2){target="_blank"} as the top contributing gene (along with multiple ribosomal protein-coding genes).
What's the estimated fraction of soup in our counts ?
```{r soupxfrac}
# soupxfrac
cat("Soup fraction : ", soup_frac)
```
- **Question** :
```{r q_souprate, class.source="question", eval = FALSE}
# q_souprate
Should we try to remove such an amount of ambient RNA ?
```
```{r a_souprate, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_souprate
The best way to know is to try it out =D
```
## Remove the "soup"
SoupX can **remove** the soup (subtract counts of the identified soup genes
from all barcodes/cells) using different methods :
- **Automatically** (safer)
- Specifying a **fraction of counts** to remove (sketched after this list).
  - *Pros* :
    - By removing X % of reads :
      - if soup was actually present (at a rate \<= X), this
        will **mostly affect soup** genes first.
      - If not, it will **decrease counts** globally, thus have a
        negligible effect on differences between cells / cell
        types.
  - *Cons* :
    - **Counts are** already **rare** in droplet-based single cell
      technologies... Removing them "by default" makes them **even
      rarer**...
    - If the actual soup **rate is higher** than the defined rate X to
      remove, this mode will remain **ineffective**...
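For completeness, here is a hedged sketch of the "fixed fraction" approach with the raw `SoupX` API, assuming a `SoupChannel` object named `sc_chan` (e.g. built as in the manual-mode sketch above) ; not meant to be run here :
```{r soupx_fixed_sketch, class.source="beyond", class.output="beyondo", eval = FALSE}
# soupx_fixed_sketch
## Force the contamination fraction to a chosen value (here, 5 %)
sc_chan <- SoupX::setContaminationFraction(sc_chan, contFrac = 0.05)
## Subtract that fixed fraction of soup from the counts
scmat_fixed <- SoupX::adjustCounts(sc_chan, roundToInt = TRUE)
```
Below, we instead use the **automatic** mode through our `SoupX_auto()` helper :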
```{r soupxrm}
# soupxrm
## Run SoupX in "removal" mode
scmat_unsoup <- SoupX_auto(
scmat_filt = scmat_cells,
scmat_raw = scmat,
return_object = TRUE)
```
## Effect of the "soup" removal
- Now, we want to **characterize the effect** of the removal of the
  ambient RNA, by comparing the pre- and post-SoupX matrices.
- An easy way is to **describe**, then **visualize**, such an effect.
### Describe and visualize {.tabset .tabset-fade .tabset-pills}
#### Before SoupX (DEMO)
We can describe our count matrix using some useful metrics :
- Total counts
```{r descBS_totcount, class.source="notrun", class.output="notruno"}
# descBS_totcount
## All counts
base::sum(scmat_cells)
```
- Sparsity (fraction of `0`s)
```{r descBS_sparsity, class.source="notrun", class.output="notruno"}
# descBS_sparsity
## Compute sparsity (number of zero values / product of the matrix dimensions)
base::sum(sparseMatrixStats::colCounts(
x = scmat_cells, value = 0)
) / base::prod(dim(scmat_cells))
```
- Distribution of counts in cells
```{r descBS_nCount, class.source="notrun", class.output="notruno"}
# descBS_nCount
## Sums up counts for each cell
base::summary(sparseMatrixStats::colSums2(x = scmat_cells, na.rm = TRUE))
```
- Distribution of expressed genes in cells
```{r descBS_nFeature, class.source="notrun", class.output="notruno"}
# descBS_nFeature
## All genes minus 0-count genes
base::summary(nrow(scmat_cells) - sparseMatrixStats::colCounts(
x = scmat_cells,
value = 0,
na.rm = TRUE))
```
#### After SoupX
We can describe our count matrix using some useful metrics :
- Total counts
```{r descAS_totcount}
# descAS_totcount
## All counts
base::sum(scmat_unsoup)
```
- Sparsity (fraction of `0`s)
```{r descAS_sparsity}
# descAS_sparsity
## Compute sparsity (number of zero values / product of the matrix dimensions)
base::sum(sparseMatrixStats::colCounts(
x = scmat_unsoup, value = 0)
) / base::prod(dim(scmat_unsoup))
```
- Distribution of counts in cells
```{r descAS_nCount}
# descAS_nCount
## Sums up counts for each cell
base::summary(sparseMatrixStats::colSums2(x = scmat_unsoup, na.rm = TRUE))
```
- Distribution of expressed genes in cells
```{r descAS_nFeature}
# descAS_nFeature
## All genes minus 0-count genes
base::summary(nrow(scmat_unsoup) - sparseMatrixStats::colCounts(
x = scmat_unsoup,
value = 0,
na.rm = TRUE))
```
## {.unnumbered}
------------------------------------------------------------------------
------------------------------------------------------------------------
- **Questions** :
```{r q_descdiff, class.source="question", eval = FALSE}
# q_descdiff
Can you describe the difference (especially for counts and sparsity) ?
Was it expected ?
```
```{r a_descdiff, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_descdiff
## . Total counts, max value, counts per cell and number of
## expressed features per cell, all have decreased.
##
## . The sparsity level has slightly increased (removing the soup
## obliterated all counts for some features in some cells).
##
## . All of this is absolutely expected.
```
What are the features most affected by the soup removal ?
```{r soupfeatures, class.source="notrun", class.output="notruno"}
# soupfeatures
## Compute the ratio matrix (PRE / POST)
## NOTE : we may have 0 counts in the divisor,
## so we add +1 to both matrices !
cont_frac <- (scmat_cells + 1) / (scmat_unsoup + 1)
## Mean PRE/POST ratio per feature (higher = more affected by soup removal), sorted
feat_frac <- sort(sparseMatrixStats::rowMeans2(cont_frac), decreasing = TRUE)
### Display Top 5 features
utils::head(feat_frac, n = 5)
## Removing objects to free some RAM
rm(cont_frac, feat_frac)
```
```{r qsoupfeat, class.source="question", eval = FALSE}
# qsoupfeat
Does something about these genes strike you ?
```
Hint : **fraction** or **difference** ?
```{r soupfeaturesdiff, class.source=c("fold-hide", "answer"), class.output="answero"}
# soupfeaturesdiff
## Compute the DIFFERENCE matrix (PRE - POST)
cont_dif <- scmat_cells - scmat_unsoup
## Mean number of counts removed per feature (higher = more affected), sorted
feat_dif <- sort(sparseMatrixStats::rowMeans2(cont_dif), decreasing = TRUE)
### Display Top 5 features
utils::head(feat_dif, n = 5)
## Removing objects to free some RAM
rm(cont_dif, feat_dif)
```
### Plot the "top 1" feature {.tabset .tabset-fade .tabset-pills}
We want to visualize the expression of the "top" soup gene, LYZ, in our dataset.
To do so, we will generate a projection of the data into a reduced (2-D)
mathematical space. This is done using a series of R functions from the
Seurat package that we won't describe here and now : understanding the
commands below is not the purpose of the current session, but it will be
covered during the next few days.
These commands actually perform, in a row, all the steps we will see in the
next few days, but with _default values_ (thus almost certainly inadequate
for our current data).
In consequence, please consider the output only as a quick-and-dirty way to
visualize the data : the depicted topology should be considered very rough
and largely imperfect.
#### Before SoupX
```{r bsoupx, fig.height = 6, fig.width = 6, class.source="notrun", class.output="notruno"}
# bsoupx
## Roll process, using a temporary "sobj_presoupx" Seurat object
sobj_presoupx <- Seurat::CreateSeuratObject(counts = scmat_cells, project = "PBMC10K_BeforeSoupX")
sobj_presoupx <- Seurat::NormalizeData(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::ScaleData(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::FindVariableFeatures(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::RunPCA(object = sobj_presoupx, npcs = 21, verbose = FALSE)
sobj_presoupx <- Seurat::RunUMAP(object = sobj_presoupx, dims = c(1:20), verbose = FALSE)
## Top1 feature plot
fp_pre <- Seurat::FeaturePlot(object = sobj_presoupx, features = 'LYZ') + Seurat::DarkTheme()
print(fp_pre)
```
#### After SoupX
```{r asoupx, fig.height = 6, fig.width = 6}
# asoupx
## Roll process, using a temporary "sobj_postsoupx" Seurat object
sobj_postsoupx <- Seurat::CreateSeuratObject(counts = scmat_unsoup, project = "PBMC10K_AfterSoupX")
sobj_postsoupx <- Seurat::NormalizeData(object = sobj_postsoupx, verbose = FALSE)
sobj_postsoupx <- Seurat::ScaleData(object = sobj_postsoupx, verbose = FALSE)
sobj_postsoupx <- Seurat::FindVariableFeatures(object = sobj_postsoupx, verbose = FALSE)
sobj_postsoupx <- Seurat::RunPCA(object = sobj_postsoupx, npcs = 21, verbose = FALSE)
sobj_postsoupx <- Seurat::RunUMAP(object = sobj_postsoupx, dims = c(1:20), verbose = FALSE)
## Top1 feature plot
fp_post <- Seurat::FeaturePlot(object = sobj_postsoupx, features = 'LYZ') + Seurat::DarkTheme()
print(fp_post)
```
## {.unnumbered}
------------------------------------------------------------------------
------------------------------------------------------------------------
Merging the two plots for easier comparison :
```{r soupx_umaps, fig.width = 12, fig.height = 6, class.source="beyond", class.output="beyondo"}
# soupx_umaps
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
list(
fp_pre & ggplot2::ggtitle(label = "LYZ before SoupX"),
fp_post & ggplot2::ggtitle(label = "LYZ after SoupX")),
nrow = 1)
```
- **Question**
```{r q_umapsoupx, class.source="question", eval = FALSE}
# q_umapsoupx
Can you compare the two plots :
- for the LYZ expression across the 2-D space ?
- about the space topologies ?
- for the LYZ expression globally ?
```
```{r a_umapsoupx, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_umapsoupx
## . The level of expression measured in cells with smaller expression
## before SoupX has gone to almost none after : the ambient expression
## of this gene has efficiently been removed.
##
## . Hopefully, the cluster(s) with higher expression remain(s) high.
##
## . The expression scale upper bound after SoupX is higher than before ?!
## But we REMOVED some counts ?! While being counter-intuitive, this is
## a positive consequence of the ambient removal on the scaling of
## barcodes expression (see later).
##
## . The ranges of both the X and Y axes have increased, depicting a better
##   separation of / distance between some of the clusters.
##
## . The global topology of clusters remains close (but not identical).
##
## . Cluster compactness has increased.
```
## Conclusion
- Removing ambient RNA has a **positive effect** :
  - on the **quality of the expression** level measurements
  - on the observed **topology**
  - ... despite an estimated soup fraction of only ~5% !
- Actually, ambient RNA level and composition are (one of) the major sources
  of the **batch effect bias** that may alter the integration of different
  samples.
------------------------------------------------------------------------
------------------------------------------------------------------------
**Beyond** :
1. One can check whether the "unsouped" (ie, post-SoupX) matrix retains some soup. How would you do it ?
```{r b_soupx2, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_soupx2
## SoupX for ambient fraction estimation
soup2_frac <- SoupX_auto(
scmat_filt = scmat_unsoup,
scmat_raw = scmat)
## Run SoupX in "removal" mode
scmat_unsoup2 <- SoupX_auto(
scmat_filt = scmat_unsoup,
scmat_raw = scmat,
return_object = TRUE)
```
2. Save the `sobj_presoupx` R object on disk.
```{r b_save_answer1, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_save_answer1
## A single object : use RDS format with saveRDS()
saveRDS(object = sobj_presoupx, file = paste0(output_dir, "/sobj_presoupx.RDS"))
```
3. Save both the `sobj_presoupx` and `sobj_postsoupx` objects in a single archive
on disk.
```{r b_save_answer2, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_save_answer2
## Multiple objects : use the Rdata format with save()
save(list = c("sobj_presoupx", "sobj_postsoupx"), file = paste0(output_dir, "/sobjects.RDA"))
```
4. Same as 2., but use the `bzip2` compression algorithm
```{r b_save_answer3, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_save_answer3
## Compress with bzip2
saveRDS(object = sobj_presoupx, file = paste0(output_dir, "/sobj_presoupx_bz2.RDS"), compress = "bzip2")
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Rsession
For reproducibility and context, it is recommended to include in your RMarkdown the list of loaded packages and their version.
```{r rsession, class.source="notrun", class.output="notruno"}
# rsession
utils::sessionInfo()
```