--- title: "
EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (I)
-
Load a count matrix, empty droplets & ambient RNA filtering
" date: "2025-16-21.22" author: - name: "EB3I n1 scRNAseq Team" - name: "Bastien JOB" email: "bastien.job@gustaveroussy.fr" - name: "William JARASSIER" email: "w.jarassier@gmail.com" output: rmdformats::readthedown: fig_width: 8 fig_height: 6 highlight: tango ## Theme for the code chunks embed_fonts: TRUE number_sections: true ## Adds number to headers (sections) theme: flatly ## CSS theme for the HTML page collapsed: true ## By default, the TOC is folded toc_depth: 3 smooth_scroll: true ## Smooth scroll of the HTML page self_contained: true ## Includes all plots/images within the HTML code_download: true ## Adds a button to download the Rmd code_folding: show thumbnails: false lightbox: true fig_caption: false gallery: true use_bookdown: true always_allow_html: true ## Allow plain HTML code in the Rmd editor_options: markdown: wrap: 72 --- ```{r knit_setup, echo = FALSE} knitr::opts_chunk$set( echo = TRUE, # Print the code eval = TRUE, # Run command lines message = FALSE, # Print messages prompt = FALSE, # Do not display prompt comment = NA, # No comments on this section warning = FALSE, # Display warnings tidy = FALSE, fig.align="center", # results = 'hide', width = 100 # Number of characters per line ) ``` ```{css, echo=FALSE} .notrun { background-color: lightgrey !important; border: 3px solid black !important; } .notruno { background-color: lightgrey !important; color : black !important; } .question { background-color: aquamarine !important; color : black !important; border: 3px solid limegreen !important; } .questiono { background-color: aquamarine !important; color : black !important; } .answer { background-color: navajowhite !important; border: 3px solid brown !important; } .answero { background-color: navajowhite !important; color : black !important; } .beyond { background-color: violet !important; border: 3px solid purple !important; } .beyondo { background-color: violet !important; color : black !important; } ``` ```{r knit_hook, echo = FALSE} hooks = knitr::knit_hooks$get() hook_foldable = function(type) { force(type) function(x, options) { res = hooks[[type]](x, options) if (isFALSE(options[[paste0("fold.", type)]])) return(res) paste0( "
Show ", type, "\n\n", res, "\n\n
" ) } } knitr::knit_hooks$set( output = hook_foldable("output"), plot = hook_foldable("plot") ) ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------
![](eb3i_banner.png)
------------------------------------------------------------------------ ------------------------------------------------------------------------ # PREAMBLE ## Purpose of this session This file describes the different steps to perform the **first part** of the **single cell RNAseq** data analysis training course for the **EB3I n1 2025**, covering these steps : - **Load** a single cell raw counts **matrix** - **Remove** barcodes from **empty droplets** - **Estimate** and remove **ambient RNA** contamination *("soup")* ## An important notice
![](gyro.png)
- These early (but mandatory) steps of the analysis are not covered by the renowned, higher-level analysis frameworks such as **Seurat** or **scran/scater**.
- Thus, we will write/use a certain amount of **lines of code**, some of which may be complex for R beginners.
- **Newcomer, do not fear** !
- While you are **welcome to try** writing some by yourself, depending on your R knowledge and self-confidence ...
- ... you do not have to know [**how**]{.underline} this code works ...
- ... so you **should not feel ashamed** of copy-pasting the pre-written code hereafter ...
- ... but in both cases, please try your best to understand [**what**]{.underline} this code does, and even more, [**why**]{.underline} !
- For this purpose, you may use whatever **help** you can find by **searching**.

## What to run ?

- This training presentation, written in Rmarkdown, contains **many** chunks (= code blocks)
- You **do not have to** run them all !
- Chunks follow a **color code** :

### Chunks you **ARE** expected to run :

```{r chunk_run}
# chunk_run
## Example of a chunk to run
getwd()
```
### Chunks you **ARE NOT** expected to run

We will demo these ourselves. You may run them by yourself, young padawan, but if you fall behind or, worse, break something, *you*'ll be the one to blame !

```{r chunk_demo, class.source="notrun", class.output="notruno"}
# chunk_demo
## Example of a chunk NOT to run
utils::sessionInfo()
```
### Questions

Some questions we would like you to answer during the training course. Please be fair and do not peek at the answer until we tell you to :)

```{r chunk_question, class.source="question", eval = FALSE}
# chunk_question
## What's the Answer to Life ? The Universe ? Everything ??
```
### Answers They usually follow questions... ```{r chunk_answer, class.source = c("fold-hide", "answer"), class.output="answero"} # chunk_answer cat("The Answer to the Great Question... is ... Forty-two. Six by nine : forty-two. That's it. That's all there is. (I always thought something was fundamentally wrong with the universe.)") ```
### Additional exercises / questions

These are beyond the scope of the training course, but you may play with them *if you feel bored*

```{r chunk_beyond, class.source="beyond", class.output="beyondo"}
# chunk_beyond
## Example of a chunk to play with, but beyond the scope
getwd()
```
------------------------------------------------------------------------

------------------------------------------------------------------------

# Start RStudio

- Using the [OpenOnDemand/Rstudio cheat sheet](https://moodle.france-bioinformatique.fr/pluginfile.php/1475/mod_folder/content/0/OoD_R_Rstudio.html){target="_blank"}, connect to the [OpenOnDemand portal](https://ondemand.cluster.france-bioinformatique.fr){target="_blank"} and **create an RStudio session** with the right resource requirements.

------------------------------------------------------------------------

------------------------------------------------------------------------

# Warm-up

- We now set **common parameters** as new variables, once and for all for this session :

```{r setparam}
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Check that the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))
## Seed for the RNG
my_seed <- 1337L
## Empty droplets max p-value
max_p <- 1E-03
```

------------------------------------------------------------------------

------------------------------------------------------------------------

# Prepare the data structure

- During the current and remaining training sessions, we will work on different **data** **objects**, some retrieved from **external** resources, some we will **create**.
- In order to reuse these in further steps or other sessions, we need to **store** them, and thus create a **directory structure** on the cluster.
- We will create this structure in **your project** folder (the one you registered at the IFB Core, in which you stored your own data for the tutoring).
- To ensure that *you* will always use the same structure for *your* work, the code hereafter has been written so that you just have to define a single **variable** that corresponds to your project name.
- In *MY* case, as a SC training teacher, it is **`ebaii_sc_teachers`**.

## Main directory

- Create the training course directory

```{r maindir, fold.output = FALSE}
# maindir
## Prepare the path in a "TD_dir" variable
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Create the root directory
dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)
```

## Current session

```{r sessiondir, fold.output = FALSE}
# sessiondir
## Create the session (Preproc.1) directory
session_dir <- paste0(TD_dir, "/01_Preproc.1")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)
```

## Input data directory

```{r indir, fold.output=FALSE}
# indir
## Create the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)
```

## Output results directory

```{r outdir, fold.output=FALSE}
# outdir
## Create the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)
## Print the output directory on-screen
print(output_dir)
```

[**Note :**]{.underline} *Actually, for this session we won't export any result, so creating this **`RESULTS`** folder was just for practice purposes.*
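Optionally, you can check that the whole structure was created as expected. Here is a minimal sketch using base R only (the chunk label is ours) :

```{r checkdirs, class.source="beyond", class.output="beyondo"}
# checkdirs (optional check, not required for the session)
## List the directories created under the training course root
list.dirs(path = TD_dir, recursive = TRUE)
```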
Now, we have everything needed to :

- **load** data (`input_dir`)
- **save** the results we will generate (`output_dir`).

------------------------------------------------------------------------

------------------------------------------------------------------------

# Additional resources

**However**, some upcoming steps require somewhat complicated methods that are not fun to write, nor to copy-paste multiple times. So, we will now prepare **functions** that will ease these steps. Once their code has been executed, we can call (invoke) these functions whenever needed.

## Functions

### SoupX helper function

This function was written to ease the use of SoupX on a count matrix. It was tested on `SoupX`=1.6.2 with `igraph`=1.5.1, `Seurat`=4.4.0,5.1.0

**Parameters** :

- `scmat_filt` : (sparse)matrix corresponding to an empty droplets-filtered count matrix
- `scmat_raw` : (sparse)matrix corresponding to a count matrix NOT filtered for empty droplets
- `soupQuantile` : see `?SoupX::autoEstCont`
- `contaminationRange` : range of the *expected* soup proportion (see `?SoupX::autoEstCont`)
- `soupRange` : estimate the soup fraction from features whose total counts fall within this range (only used when `scmat_raw` is not `NULL`)
- `return_object` : if `TRUE`, return the "unsouped" count matrix ; if `FALSE`, only perform the soup estimation and return the estimated proportion value (rho)
- `doPlot` : if `TRUE`, draw the SoupX estimation plot (rho distribution)

```{r soupx_func}
# soupx_func
SoupX_auto <- function(scmat_filt = NULL, scmat_raw = NULL, soupQuantile = 0.9,
                       contaminationRange = c(.01, .8), soupRange = c(0, 100),
                       return_object = FALSE, doPlot = FALSE) {
  ## Checks
  if(is.null(scmat_filt)) stop('A filtered count matrix is required !')
  if(is.null(scmat_raw)) message('No unfiltered raw counts matrix provided. Estimation will be based on filtered matrix only.')
  ## If no raw matrix
  if (is.null(scmat_raw)) {
    spChanRaw <- SoupX::SoupChannel(
      tod = scmat_filt,
      toc = scmat_filt,
      calcSoupProfile = FALSE)
    sc_rowsum <- sparseMatrixStats::rowSums2(scmat_filt)
    spProf <- data.frame(
      row.names = rownames(scmat_filt),
      est = sc_rowsum/sum(scmat_filt),
      counts = sc_rowsum)
    spChan <- SoupX::setSoupProfile(spChanRaw, spProf)
  } else {
    spChan <- SoupX::SoupChannel(
      tod = scmat_raw,
      toc = scmat_filt,
      calcSoupProfile = FALSE)
    if (min(spChan$nDropUMIs) > max(soupRange)) stop(
      'Minimum found counts per barcode is : ', min(spChan$nDropUMIs),
      ', which is larger than the upper bound of soupRange !
      Please increase soupRange max !')
    spChan <- SoupX::estimateSoup(sc = spChan, soupRange = soupRange)
  }
  ## Display Top 20 contributing genes
  if (!return_object) {
    cat('\nSoup-contributing features (Top 20) :\n')
    print(knitr::kable(head(
      spChan$soupProfile[order(spChan$soupProfile$est, decreasing = TRUE), ],
      n = 20)))
  }
  ## Quick clustering needed
  spClust <- scran::quickCluster(scmat_filt, method = "igraph")
  ## Adding clusters to the SoupChannel object
  spChan <- SoupX::setClusters(sc = spChan, clusters = spClust)
  ## Estimating soup
  sX <- SoupX::autoEstCont(sc = spChan, doPlot = doPlot, tfidfMin = 1,
                           soupQuantile = soupQuantile, maxMarkers = 100,
                           contaminationRange = contaminationRange,
                           rhoMaxFDR = .2, priorRho = .05, priorRhoStdDev = .1,
                           forceAccept = FALSE)
  ## Removing soup (adjusting counts)
  if(return_object) {
    cat('Counts BEFORE SoupX : ', sum(scmat_filt), '\n')
    scmat_soupx <- SoupX::adjustCounts(
      sX,
      method = 'subtraction',
      roundToInt = TRUE,
      tol = .001,
      pCut = .01)
    cat('Counts AFTER SoupX : ', sum(scmat_soupx), '\n')
    rm(scmat_filt)
    return(scmat_soupx)
  } else return(sX$fit$rhoEst)
}
```
------------------------------------------------------------------------

------------------------------------------------------------------------

# Load a 10X Cell Ranger raw count matrix

## The dataset

- The data we will work on is hosted on [Zenodo](https://zenodo.org){target="_blank"}. To ease its retrieval, we can define its ID as a variable

```{r zid}
# zid
## This is actually to help me update this training without hassle
zen_id <- '14034221'
## This is the path to the current EB3I backup
sessionid <- '2538_eb3i_n1_2025'
```

- Here we will train ourselves to load into R single cell RNAseq data produced by 10X Genomics' software **`Cell Ranger`**.
- We will work with a public dataset provided by the manufacturer, which consists of \~ **10,000 PBMC** (peripheral blood mononuclear cells) from a human donor (downsampled to 1/10th, for smaller computation time).
- The experiment was performed with the **3' capture kit v3**.
- The analysis was performed with **`Cell Ranger v3`**, mapping on the **`GRCh38-2020-A`** manufacturer reference.
- The data consists of the typical 10X 3-file structure, hosted in a Zenodo repository (Id : [`r zen_id`](https://zenodo.org/records/14034221 "Zenodo Preproc.1"){target="_blank"})
    - `barcodes.tsv.gz` : the list of putative droplet **barcodes**
    - `features.tsv.gz` : the list of **features** (genes)
    - `matrix.mtx.gz` : the **feature** x **barcode** expression count **matrix**

## Download data

- The relatively complex code chunk hereafter has been written to :
    - let you download the data files from :
        - Zenodo by default
        - if `backup = TRUE`, a local backup instead (in case we lose the internet connection)
    - if the requested data are already available locally :
        - load them from the local source
        - if `force = TRUE`, force a new download

```{r dlzen10x, message = TRUE, echo = TRUE}
# dlzen10x
### Named files (will be used later on !)
mtx_file <- "matrix.mtx.gz"
features_file <- "features.tsv.gz"
barcodes_file <- "barcodes.tsv.gz"
summary_file <- "pbmc_10k_v3_summary.html"
## Filename(s) to retrieve
toget_files <- c(mtx_file, features_file, barcodes_file, summary_file)
## Folder to store retrieved files
local_folder <- input_dir
## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")
## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")
### Define remote folder
remote_folder <- if (backup) {
  paste0("/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/10X/")
} else {
  paste0("https://zenodo.org/records/", zen_id, "/files/")
}
### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)
### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)
## Retrieve files (if they don't exist), in a loop
for (tg in seq_along(toget_files)) {
  ## If the file does not locally exist (or if download is forced)
  if (!file.exists(local_path[tg]) | force) {
    ## Retrieve data
    if(backup) {
      file.copy(from = remote_path[tg], to = local_path[tg])
    } else {
      download.file(url = remote_path[tg], destfile = local_path[tg])
    }
    ## Check if downloaded files exist locally
    if(file.exists(local_path[tg])) message("\tOK")
  } else message(paste0(toget_files[tg], " already downloaded !"))
}
```

## Load into R

- **Load** the **10X** **data** files into R :

```{r q_r10X, class.source="question", eval = FALSE}
# q_r10X
How ?
```

- A good starter

```{r a_r10X1, class.source = c("fold-hide", "answer"), echo = FALSE}
# a_r10X1
cat('https://lmddgtfy.net/?q=load%2010x%20data%20in%20R')
```

- Back to your local help resource

```{r a_r10X2, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_r10X2
## Reading the function help page
?Seurat::Read10X
```

We can now apply it to our data : ```{r load10x, class = "fold-hide"} # load10x ## Loading unfiltered 10X matrix scmat <- Seurat::Read10X(data.dir = input_dir) ```
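As a side note (not needed for the rest of this session), the same 3-file folder could also be loaded with `DropletUtils`, which returns a `SingleCellExperiment` rather than a plain sparse matrix. A minimal sketch (the `sce` and `scmat_alt` names are purely illustrative) :

```{r read10x_alt, eval = FALSE, class.source="beyond", class.output="beyondo"}
# read10x_alt (optional alternative, not used in the rest of this session)
## Load the same 10X triplet as a SingleCellExperiment
sce <- DropletUtils::read10xCounts(samples = input_dir, col.names = TRUE)
## The counts are stored as a sparse matrix in the "counts" assay
scmat_alt <- SingleCellExperiment::counts(sce)
```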
- However, we don't know what the loaded object actually looks like in R... - **Questions** : - Do you know a way to ... ```{r q_whatis, class.source="question", eval = FALSE} # q_whatis ... know what IS the type of object that was created ? ```
```{r a_whatis, class.source = c("fold-hide", "answer"), eval = FALSE} # a_whatis ## To know the type of an R object : methods::is() ?methods::is ``` ```{r gois, class.source = "fold-hide"} # gois ## Get the type methods::is(scmat) ```

```{r q_struc, class.source="question", eval = FALSE} # q_struc ... observe the STRucture of the resulting object ? ```
```{r a_struc, class.source = c("fold-hide", "answer"), eval = FALSE} # a_struc ## To get a basic structure description : utils::str() ?utils::str ``` ```{r go_str, class.source = "fold-hide"} # go_str ## Get the basic structure utils::str(scmat) ```
Not that meaningful, though... At least we know we have a matrix, so one can get its dimensions ! ```{r scmat_dim} # scmat_dim ## Dimensions of a >1D object dim(scmat) ```
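To get a more concrete feel for this object (optional), one can also peek at a small corner of the sparse matrix and check its memory footprint :

```{r peek_matrix, class.source="beyond", class.output="beyondo"}
# peek_matrix (optional)
## Display a small corner of the sparse matrix (dots stand for zeros)
scmat[1:5, 1:3]
## Memory footprint of the sparse representation
format(utils::object.size(scmat), units = "MB")
```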
## Features cleanup

`Seurat` does not like **underscores** (`_`) in feature names at all ... We will have to :

- **check** whether our dataset contains any. Here, we will use the `base::grep()` function, which searches character strings `x` for a given `pattern` and returns either the positions of the matches (when `value = FALSE`, the default) or the matching values themselves (`value = TRUE`)

```{r us_check1}
# us_check1
## Check feature names with underscore(s)
base::grep(pattern = '_', x = rownames(scmat), value = TRUE)
```
- if so, **clean** them ! Here we will use the `base::gsub()` function, which looks for a `pattern` in a character string `x` and substitutes it with a `replacement` character string.

```{r us_clean}
# us_clean
## Clean features : replacing underscores by dashes
rownames(scmat) <- base::gsub(
  pattern = "_",
  replacement = "-",
  x = rownames(scmat))
```

- then, **check** the efficiency of our cleaning (with a new use of `base::grep`)

```{r us_check2, class.source="notrun", class.output="notruno"}
# us_check2
## Post-cleanup control
base::grep(pattern = "_", x = rownames(scmat), value = TRUE)
```
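If you want to see `grep()` / `gsub()` at work on something smaller, here is a toy example (the feature names are made up, for illustration only) :

```{r toy_gsub, class.source="beyond", class.output="beyondo"}
# toy_gsub (toy example with made-up feature names)
toy_features <- c("HLA_A", "CD3D", "MT_CO1")
## grep() returns the matching values (value = TRUE) or their positions (value = FALSE)
base::grep(pattern = "_", x = toy_features, value = TRUE)
## gsub() substitutes every match of the pattern
base::gsub(pattern = "_", replacement = "-", x = toy_features)
```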
------------------------------------------------------------------------

------------------------------------------------------------------------

# Empty droplets

- While a large part of our barcodes contain **no count** at all, barcodes with at least one count **do not** necessarily correspond to **true cells**, due to ambient RNA.
- Here we want to **identify** empty droplets, then **remove** them from the matrix.

## Detection

- To detect the boundary between "true" cells and empty droplets, we will rely on the so-called **`double-kneeplot`** : a plot of **UMI counts per barcode**, as a function of their **rank** (barcodes being ordered by decreasing total counts).
![](elbow_knee_KNEE.png)

- To draw such a plot, we need to **rank the barcodes** by their total counts.

### Ranking barcodes

- We will use the [**DropletUtils**](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html){target="_blank"} package

```{r h_barcodeRanks, eval = FALSE}
# h_barcodeRanks
## Read the function help page
?DropletUtils::barcodeRanks()
```

```{r bcranks}
# bcranks
## Generate the rank statistics
bc_rank <- DropletUtils::barcodeRanks(scmat)
```

- Check the ranking of the barcodes according to their counts

```{r bcrank_top, class.source="notrun", class.output="notruno", fold.output = FALSE}
# bcrank_top
## Show a summary of the ranks (ordered for display convenience)
print(bc_rank[order(bc_rank$rank),])
```
### Kneeplot Now, we can **draw** our kneeplot : ```{r kneeplot1, class.source="notrun", class.output="notruno", fold.plot = FALSE} # kneeplot1 ## Plot graphics::plot(x = bc_rank$rank, y = bc_rank$total + 1, log = "xy", xlab = "Barcode rank", ylab = "Total UMIs", col = "black", pch = ".", cex = 5, main = "Kneeplot") ```
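Optionally, the knee and inflection points computed by `DropletUtils::barcodeRanks()` (stored in the metadata of `bc_rank`) can be overlaid on this kneeplot. A minimal sketch (the `bc_meta` name is ours) :

```{r knee_lines, class.source="beyond", class.output="beyondo", fold.plot = FALSE}
# knee_lines (optional overlay of the knee and inflection points)
## Knee and inflection points are stored in the metadata of the barcodeRanks output
bc_meta <- S4Vectors::metadata(bc_rank)
graphics::plot(x = bc_rank$rank, y = bc_rank$total + 1, log = "xy",
               xlab = "Barcode rank", ylab = "Total UMIs",
               col = "black", pch = ".", cex = 5, main = "Kneeplot")
graphics::abline(h = bc_meta$knee, col = "dodgerblue", lty = 2)
graphics::abline(h = bc_meta$inflection, col = "forestgreen", lty = 2)
graphics::legend("bottomleft", legend = c("knee", "inflection"),
                 col = c("dodgerblue", "forestgreen"), lty = 2)
```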
- **Question** : ```{r q_knee, class.source="question", eval = FALSE} # q_knee Looking at the kneeplot, can you roughly predict how many barcodes will be kept as cells (ie, as non-empty droplets) ? ```
```{r a_knee, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_knee
## . The "cliff" in the kneeplot is at a
##   rank value a bit below ~ 1,000.
```

Then, we can use these ranks to **identify "true" cells** :

```{r h_emptyDrops, class.source="notrun", class.output="notruno", eval = FALSE}
# h_emptyDrops
## Read the function help page
?DropletUtils::emptyDrops
```

The function we will use relies on random number generation, so we will fix the RNG seed, for reproducibility.

```{r emptydrops1}
# emptydrops1
## Set the RNG seed
set.seed(my_seed)
## Identify empty droplets
bc_rank2 <- DropletUtils::emptyDrops(scmat)
## What is the structure of the returned output ?
str(bc_rank2)
```

We can show a summary of the ordered barcodes :

```{r emptysmry, class.source="notrun", class.output="notruno"}
# emptysmry
## Show a summary of the qualified ranks (ordered for display convenience)
print(bc_rank2[order(bc_rank2$Total, decreasing = TRUE),])
```
### Selection Barcodes considered as empty droplets have **no p-value** (`FDR` = NA) ```{r bcr2_na} # bcr2_na ## Creating a validation column (empty droplets have NA as p-value) bc_rank2$VALID <- !is.na(bc_rank2$FDR) ## Quantify "real cells" table(bc_rank2$VALID) ```
We can **control the stringency** by setting an FDR-adjusted p-value **threshold**

```{r valid_fdr}
# valid_fdr
## Recall the max p-value we set as a variable early on
max_p
## We've set the max p-value to 1E-03
bc_rank2$VALID[bc_rank2$FDR >= max_p] <- FALSE
## Control
table(bc_rank2$VALID)
```
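As an optional sanity check (suggested in the `DropletUtils` documentation), one can verify that no barcode was called non-significant only because the number of Monte Carlo iterations (`niters`) was too low :

```{r limited_check, class.source="beyond", class.output="beyondo"}
# limited_check (optional diagnostic)
## Barcodes with Limited = TRUE and a non-significant FDR would call
## for a higher 'niters' value in DropletUtils::emptyDrops()
table(Significant = bc_rank2$FDR < max_p, Limited = bc_rank2$Limited, useNA = "ifany")
```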
Now, we can **display** our **true cells** on the kneeplot : ```{r kneeplot2, fig.align="center"} # kneeplot2 ## Plot with highlight of the selected barcodes (red) graphics::plot(x = bc_rank$rank, y = bc_rank$total+1, log = "xy", xlab = "Barcode rank", ylab = "Total UMIs", col = ifelse(bc_rank2$VALID, "red", "black"), pch = ".", cex = ifelse(bc_rank2$VALID, 7, 3), main = paste0("Kneeplot (", length(which(bc_rank2$VALID)), " barcodes kept)")) ```
- **Question** : ```{r q_kneered, class.source="question", eval = FALSE} # q_kneered Is there something unexpected for the retained "TRUE" cells (red) in this kneeplot ? ```
```{r a_kneered, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_kneered
## . You may observe that the selection of "true" cells (red)
##   does not strictly correspond to a "cut" in the kneeplot
##   descending curve.
##
## . This is because emptyDrops does not only "follow" the curve
##   to determine a threshold corresponding to a "minimal count"
##   value to consider a barcode as a cell, but also performs a
##   statistical analysis for each barcode, considering how much
##   its expression profile resembles the one of other cells,
##   even with a very different (lower) global level of expression.
```

## Removal We just need to restrict our count matrix to "true" cells. - How many barcodes do we have at the moment ? ```{r scdim3, class.source="notrun", class.output="notruno"} # scdim3 dim(scmat) ```
- Filter empty droplets out : ```{r edfilter} # edfilter ## Restrict to valid barcodes scmat_cells <- scmat[, bc_rank2$VALID] ## Control dim(scmat_cells) ```
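To quantify what was discarded (optional), one can compare the number of barcodes and the total counts before and after this filtering :

```{r ed_kept, class.source="beyond", class.output="beyondo"}
# ed_kept (optional)
## Fraction of barcodes kept as cells
ncol(scmat_cells) / ncol(scmat)
## Fraction of total counts kept
base::sum(scmat_cells) / base::sum(scmat)
```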
------------------------------------------------------------------------

------------------------------------------------------------------------

# Ambient RNA

- The counts measured in this matrix of (now considered "true") single cells do **not** reflect the expression of these cells **only**
- Due to other **over-lysed / dying cells** in the emulsion, we also measured a fraction of **ambient RNA** that adds up to the "real" expression
- We may **estimate** it by observing counts present at a **minimal level** across cells that greatly differ in their expression profiles
- Following the (fun) name of the tool we will use, `SoupX`, "soup" will be used hereafter as an alias for "ambient RNA" ; both terms have the same meaning.
## Estimate the "soup" [**SoupX**](https://cran.r-project.org/web/packages/SoupX/readme/README.html){target="_blank"} can estimate the rate and composition of ambient RNA, in several ways : - **Manual mode** : providing a list of features expected to reflect the soup (genes expressed at high levels in a majority of the expected cell types) - **Automatic modes**, using the empty droplets-**filtered** raw count matrix ... - ... along **with the unfiltered** matrix (ie, containing all quantified droplets). This is the most efficient mode. - ... **solely** : this is less efficient, but useful when the unfiltered matrix is not available. ```{r h_SoupX_auto, eval = FALSE, class.source="notrun", class.output="notruno"} # h_SoupX_auto ## Reading the SoupX package main help page ?SoupX::SoupX ``` Let's run SoupX, just asking for an estimation of the "soup" proportion : ```{r soupxrhoE} # soupxrhoE ## TIP : In case of error : # install.packages("irlba", type="source", force=TRUE) ## SoupX for ambient fraction estimation soup_frac <- SoupX_auto( scmat_filt = scmat_cells, scmat_raw = scmat) ```
***NOTE*** : We can observe [**MALAT1**](https://www.biorxiv.org/content/10.1101/2024.07.14.603469v2){target="_blank"} as the top gene (along with multiple ribosomal protein-coding genes).
What's the estimated fraction of soup in our counts ? ```{r soupxfrac} # soupxfrac cat("Soup fraction : ", soup_frac) ```
- **Question** : ```{r q_souprate, class.source="question", eval = FALSE} # q_souprate Should we try to remove such an amount of ambient RNA ? ```
```{r a_souprate, class.source = c("fold-hide", "answer"), eval = FALSE} # a_souprate The best way to know is to try it out =D ```
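Side note : the `doPlot` argument of our `SoupX_auto()` helper (defined above, but not used so far) can draw SoupX's rho estimation plot. An optional sketch, not run here since it repeats the whole estimation (the `soup_frac_plot` name is ours) :

```{r soupx_rhoplot, eval = FALSE, class.source="beyond", class.output="beyondo"}
# soupx_rhoplot (optional, re-runs the estimation just to draw the rho plot)
soup_frac_plot <- SoupX_auto(
  scmat_filt = scmat_cells,
  scmat_raw = scmat,
  doPlot = TRUE)
```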
## Remove the "soup"

SoupX can **remove** the soup (subtract the counts of identified soup genes from all barcodes/cells) using different methods :

- **Automatically** (safer)
- Specifying a **fraction of counts** to remove *(a sketch of this manual mode is given after the chunk below)* :
    - *Pros* :
        - By removing X % of reads :
            - if soup was effectively present (at a rate \<= X), this will **mostly affect soup** genes first.
            - If not, it will **decrease counts** globally, thus have a negligible effect on differences between cells / cell types.
    - *Cons* :
        - **Counts are** already **rare** in droplet-based single cell technologies... Removing some "by default" makes them **even rarer**...
        - If the actual soup **rate is higher** than the defined rate X to remove, this mode will remain **ineffective**...

```{r soupxrm}
# soupxrm
## Run SoupX in "removal" mode
scmat_unsoup <- SoupX_auto(
  scmat_filt = scmat_cells,
  scmat_raw = scmat,
  return_object = TRUE)
```
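For reference, here is a minimal sketch (not run) of the manual mode mentioned above, calling `SoupX` directly ; the 5% fraction and the `scmat_manual` name are purely illustrative :

```{r soupx_manual, eval = FALSE, class.source="beyond", class.output="beyondo"}
# soupx_manual (sketch of the manual fraction mode, not run)
## Build a SoupChannel from the unfiltered (tod) and filtered (toc) matrices
spChan <- SoupX::SoupChannel(tod = scmat, toc = scmat_cells)
## Quick clustering, as required by adjustCounts()
spChan <- SoupX::setClusters(
  sc = spChan,
  clusters = scran::quickCluster(scmat_cells, method = "igraph"))
## Manually impose the contamination fraction to remove (here, an arbitrary 5%)
spChan <- SoupX::setContaminationFraction(sc = spChan, contFrac = 0.05)
## Subtract the corresponding counts
scmat_manual <- SoupX::adjustCounts(sc = spChan, roundToInt = TRUE)
```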
## Effect of the "soup" removal

- Now, we want to **characterize the effect** of the ambient RNA removal, by comparing the pre- and post-SoupX matrices.
- An easy way is to **describe**, then **visualize**, such an effect.

### Describe and visualize {.tabset .tabset-fade .tabset-pills}

#### Before SoupX (DEMO)

We can describe our count matrix thanks to some useful metrics :

- Total counts

```{r descBS_totcount, class.source="notrun", class.output="notruno"}
# descBS_totcount
## All counts
base::sum(scmat_cells)
```

- Sparsity (fraction of `0`s)

```{r descBS_sparsity, class.source="notrun", class.output="notruno"}
# descBS_sparsity
## Compute sparsity (number of matrix values at 0 / cross-product of matrix dim)
base::sum(sparseMatrixStats::colCounts(
  x = scmat_cells,
  value = 0)
) / base::prod(dim(scmat_cells))
```

- Distribution of counts in cells

```{r descBS_nCount, class.source="notrun", class.output="notruno"}
# descBS_nCount
## Sums up counts for each cell
base::summary(sparseMatrixStats::colSums2(x = scmat_cells, na.rm = TRUE))
```

- Distribution of expressed genes in cells

```{r descBS_nFeature, class.source="notrun", class.output="notruno"}
# descBS_nFeature
## All genes minus 0-count genes
base::summary(nrow(scmat_cells) - sparseMatrixStats::colCounts(
  x = scmat_cells,
  value = 0,
  na.rm = TRUE))
```

#### After SoupX

We can describe our count matrix thanks to some useful metrics :

- Total counts

```{r descAS_totcount}
# descAS_totcount
## All counts
base::sum(scmat_unsoup)
```

- Sparsity (fraction of `0`s)

```{r descAS_sparsity}
# descAS_sparsity
## Compute sparsity (number of matrix values at 0 / cross-product of matrix dim)
base::sum(sparseMatrixStats::colCounts(
  x = scmat_unsoup,
  value = 0)
) / base::prod(dim(scmat_unsoup))
```

- Distribution of counts in cells

```{r descAS_nCount}
# descAS_nCount
## Sums up counts for each cell
base::summary(sparseMatrixStats::colSums2(x = scmat_unsoup, na.rm = TRUE))
```

- Distribution of expressed genes in cells

```{r descAS_nFeature}
# descAS_nFeature
## All genes minus 0-count genes
base::summary(nrow(scmat_unsoup) - sparseMatrixStats::colCounts(
  x = scmat_unsoup,
  value = 0,
  na.rm = TRUE))
```

## {.unnumbered}
------------------------------------------------------------------------ ------------------------------------------------------------------------
- **Questions** : ```{r q_descdiff, class.source="question", eval = FALSE} # q_descdiff Can you describe the difference (especially for counts and sparsity) ? Was it expected ? ```
```{r a_descdiff, class.source = c("fold-hide", "answer"), eval = FALSE} # a_descdiff ## . Total counts, max value, counts per cell and number of ## expressed features per cell, all have decreased. ## ## . The sparsity level has slightly increased (removing the soup ## obliterated all counts for some features in some cells). ## ## . All of this is absolutely expected. ```
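For convenience, the same kind of metrics can be gathered into a compact side-by-side table. An optional sketch reusing the objects created above (the `mat_metrics` helper name is ours) :

```{r soup_summary, class.source="beyond", class.output="beyondo"}
# soup_summary (optional compact comparison of the two matrices)
## Small helper computing a few descriptive metrics on a count matrix
mat_metrics <- function(m) {
  c(total_counts = base::sum(m),
    sparsity     = base::sum(sparseMatrixStats::colCounts(x = m, value = 0)) /
                   base::prod(dim(m)),
    median_counts_per_cell   = stats::median(sparseMatrixStats::colSums2(x = m)),
    median_features_per_cell = stats::median(nrow(m) -
                   sparseMatrixStats::colCounts(x = m, value = 0)))
}
## Side-by-side comparison, before vs after SoupX
round(rbind(Before_SoupX = mat_metrics(scmat_cells),
            After_SoupX  = mat_metrics(scmat_unsoup)), 3)
```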
What are the features most affected by the soup removal ?

```{r soupfeatures, class.source="notrun", class.output="notruno"}
# soupfeatures
## Compute the fraction matrix (PRE / POST)
## NOTE : we may have 0 counts in the divisor,
##        so we increment both matrices by +1 !
cont_frac <- (scmat_cells + 1) / (scmat_unsoup + 1)
## Fraction of counts decreased by soup removal, per feature (ordered)
feat_frac <- sort(sparseMatrixStats::rowMeans2(cont_frac), decreasing = TRUE)
### Display the Top 5 features
utils::head(feat_frac, n = 5)
## Removing objects to free some RAM
rm(cont_frac, feat_frac)
```

```{r qsoupfeat, class.source="question", eval = FALSE}
# qsoupfeat
Does something about these genes strike you ?
```

Hint : **fraction** or **difference** ?

```{r soupfeaturesdiff, class.source=c("fold-hide", "answer"), class.output="answero"}
# soupfeaturesdiff
## Compute the DIFFERENCE matrix (PRE - POST)
## NOTE : no need for the +1 increment here, as there is no division
cont_dif <- scmat_cells - scmat_unsoup
## Mean difference of counts removed by soup removal, per feature (ordered)
feat_dif <- sort(sparseMatrixStats::rowMeans2(cont_dif), decreasing = TRUE)
### Display the Top 5 features
utils::head(feat_dif, n = 5)
## Removing objects to free some RAM
rm(cont_dif, feat_dif)
```
### Plot the "top 1" feature {.tabset .tabset-fade .tabset-pills}

We want to visualize the expression of the "top" soup gene, LYZ, in our dataset. To do so, we will generate a projection of the data into a reduced (2-D) mathematical space. This is done using a series of R functions from the Seurat package that we won't describe here and now : understanding the commands below is not within the scope of our current session, but it will be during the next few days. Actually, these commands perform, in a row, all the steps we'll see in the next few days, but with _default values_ (thus, almost certainly inadequate for our current data). As a consequence, please just consider the output as a quick and dirty way to visualize the data : the depicted topology should be considered as very rough and mostly imperfect.

#### Before SoupX

```{r bsoupx, fig.height = 6, fig.width = 6, class.source="notrun", class.output="notruno"}
# bsoupx
## Roll the process, using a temporary "sobj_presoupx" Seurat object
sobj_presoupx <- Seurat::CreateSeuratObject(counts = scmat_cells, project = "PBMC10K_BeforeSoupX")
sobj_presoupx <- Seurat::NormalizeData(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::ScaleData(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::FindVariableFeatures(object = sobj_presoupx, verbose = FALSE)
sobj_presoupx <- Seurat::RunPCA(object = sobj_presoupx, npcs = 21, verbose = FALSE)
sobj_presoupx <- Seurat::RunUMAP(object = sobj_presoupx, dims = c(1:20), verbose = FALSE)
## Top1 feature plot
fp_pre <- Seurat::FeaturePlot(object = sobj_presoupx, features = 'LYZ') + Seurat::DarkTheme()
print(fp_pre)
```
#### After SoupX ```{r asoupx, fig.height = 6, fig.width = 6} # asoupx ## Roll process, using a temporary "sobj_postsoupx" Seurat object sobj_postsoupx <- Seurat::CreateSeuratObject(counts = scmat_unsoup, project = "PBMC10K_AfterSoupX") sobj_postsoupx <- Seurat::NormalizeData(object = sobj_postsoupx, verbose = FALSE) sobj_postsoupx <- Seurat::ScaleData(object = sobj_postsoupx, verbose = FALSE) sobj_postsoupx <- Seurat::FindVariableFeatures(object = sobj_postsoupx, verbose = FALSE) sobj_postsoupx <- Seurat::RunPCA(object = sobj_postsoupx, npcs = 21, verbose = FALSE) sobj_postsoupx <- Seurat::RunUMAP(object = sobj_postsoupx, dims = c(1:20), verbose = FALSE) ## Top1 feature plot fp_post <- Seurat::FeaturePlot(object = sobj_postsoupx, features = 'LYZ') + Seurat::DarkTheme() print(fp_post) ```
## {.unnumbered} ------------------------------------------------------------------------ ------------------------------------------------------------------------ Merging plots for ease of use : ```{r soupx_umaps, fig.width = 12, fig.height = 6, class.source="beyond", class.output="beyondo"} # soupx_umaps ## Using the patchwork package to merge plots (and ggplot2 to add titles) patchwork::wrap_plots( list( fp_pre & ggplot2::ggtitle(label = "LYZ before SoupX"), fp_post & ggplot2::ggtitle(label = "LYZ after SoupX")), nrow = 1) ```
- **Question** ```{r q_umapsoupx, class.source="question", eval = FALSE} # q_umapsoupx Can you compare the two plots : - for the LYZ expression across the 2-D space ? - about the space topologies ? - for the LYZ expression globally ? ```
```{r a_umapsoupx, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_umapsoupx
## . The level of expression measured in cells with smaller expression
##   before SoupX has gone to almost none after : the ambient expression
##   of this gene has efficiently been removed.
##
## . Fortunately, the cluster(s) with higher expression remain(s) high.
##
## . The expression scale upper bound after SoupX is higher than before ?!
##   But we REMOVED some counts ?! While being counter-intuitive, this is
##   a positive consequence of the ambient removal on the scaling of
##   barcodes expression (see later).
##
## . The range of both X and Y axes has increased, depicting a better
##   separation of / distance between some of the clusters.
##
## . The global topology of clusters remains close (but not identical).
##
## . Cluster compactness has increased.
```
## Conclusion - Removing ambient RNA has a **positive effect** : - on the **quality of the expression** level measurements - on the observed **topology** - ... despite an estimation of only ~5% ! - Actually, ambient RNA level and composition is (one of) the major source(s) of the **batch effect bias** that may alter the integration of different samples.
------------------------------------------------------------------------ ------------------------------------------------------------------------


**Beyond** :

1. One can check whether the "unsouped" (ie, post-SoupX) matrix retains some soup. How would you do it ?

```{r b_soupx2, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_soupx2
## SoupX for ambient fraction estimation
soup2_frac <- SoupX_auto(
  scmat_filt = scmat_unsoup,
  scmat_raw = scmat)
## Run SoupX in "removal" mode
scmat_unsoup2 <- SoupX_auto(
  scmat_filt = scmat_unsoup,
  scmat_raw = scmat,
  return_object = TRUE)
```

2. Save the `sobj_presoupx` R object on disk.

```{r b_save_answer1, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_save_answer1
## A single object : use the RDS format with saveRDS()
saveRDS(object = sobj_presoupx, file = paste0(output_dir, "/sobj_presoupx.RDS"))
```

3. Save both the `sobj_presoupx` and `sobj_postsoupx` objects in a single archive on disk.

```{r b_save_answer2, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_save_answer2
## Multiple objects : use the RData format with save()
## (the 'list' argument expects object NAMES, as character strings)
save(list = c("sobj_presoupx", "sobj_postsoupx"), file = paste0(output_dir, "/sobjects.RDA"))
```

4. Same as 2., but using the `bzip2` compression algorithm

```{r b_save_answer3, class.source = c("fold-hide", "beyond"), eval = FALSE}
# b_save_answer3
## Compress with bzip2
saveRDS(object = sobj_presoupx, file = paste0(output_dir, "/sobj_presoupx.RDS"),
        compress = "bzip2")
```
------------------------------------------------------------------------ ------------------------------------------------------------------------


# Rsession

For reproducibility and context, it is recommended to include in your RMarkdown the list of loaded packages and their versions.

```{r rsession, class.source="notrun", class.output="notruno"}
# rsession
utils::sessionInfo()
```