--- title: "EB3I n1 2025 scRNAseq
-
PRE-PROCESSING (II)
-
Quality control : Counts & metrics" date: "2025-16-21.22" author: - name: "EB3I n1 scRNAseq Team" - name: "Lilia YOUNSI" email: "lilia.younsi@inserm.fr" - name: "Sebastien MELLA" email: "sebastien.mella@pasteur.fr" output: rmdformats::readthedown: fig_width: 8 fig_height: 6 highlight: tango ## Theme for the code chunks embed_fonts: TRUE number_sections: true ## Adds number to headers (sections) theme: flatly ## CSS theme for the HTML page collapsed: true ## By default, the TOC is folded toc_depth: 3 smooth_scroll: true ## Smooth scroll of the HTML page self_contained: true ## Includes all plots/images within the HTML code_download: true ## Adds a button to download the Rmd code_folding: show thumbnails: false lightbox: true fig_caption: false gallery: true use_bookdown: true always_allow_html: true ## Allow plain HTML code in the Rmd editor_options: markdown: wrap: 72 --- ```{r knit_setup, echo = FALSE} knitr::opts_chunk$set( echo = TRUE, # Print the code eval = TRUE, # Run command lines message = FALSE, # Print messages prompt = FALSE, # Do not display prompt comment = NA, # No comments on this section warning = FALSE, # Display warnings tidy = FALSE, fig.align="center", # results = 'hide', width = 100 # Number of characters per line ) ``` ```{css, echo=FALSE} .notrun { background-color: lightgrey !important; border: 3px solid black !important; } .notruno { background-color: lightgrey !important; color : black !important; } .question { background-color: aquamarine !important; color : black !important; border: 3px solid limegreen !important; } .questiono { background-color: aquamarine !important; color : black !important; } .answer { background-color: navajowhite !important; border: 3px solid brown !important; } .answero { background-color: navajowhite !important; color : black !important; } .beyond { background-color: violet !important; border: 3px solid purple !important; } .beyondo { background-color: violet !important; color : black !important; } ``` ```{r knit_hook, echo = FALSE} hooks = knitr::knit_hooks$get() hook_foldable = function(type) { force(type) function(x, options) { res = hooks[[type]](x, options) if (isFALSE(options[[paste0("fold.", type)]])) return(res) paste0( "

Show ", type, "

\n\n", res, "\n\n

" ) } } knitr::knit_hooks$set( output = hook_foldable("output"), plot = hook_foldable("plot") ) ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ ![](eb3i_banner.png) ------------------------------------------------------------------------ ------------------------------------------------------------------------ # PREAMBLE ## Purpose of this session This file describes the different steps to perform the quality control of a single cell dataset, based on (per cell) : * the number of **UMI** (~transcripts) detected * the number of **expressed genes** detected * the proportion of UMI corresponding to **mitochondrial** genes * the proportion of UMI corresponding to **riboprotein-coding** genes * the proportion of transcripts related to stress signature **Input data**: A filtered, flat count matrix retrieved from GEO. **Output data**: A `Seurat` object annotated for the "bad" cells to filter out for downstream analysis. ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Start Rstudio - Using the [OpenOnDemand/Rstudio cheat sheet](https://moodle.france-bioinformatique.fr/pluginfile.php/1475/mod_folder/content/0/OoD_R_Rstudio.html){target="_blank"}, connect to the [OpenOnDemand portal](https://ondemand.cluster.france-bioinformatique.fr){target="_blank"} and **create a Rstudio session** with the right resource requirements, thanks to the cheat sheet. ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Warm-up - We now set **common parameters** as new variables, once and for all for this session : ```{r setparam} # setparam ## Set your project name # WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !! project_name <- "ebaii_sc_teachers" ## Control if the project_name exists on the cluster cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name))) ## Seed for the RNG my_seed <- 1337L # ## Empty droplets max p-value # max_p <- 1E-03 ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Prepare the data structure ## Main directory ```{r maindir} # maindir ## Preparing the path TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD") ## Creating the root directory # dir.create(path = TD_dir, recursive = TRUE) ## Print the root directory on-screen print(TD_dir) ``` ## Current session ```{r sessiondir} # sessiondir ## Creating the session (Preproc.2) directory session_dir <- paste0(TD_dir, "/02_Preproc.2") dir.create(path = session_dir, recursive = TRUE) ## Print the session directory on-screen print(session_dir) ``` ## Input directory ```{r indir} # indir ## Creating the INPUT data directory input_dir <- paste0(session_dir, "/DATA") dir.create(path = input_dir, recursive = TRUE) ## Print the input directory on-screen print(input_dir) ``` ## Genelists directory This is a directory where we will store additional information from knowledge bases about genes used to estimate the cell cycle phase of cells. ```{r resdir} # resdir res_dir <- paste0(TD_dir, "/Resources") glist_dir <- paste0(res_dir, "/Genelists") ## Create the directory dir.create(path = glist_dir, recursive = TRUE) ## Print the resources directory on-screen print(glist_dir) ``` ## Output directory ```{r outdir} # outdir ## Creating the OUTPUT data directory output_dir <- paste0(session_dir, "/RESULTS") dir.create(path = output_dir, recursive = TRUE) ## Print the output directory on-screen print(output_dir) ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Prerequisites We have to retrieve the input data file ```{r mat_dl} # mat_dl local <- FALSE ## The raw count matrix we will start from scmat_source <- "GSM4861194_gex_2_raw_gene_expression.tsv.gz" ## Download the file from Zenodo if (!local) { ### ZenID zen_id <- "14033941" ### Zen Path zen_backup_file <- paste0("https://zenodo.org/records/", zen_id, "/files/", scmat_source) ## The path to the locally saved input file scmat_file <- paste0(input_dir, '/', scmat_source) ## Download the file download.file(url = zen_backup_file, destfile = scmat_file) } else { ebaii_session <- '2538_eb3i_n1_2025' scmat_file <- paste0( '/shared/projects/', ebaii_session, '/atelier_scrnaseq/TD/BACKUP/TSV/', scmat_source) } ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Load the genelists resources We retrieve the genelists ```{r gl_dl} # gl_dl local <- FALSE ## The genelist files mito_source <- "mus_musculus_mito_symbols_20191015.rds" ribo_source <- "mus_musculus_cribo_symbols_20191015.rds" stress_source <- "mus_musculus_stress_symbols_20200224.rds" gl_sources <- c(mito_source, ribo_source, stress_source) ## The (future) local files mito_file <- paste0(input_dir, '/', mito_source) ribo_file <- paste0(input_dir, '/', ribo_source) stress_file <- paste0(input_dir, '/', stress_source) gl_files <- c(mito_file, ribo_file, stress_file) ## Download the file from Zenodo if (!local) { ### ZenID zen_id <- "14037355" ### Looping on files for (glf in seq_along(gl_sources)) { ### Zen Path zen_backup_file <- paste0("https://zenodo.org/records/", zen_id, "/files/", gl_sources[glf]) ## Download the file download.file(url = zen_backup_file, destfile = gl_files[glf]) } rm(gl_files) } else { ## Local mode ebaii_session <- '2538_eb3i_n1_2025' localbackup_dir <- paste0('/shared/projects/', ebaii_session, '/atelier_scrnaseq/TD/RESOURCES/GENELISTS/') mito_file <- paste0(localbackup_dir, '/mus_musculus_mito_symbols_20191015.rds') ribo_source <- paste0(input_dir, '/mus_musculus_cribo_symbols_20191015.rds') stress_source <- paste0(input_dir, '/mus_musculus_stress_symbols_20200224.rds') } ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Biological context For this series of training sessions, we will work on data from this [Paiva et al.](https://www.sciencedirect.com/science/article/pii/S2211124721002813) publication. * The study concerns **thymus autonomy**: * The thymus is an "organ of passage", critical in its function to the adaptive immune system for the maturation of T cell lymphocytes. * This maturation involves two main steps, performed thanks to macrophages: * Positive selection : keeping cells that successfully develop react appropriately with MHC immune receptors of the body * Negative selection : keeping cells that do not react against natural proteins of the body. * Thymus _autonomy_ is a natural mechanism that allows to create T cells in the thymus by differentiation and cell competition, even when normal progenitors from the bone marrow are lacking, in critical conditions. * This mechanism is known in its effects, but the cells involved in are not. * This study is of importance in the health field, as this mechanism relies on a temporary loss of control of the cell normal functions. * The consequence is that if thymus is in autonomy for too long (few weeks), this is a prelude for leukemia !
![](thymus_autonomy.png)
* Organism is : **mus musculus** * Individuals are : newborn **mice, grafted** * The design corresponds to **two conditions** (Test / Control) * **Control** : thymus from **wild type** newborn mice transplanted into **wild type** juvenile mouse. In this control case, **donor** T-cells progenitors (DN3) were replaced by **host** cells **3 weeks** after transplantation. * **Test** : thymus from **wild type** newborn mice transplanted in **KO Rag-/-** type juvenile mouse (the KO partially impairs their ability to produce T-cell progenitors in normal amounts). In this test case, **donor** T-cells progenitors (DN3) were replaced by **host** cells **9 weeks** after transplantation, showing that the donor DN3 cells outlived their normal lifespan by ~6 weeks. ![](paiva_wt_ko.png) You will mainly work on the **KO sample : `TD3A`**. The input data consists in a **count matrix**, as a gzipped tabular text file, that contains everything needed to create a basic Seurat object : * The expressions **counts** * The **feature names** (here, gene symbols) * The **barcode names** This matrix has already been **filtered for empty droplets**. # Environment We load the packages of interest: ```{r packages, message=FALSE, warning=FALSE} # packages library(ggplot2) library(Seurat) ``` # Data We load the count matrix: ```{r mat_load} # mat_load ## Loading the matrix directly as a sparseMatrix scmat <- scuttle::readSparseCounts( file = scmat_file, sep = "\t") ## Displaying its size in-memory (this is a basic matrix) format(utils::object.size(scmat), units = "auto") ``` ```{r mat_desc} # mat_desc ## A basic description of the matrix dim(scmat) scmat[c(1:5), c(1:5)] ``` We want to build a Seurat object from the count matrix. But how do we do ??

```{r mat2seurat, class.source = c("fold-hide", "answer"), eval = FALSE} # mat2seurat ?Seurat::CreateSeuratObject() ```

```{r make_sobj} # make_sobj sobj = Seurat::CreateSeuratObject(counts = scmat, assay = "RNA", project = "TD3A") sobj ``` We do not need the count matrix anymore : ```{r rm_mat} # rm_mat rm(scmat) ``` We set the filtering thresholds based on quality control-related metrics. Adjust them as necessary based on the figures. ```{r set_thresholds} # set_thresholds cut_nCount_RNA = 1000 cut_log10_nCount_RNA = log10(cut_nCount_RNA) cut_nFeature_RNA = 750 cut_percent_mt = 5 cut_percent_rb = 20 cut_percent_st = 6 ``` We define a nice palette to visualize the QC metrics: ```{r color_palette} # color_palette color_palette = c("lightgray", "#FDBB84", "#EF6548", "#7F0000", "black") ``` For the QC metrics related to the proportion of UMI belongs to a specific gene sets, we need to load the gene sets. ```{r load_lists} # load_lists ## MITO mito_symbols = readRDS(mito_file) mito_symbols ## RIBO ribo_symbols = readRDS(ribo_file) ribo_symbols ## MECHANICAL STRESS stress_symbols = readRDS(stress_file) stress_symbols ``` We keep only the gene symbols available in our data : ```{r subset_symbols} # subset_symbols mito_symbols = intersect(mito_symbols, rownames(sobj)) mito_symbols ribo_symbols = intersect(ribo_symbols, rownames(sobj)) ribo_symbols stress_symbols = intersect(stress_symbols, rownames(sobj)) stress_symbols ``` **Note**: A more robust analysis may use the gene _identifiers_ (like EntrezID) rather than gene _symbols_. # Quality controls ## Compute metrics What is already available in the Seurat object ? ```{r see_metadata} # see_metadata head(sobj@meta.data) ``` How do the two first QC metrics vary ? ```{r summary_metadata} # summary_metadata summary(sobj@meta.data) ``` In the column `nCount_RNA`, the maximum is far from the third quartile. For visualization purpose, we transform this column to log10 scale. ```{r log10_ncount_RNA} # log10_ncount_RNA sobj$log10_nCount_RNA = log10(sobj$nCount_RNA) summary(sobj@meta.data) ``` We compute the percentage of UMI related to each of the three list of genes. First, we compute the proportion of transcripts related to mitochondrial genes, per cell. ```{r percent_mt} # percent_mt sobj = Seurat::PercentageFeatureSet( object = sobj, assay = "RNA", features = mito_symbols, col.name = "percent_mt") summary(sobj$percent_mt) ``` Then, we compute the proportion of transcripts related to riboprotein-coding genes, per cell: ```{r percent_rb} # percent_rb sobj = Seurat::PercentageFeatureSet( object = sobj, assay = "RNA", features = ribo_symbols, col.name = "percent_rb") summary(sobj$percent_rb) ``` Finally, we compute the proportion of transcripts related to mechanical stress signature, per cell : ```{r percent_st} # percent_st sobj = Seurat::PercentageFeatureSet( sobj, assay = "RNA", features = stress_symbols, col.name = "percent_st") summary(sobj$percent_st) ``` We now have all the QC-related metrics: ```{r final_summary} # final_summary summary(sobj@meta.data) ``` ## Failing cells We identify the cells that do not pass the quality controls. This will be used for the visualization and later for cells filtering. If the filtering thresholds are modified, do not forget to run again this chunk. ```{r fail_cells} # fail_cells fail_percent_mt <- rownames(sobj@meta.data)[sobj$percent_mt > cut_percent_mt] fail_percent_rb <- rownames(sobj@meta.data)[sobj$percent_rb > cut_percent_rb] fail_percent_st <- rownames(sobj@meta.data)[sobj$percent_st > cut_percent_st] fail_nFeature_RNA <- rownames(sobj@meta.data)[sobj$nFeature_RNA < cut_nFeature_RNA] fail_nCount_RNA <- rownames(sobj@meta.data)[sobj$nCount_RNA < cut_nCount_RNA] ``` ## Visualization This is difficult to handle the distribution of these metrics across cells. We opt for various visualization ways: * **histogram**, showing the distribution of the metric * **violin plot**, showing the distribution of the metric, useful if several datasets are considered * **UMAP** (or alternatively, tSNE), showing the distribution of the metric over a 2D projection of cells You may choose one of these visualization ways. To generate the UMAP 2D projection, we need to run multiple `Seurat` commands in a row. Understanding these commands is not the purpose of the current course, but will be detailed in the next few ones. So just run : ```{r quick_processing, message=FALSE, warning=FALSE} # quick_processing sobj_tmp = Seurat::NormalizeData(sobj, verbose = FALSE) sobj_tmp = Seurat::ScaleData(sobj_tmp, verbose = FALSE) sobj_tmp = Seurat::FindVariableFeatures(sobj_tmp, verbose = FALSE) sobj_tmp = Seurat::RunPCA(sobj_tmp, npcs = 21, verbose = FALSE) sobj_tmp = Seurat::RunUMAP(sobj_tmp, dims = c(1:20), verbose = FALSE) ``` We can now visualize the cells on a 2D projection: ```{r visualize_cells, fig.width = 8, fig.height = 8} # visualize_cells Seurat::DimPlot(object = sobj_tmp, reduction = "umap") + Seurat::DarkTheme() ``` ### Define a R function We want to visualize our different metrics, but using the same kind of plots. Instead of copying-pasting and editing the same code multiple times (this is time-consuming and error-prone), we design a **function** using the template below: ```{r my_function_name, eval = FALSE} # my_function_name my_function_name = function(param1, param2) { # do something with the parameter values output = "something" return(output) } ``` Here is the function we will use : ```{r qc_print_function} # qc_print_function print_1_qc_metric = function(object = sobj, qc = "log10_nCount_RNA", cut_qc = cut_log10_nCount_RNA, failing_cells = fail_nCount_RNA) { # Description of the parameters: # - sobj : the Seurat object, with default value to the one # - qc : CHARACTER : the QC metric, must be a column in sobj@meta.data # - cut_qc : NUMERIC : the filtering threshold for the QC metric # - failing_cells : CHARACTER VECTOR : the cells that fail the QC # Histogram p_hist = ggplot(object@meta.data, aes(x = .data[[qc]])) + geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "#F8766D", bins = 100) + geom_density(alpha = 0, col = "blue", lwd = 0.75) + geom_vline(xintercept = cut_qc, col = "red") + labs(title = paste0("Threshold for ", qc, " is: ", cut_qc)) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) # Violin plot p_violin = Seurat::VlnPlot(object, features = qc) + geom_hline(yintercept = cut_qc, col = "red") + theme(axis.title.x = element_blank(), legend.position = "none") # Feature plot p_umap = Seurat::FeaturePlot(object, reduction = "umap", features = qc) + scale_color_gradientn(colors = color_palette) + theme(aspect.ratio = 1) # Dim plot object$failorpass = as.factor( ifelse(test = colnames(object) %in% failing_cells, yes = "fail", no = "pass") ) p_fail = Seurat::DimPlot(object, group.by = "failorpass", order = "fail") + scale_color_manual(values = c("#F8766D", "gray80"), breaks = levels(object$failorpass)) + labs(title = qc, subtitle = paste0(length(failing_cells), " cells fail (", round(100*length(failing_cells)/ncol(sobj), 2), " %)")) + theme(aspect.ratio = 1, plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) ## If we applied the filter objectf = subset(object, cells = colnames(object)[object$failorpass == 'pass']) objectf = Seurat::ScaleData(objectf, verbose = FALSE) objectf = Seurat::FindVariableFeatures(objectf, verbose = FALSE) objectf = Seurat::RunPCA(objectf, npcs = 21, verbose = FALSE) objectf = Seurat::RunUMAP(objectf, dims = c(1:20), verbose = FALSE) p_umapf = Seurat::FeaturePlot(object = objectf, reduction = "umap", features = qc) + scale_color_gradientn(colors = color_palette) + theme(aspect.ratio = 1) # Patchwork p = patchwork::wrap_plots(p_umap, p_fail, p_hist, p_violin, p_umapf) + patchwork::plot_layout(nrow = 1, widths = c(1, 1, 2, 1, 1)) return(p) } ``` ### Number of UMI #### Check the function This function should directly work for `log10_nCount_RNA`, as it is set for this metric by default (`qc = "log10_nCount_RNA"`) : ```{r check1_log10_nCount_RNA, fig.width = 18, fig.height = 4} # check1_log10_nCount_RNA print_1_qc_metric(object = sobj_tmp) ``` Of course, it will work the same by expliciting all of the parameters value: ```{r check2_log10_nCount_RNA, fig.width = 18, fig.height = 4} # check2_log10_nCount_RNA print_1_qc_metric(object = sobj_tmp, qc = "log10_nCount_RNA", cut_qc = cut_log10_nCount_RNA, failing_cells = fail_nCount_RNA) ``` So we can copy-paste only this chunk for the next QC metrics ! ### Number of expressed genes ```{r see_nFeature_RNA, fig.width = 18, fig.height = 4} # see_nFeature_RNA print_1_qc_metric(object = sobj_tmp, qc = "nFeature_RNA", cut_qc = cut_nFeature_RNA, failing_cells = fail_nFeature_RNA) ``` ### Mitochondrial genes expression ```{r see_percent_mt, fig.width = 18, fig.height = 4} # see_percent_mt print_1_qc_metric(object = sobj_tmp, qc = "percent_mt", cut_qc = cut_percent_mt, failing_cells = fail_percent_mt) ``` ### Riboprotein-coding genes expression ```{r see_percent_rb, fig.width = 18, fig.height = 4} # see_percent_rb print_1_qc_metric(object = sobj_tmp, qc = "percent_rb", cut_qc = cut_percent_rb, failing_cells = fail_percent_rb) ``` ### Mechanical stress genes expression ```{r see_percent_st, fig.width = 18, fig.height = 4} # see_percent_st print_1_qc_metric(object = sobj_tmp, qc = "percent_st", cut_qc = cut_percent_st, failing_cells = fail_percent_st) ``` As we have finished with the visualization, we can discard our temporary object : ```{r rm_sobj_tmp} # rm_sobj_tmp rm(sobj_tmp) ``` ### All metrics effect We can visualize metrics filtering effect as an **upset-plot** ```{r bcupset, class.source="notrun", class.output="notruno"} # bcupset ## Create a list of all cells filtered OUT for each criterion up_list <-list( "nFeature" = fail_nFeature_RNA, "nCount" = fail_nCount_RNA, "%MITO" = fail_percent_mt, "%RIBO" = fail_percent_rb, "%STRESS" = fail_percent_st) ## Create an upset-plot UpSetR::upset(data = UpSetR::fromList(up_list), nintersects = NA, sets = rev(names(up_list)), keep.order = TRUE, order.by = "freq") ``` # Filtering ? We could filter out cells based on these 5 QC metrics now. It is also possible to wait, perform various annotations such as cell type annotation or cell cycle phase scoring, to better characterize the low quality cells. To filter a Seurat object, we use the `Seurat::subset()` function: ```{r filter_sobj} # filter_sobj sobj_filtered = subset(sobj, invert = TRUE, cells = unique(c(fail_nCount_RNA, fail_nFeature_RNA, fail_percent_mt, # fail_percent_rb, actually we decided not to filter on this parameter fail_percent_st ))) sobj_filtered ``` We are **not going to save** this filtered Seurat object, but we want to store in our `sobj` Seurat object the cells that failed the QC as a new cell metadata : ```{r add_fail_qc} # add_fail_qc sobj$fail_qc = ifelse(test = colnames(sobj) %in% colnames(sobj_filtered), yes = "pass", no = "fail") table(sobj$fail_qc) ``` # Save We save the **unfiltered** Seurat object: ```{r saverds1, fold.output = FALSE} # saverds1 ## Save our Seurat object (rich naming) out_name <- paste0( output_dir, "/", paste( c("02", Seurat::Project(sobj), "S5", "Metrics", paste( dim(sobj), collapse = '.' ) ), collapse = "_"), ".RDS") ## Check print(out_name) ## Write on disk saveRDS(object = sobj, file = out_name) ``` This Seurat object would then be loaded as the input for further analyses. # R session This is a good practice to show the version of the packages used in this notebook. ```{r rsession, class.source = "fold-hide"} # rsession sessionInfo() ```