---
title: "
EB3I n1 2025 scRNAseq
-
PROCESSING (I)
-
Normalization & scaling"
date: "2025-16-21.22"
author:
- name: "EB3I n1 scRNAseq Team"
- name: "Lilia YOUNSI"
email: "lilia.younsi@inserm.fr"
- name: "Sebastien MELLA"
email: "sebastien.mella@pasteur.fr"
output:
rmdformats::readthedown:
fig_width: 8
fig_height: 6
highlight: tango ## Theme for the code chunks
embed_fonts: TRUE
number_sections: true ## Adds number to headers (sections)
theme: flatly ## CSS theme for the HTML page
collapsed: true ## By default, the TOC is folded
toc_depth: 3
smooth_scroll: true ## Smooth scroll of the HTML page
self_contained: true ## Includes all plots/images within the HTML
code_download: true ## Adds a button to download the Rmd
code_folding: show
thumbnails: false
lightbox: true
fig_caption: false
gallery: true
use_bookdown: true
always_allow_html: true ## Allow plain HTML code in the Rmd
editor_options:
markdown:
wrap: 72
---
```{r knit_setup, echo = FALSE}
knitr::opts_chunk$set(
echo = TRUE, # Print the code
eval = TRUE, # Run command lines
message = FALSE, # Print messages
prompt = FALSE, # Do not display prompt
comment = NA, # No comments on this section
warning = FALSE, # Display warnings
tidy = FALSE,
fig.align="center",
# results = 'hide',
width = 100 # Number of characters per line
)
```
```{css, echo=FALSE}
.notrun {
background-color: lightgrey !important;
border: 3px solid black !important;
}
.notruno {
background-color: lightgrey !important;
color : black !important;
}
.question {
background-color: aquamarine !important;
color : black !important;
border: 3px solid limegreen !important;
}
.questiono {
background-color: aquamarine !important;
color : black !important;
}
.answer {
background-color: navajowhite !important;
border: 3px solid brown !important;
}
.answero {
background-color: navajowhite !important;
color : black !important;
}
.beyond {
background-color: violet !important;
border: 3px solid purple !important;
}
.beyondo {
background-color: violet !important;
color : black !important;
}
```
```{r knit_hook, echo = FALSE}
hooks = knitr::knit_hooks$get()
hook_foldable = function(type) {
force(type)
function(x, options) {
res = hooks[[type]](x, options)
if (isFALSE(options[[paste0("fold.", type)]])) return(res)
paste0(
"Show ", type, "
\n\n",
res,
"\n\n "
)
}
}
knitr::knit_hooks$set(
output = hook_foldable("output"),
plot = hook_foldable("plot")
)
```
------------------------------------------------------------------------
------------------------------------------------------------------------

------------------------------------------------------------------------
------------------------------------------------------------------------
# PREAMBLE
## Purpose of this session
This practical session is designed to help you understand the importance of normalization, scaling, and regression in single-cell RNA-seq analysis. You will apply these steps using different methods and visualize the effects on the data.
**Input data**: Seurat object containing the filtered count matrix from the previous class. It's called `05_TD3A_S5_Doublets.Filtered_12508.4035.RDS`.
**Output data**: Seurat object after the normalization, scaling and regression. It's called `sobj_TD3A_normalized.rds`.
## Learning Objectives
- Discuss why normalizing counts is necessary to compare cells
- Be able to understand different normalization approaches
- Be able to decide when to regress out a given variable, by evaluating the effects from any unwanted sources of variation
## Motivation
Up to this point, we have filtered out empty droplets, ambient RNA contamination, low-quality cells, and doublets.
The data is now a count matrix with cells as rows and genes as columns. These counts reflect the capture, reverse transcription, and sequencing steps of the scRNA-seq experiment, each introducing variability. This means the observed differences in gene expression may be influenced by sampling noise rather than true biological differences, resulting in uneven variance across the dataset.
Normalization addresses this by adjusting raw counts to minimize the effects of variable sampling, making the data more suitable for analysis, which often assumes uniform variance. Various normalization techniques exist, each tailored to support different downstream analyses.
A recent study by Ahlmann-Eltze and Huber (2023) benchmarked 22 normalization methods. It's essential to select the normalization method based on the specific analysis goals. However, for this practical we will just explore the most well-known way to doing single-cell normalization.
## Setup
We first load the packages that are needed :
```{r load-libraries, message=FALSE, warning=FALSE}
library(Seurat)
library(ggplot2)
library(patchwork)
library(scater)
# library(dplyr)
library(reshape2)
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Start Rstudio
- Using the [OpenOnDemand cheat
sheet](https://moodle.france-bioinformatique.fr/pluginfile.php/1475/mod_folder/content/0/OoD_R_Rstudio.html){target="_blank"},
connect to the [OpenOnDemand
portal](https://ondemand.cluster.france-bioinformatique.fr){target="_blank"} and **create a Rstudio session** with the right resource requirements, thanks
to the cheat sheet.
------------------------------------------------------------------------
------------------------------------------------------------------------
# Warm-up
- We now set **common parameters** as new variables, once and for all for this
session :
```{r setparam}
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))
## Seed for the RNG
my_seed <- 1337L
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Prepare the data structure
## Main directory
```{r maindir}
# maindir
## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Creating the root directory
# dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)
```
## Current session
```{r sessiondir}
# sessiondir
## Creating the session (Proc.1) directory
session_dir <- paste0(TD_dir, "/04_Proc.1")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)
```
## Input directory
```{r indir}
# indir
## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)
```
## Output directory
```{r outdir}
# outdir
## Creating the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)
## Print the output directory on-screen
print(output_dir)
```
# Reload the Seurat Object
- We can reload the object we saved at the former step
```{r dataload}
# dataload
## This is the path to the current EB3I backup
sessionid <- '2538_eb3i_n1_2025'
## The latest Seurat object saved as RDS (name)
sobj_file <- "05_TD3A_S5_Doublets.Filtered_12508.4035.RDS"
## The latest Seurat object saved as RDS (full path)
sobj_path <- paste0(TD_dir,
"/03_Preproc.3/RESULTS/",
sobj_file)
force <- FALSE ## To force a re-download of a Zenodo-hosted backup
local <- FALSE ## To force a loading from a local backup
## In case of error/lost data : force a reload from a Zenodo backup repository
if(force) {
zen_id <- "14035293"
zen_backup_file <- paste0("https://zenodo.org/records/",
zen_id,
"/files/",
sobj_file)
## Recreate the expected path if it does not exist
dir.create(path = dirname(sobj_path), recursive = TRUE)
## Download the file
download.file(url = zen_backup_file,
destfile = sobj_path)
}
## In case of error/lost data : force a reload from a local backup repository
if(local) {
sobj_path <- paste0(
"/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/RDS/",
sobj_file)
}
## Load the object
sobj <- readRDS(file = sobj_path)
```
------------------------------------------------------------------------
------------------------------------------------------------------------
## Exploring the object
```{r printout_sobj}
sobj
```
What's in the metadata ?
```{r explore_metadata}
head(sobj@meta.data)
```
What are the assays ?
```{r explore_assays}
Seurat::Assays(object = sobj)
```
What's a layer ? How to know how many there are ?
```{r explore_rna_layers}
Layers(object = sobj, assay = "RNA")
```
Display first 10 row and first 10 columns of raw matrice
```{r explore_count_mat}
sobj@assays$RNA@layers$counts[1:10, 1:10]
```
If there is another matrix, display its first 10 row and first 10 columns
```{r, explore_data_mat}
sobj@assays$RNA@layers$data[1:10, 1:10]
```
### Assessing of the level of data sparcity
```{r extract raw_counts}
raw_count <- GetAssayData(object = sobj, assay = "RNA", layer = "counts")
```
Can you spot the difference between `sobj@assays$RNA@layers$data[1:10, 1:10]` and `raw_count[1:10, 1:10]`
```{r explore raw_counts}
raw_count[1:10, 1:10]
```
```{r proptable_sparce}
prop.table(
table(as.matrix(raw_count)>0)
)*100
```
```{r mean_sparse}
mean(as.matrix(raw_count)>0)*100
```
Extracting genes attributes
```{r gene_attr_df}
gene_attr <- data.frame("mean" = rowMeans(raw_count),
"detection_rate" = rowMeans(raw_count > 0),
"var" = apply(raw_count, 1, var))
gene_attr$log_mean <- log10(gene_attr$mean)
gene_attr$log_var <- log10(gene_attr$var)
```
Extracting cell attributes
```{r cell_attr_df}
cell_attr <- data.frame(n_umi = colSums(raw_count),
n_gene = colSums(raw_count >0))
```
Mean-Variance relationship
```{r varBYmeanPlot}
ggplot(gene_attr, aes(x = mean, y = var))+
geom_point()
```
Can we do better ??
```{r}
ggplot(gene_attr, aes(x = log_mean, y = log_var)) +
geom_point(alpha = 0.3, shape = 16) +
geom_abline(intercept = 0, slope = 1, color = "red")+
geom_vline(xintercept = log10(mean(gene_attr$mean)), linetype = "dashed")
```
Other types of relevant information to look at
```{r}
summary(sobj$nCount_RNA)
summary(sobj$nFeature_RNA)
summary(sobj$percent_mt*100)
```
Relationship between nUMI & number of detected genes
```{r nGenes_nUMIseurat}
FeatureScatter(
object = sobj,
feature1 = "nCount_RNA",
feature2 = "nFeature_RNA",
cols = "grey10"
)
```
```{r nGenes_nUMIggplot}
ggplot(sobj@meta.data,
aes(x = nCount_RNA, y = nFeature_RNA))+
geom_point()
```
```{r mito_nUMI}
FeatureScatter(
object = sobj,
feature1 = "nCount_RNA",
feature2 = "percent_mt",
cols = "grey10"
)
```
## Normalize data: Shifted logarithm method
```{r lognorm}
sobj <- NormalizeData(object = sobj,
normalization.method = "LogNormalize",
scale.factor = 1e4)
```
Visualize the effect of the normalization
```{r extract lognormmat}
logNorm_count <- GetAssayData(object = sobj, assay = "RNA", layer = "data")
```
```{r}
SumCount_df <- data.frame(
"raw" = colSums(raw_count),
"logNorm" = colSums(logNorm_count)
)
```
```{r, fig.width=12, fig.height=4}
(ggplot(SumCount_df, aes(raw))+
geom_density()+
ggtitle("Raw (i.e not normalized) data")+
theme(plot.title = element_text(face = "bold", hjust = .5, color = "steelblue")))+ (ggplot(SumCount_df, aes(logNorm))+
geom_density()+
ggtitle("Lognorm data")+
theme(plot.title = element_text(face = "bold", hjust = .5, color = "steelblue"))
)
```
Effect of the normalization on Expression of genes
```{r, fig.width=12, fig.height=6}
raw_exp <- VlnPlot(
object = sobj,
features = c("Srm", "Apoe", "Top2a"),
assay = "RNA",
layer = "counts"
)
norm_exp <- VlnPlot(
object = sobj,
features = c("Srm", "Apoe", "Top2a"),
assay = "RNA",
layer = "data"
)
wrap_plots(raw_exp,norm_exp) + plot_layout(nrow = 2)
```
```{r}
logNorm_exp_3genes <- FetchData(object = sobj, vars = c("Srm", "Apoe", "Top2a"), layer = "data")
logNorm_exp_3genes_l <- reshape2::melt(logNorm_exp_3genes)
```
```{r}
raw_exp_3genes <- FetchData(object = sobj, vars = c("Srm", "Apoe", "Top2a"), layer = "counts")
raw_exp_3genes_l <- reshape2::melt(raw_exp_3genes)
```
```{r, fig.width=12, fig.height=4}
(ggplot(data = raw_exp_3genes_l, aes(x = variable, y = value))+
geom_jitter(height = 0, width = .2, size = .1)+
geom_boxplot(
data = raw_exp_3genes_l[raw_exp_3genes_l$value>0,],
outliers = F,
size = .5,
width = .1)+
ggtitle("Raw Expression")+
theme(plot.title = element_text(size = 16, color = "grey50", face = "bold", hjust = .5))) + (ggplot(data = logNorm_exp_3genes_l, aes(x = variable, y = value))+
geom_jitter(height = 0, width = .2, size = .1)+
geom_boxplot(
data = logNorm_exp_3genes_l[logNorm_exp_3genes_l$value>0,],
outliers = F,
size = .5,
width = .1)+
ggtitle("logNorm Expression")+
theme(plot.title = element_text(size = 16, color = "grey50", face = "bold", hjust = .5)))
```
## Normalize data II: What's the name already ???
```{r sctransform}
sobj <- SCTransform(object = sobj)
```
How this normalization affect the expression of the 3 genes ??
```{r}
sct_exp <- VlnPlot(
object = sobj,
features = c("Srm", "Apoe", "Top2a"),
assay = "SCT",
layer = "data"
)
```
```{r, fig.width=12, fig.height=12}
wrap_plots(raw_exp,norm_exp,sct_exp) + plot_layout(nrow = 3)
```
```{r}
SCT_exp_3genes <- FetchData(object = sobj, vars = c("Srm", "Apoe", "Top2a"), assay = "SCT", layer = "data")
SCT_exp_3genes_l <- reshape2::melt(SCT_exp_3genes)
```
```{r, fig.width=12, fig.height=4}
(ggplot(data = raw_exp_3genes_l, aes(x = variable, y = value))+
geom_jitter(height = 0, width = .2, size = .1)+
geom_boxplot(
data = raw_exp_3genes_l[raw_exp_3genes_l$value>0,],
outliers = F,
size = .5,
width = .1)+
ggtitle("Raw Expression")+
theme(plot.title = element_text(size = 16, color = "grey50", face = "bold", hjust = .5))) + (ggplot(data = logNorm_exp_3genes_l, aes(x = variable, y = value))+
geom_jitter(height = 0, width = .2, size = .1)+
geom_boxplot(
data = logNorm_exp_3genes_l[logNorm_exp_3genes_l$value>0,],
outliers = F,
size = .5,
width = .1)+
ggtitle("logNorm Expression")+
theme(plot.title = element_text(size = 16, color = "grey50", face = "bold", hjust = .5))) + (ggplot(data = SCT_exp_3genes_l, aes(x = variable, y = value))+
geom_jitter(height = 0, width = .2, size = .1)+
geom_boxplot(
data = SCT_exp_3genes_l[SCT_exp_3genes_l$value>0,],
outliers = F,
size = .5,
width = .1)+
ggtitle("SCT Expression")+
theme(plot.title = element_text(size = 16, color = "grey50", face = "bold", hjust = .5)))
```
Compare the Layers available in the different "Assays" available
```{r}
Layers(sobj, assay = "SCT")
```
```{r}
Layers(sobj, assay = "RNA")
```
What's missing ?
```{r}
sobj <- ScaleData(object = sobj, assay = "RNA")
```
```{r}
Layers(sobj, assay = "RNA")
```
```{r}
Reductions(object = sobj)
```
```{r findrnaHVG}
sobj <- FindVariableFeatures(object = sobj, assay = "RNA")
```
Compare the variance-stabilization for the 2 methods
```{r}
rna_hvg <- VariableFeaturePlot(
object = sobj,
assay = "RNA"
)
sct_hvg <- VariableFeaturePlot(
object = sobj,
assay = "SCT"
)
```
```{r, fig.width=12, fig.height=4}
wrap_plots(rna_hvg,sct_hvg)+plot_layout(ncol = 2)
```
```{r}
rna_hvg+ylim(c(0,10))
```
```{r}
sct_hvg+ylim(c(0,10))
```
Comparing the 2 normalization methods at the PCA level
```{r compute PCA}
sobj <- RunPCA(object = sobj, assay = "RNA", reduction.name = "pcaRNA", reduction.key = "PCrna_")
sobj <- RunPCA(object = sobj, assay = "SCT", reduction.name = "pcaSCT", reduction.key = "PCsct_")
```
```{r rnageneLoad_PC}
print(sobj[["pcaRNA"]], dims = 1:5, nfeatures = 30)
```
```{r sctgeneLoad_PC}
print(sobj[["pcaSCT"]], dims = 1:5, nfeatures = 30)
```
```{r display dimplots, fig.width=12, fig.height=6}
DimPlot(object = sobj, reduction = "pcaRNA")+DimPlot(object = sobj, reduction = "pcaSCT")
```
```{r, fig.width=12, fig.height=6}
FeaturePlot(object = sobj, reduction = "pcaRNA", features = "nCount_RNA")
FeaturePlot(object = sobj, reduction = "pcaSCT", features = "nCount_RNA")
```
```{r, fig.width=12, fig.height=5}
FeatureScatter(object = sobj,
feature1 = "PCrna_1",
feature2 = "nCount_RNA") + FeatureScatter(object = sobj,
feature1 = "PCsct_1",
feature2 = "nCount_RNA")
```
```{r, fig.width=12, fig.height=5}
FeatureScatter(object = sobj,
feature1 = "PCrna_1",
feature2 = "CC_Seurat_S.Score") + FeatureScatter(object = sobj,
feature1 = "PCsct_1",
feature2 = "CC_Seurat_S.Score")
```
```{r, fig.width=12, fig.height=6}
DimPlot(object = sobj,
reduction = "pcaRNA",
group.by = "CC_Seurat_Phase")+DimPlot(object = sobj,
reduction = "pcaSCT",
group.by = "CC_Seurat_Phase")
```
## Regressing out unwanted source of variation
```{r regOutccg}
sobj_ccg <- ScaleData(object = sobj,
vars.to.regress = c("CC_Seurat_S.Score", "CC_Seurat_G2M.Score"),
assay = "RNA")
sobj_ccg <- RunPCA(object = sobj_ccg, assay = "RNA", reduction.name = "pcaRNAccg", reduction.key = "PCrnaccg_")
```
```{r, fig.width=12, fig.height=6}
DimPlot(object = sobj,
reduction = "pcaRNA",
group.by = "CC_Seurat_Phase")+DimPlot(object = sobj,
reduction = "pcaSCT",
group.by = "CC_Seurat_Phase")+DimPlot(object = sobj_ccg,
reduction = "pcaRNAccg",
group.by = "CC_Seurat_Phase")
```
```{r}
print(sobj_ccg[["pcaRNAccg"]], dims = 1:5, nfeatures = 30)
```
# R session
This is a good practice to show the version of the packages used in this notebook.
```{r rsession, class.source = "fold-hide"}
# rsession
sessionInfo()
```