This practical session is designed to help you understand the importance of normalization, scaling, and regression in single-cell RNA-seq analysis. You will apply these steps using different methods and visualize the effects on the data.
Input data: Seurat object containing the filtered
count matrix from the previous class. It’s called
05_TD3A_S5_Doublets.Filtered_12508.4035.RDS
.
Output data: Seurat object after the normalization,
scaling and regression. It’s called
sobj_TD3A_normalized.rds
.
Up to this point, we have filtered out low-quality cells, ambient RNA contamination, and doublets. The data is now a count matrix with cells as columns and genes as rows. These counts reflect the capture, reverse transcription, and sequencing steps of the scRNA-seq experiment, each introducing variability. This means the observed differences in gene expression may be influenced by sampling noise rather than true biological differences, resulting in uneven variance across the dataset.
Normalization addresses this by adjusting raw counts to minimize the effects of variable sampling, making the data more suitable for analysis, which often assumes uniform variance. Various normalization techniques exist, each tailored to support different downstream analyses.
A recent study by Ahlmann-Eltze and Huber (2023) benchmarked 22 normalization methods. It’s essential to select the normalization method based on the specific analysis goals. However, for this practical we will just explore the most well-known way to doing single-cell normalization.
We first load the packages that are needed :
library(Seurat)
library(ggplot2)
library(patchwork)
library(scater)
library(dplyr)
We then import the data that have been previously filtered (reminder : we removed low quality cells, removed ambient RNA and scored doublets) :
sobj <- readRDS("input/05_TD3A_S5_Doublets.Filtered_12508.4035.RDS")
sobj
## An object of class Seurat
## 12508 features across 4035 samples within 1 assay
## Active assay: RNA (12508 features, 0 variable features)
## 2 layers present: counts, data
For the sake of this practical session, we will use a smaller dataset
which contains only 1000 cells. In order to do this, we can use the
Seurat objet as a dataframe, where rows are genes, and columns are
cells. In R language, this corresponds to :
sobj[genes,cells]
.
We will take the first 1000 cells, by using the range
1:1000
, like this :
# Select the cells to keep
sobj_tiny <- sobj[,1:1000]
# Check that the object contains 1000 cells
sobj_tiny
## An object of class Seurat
## 12508 features across 1000 samples within 1 assay
## Active assay: RNA (12508 features, 0 variable features)
## 2 layers present: counts, data
NormalizeData
functionLet’s start by exploring the NormalizeData
function.
The input is your Seurat object (for us : sobj_tiny
)
Can you spot which is the default one ?
There are 3 normalization approaches included in Seurat : LogNormalize, CLR and RC. The default one is LogNormalize.
NormalizeData
function on the
sobj_tiny
You can now explore your new object : where are the normalized values stored ? Can you print the values of the first 4 genes in the 10 first cells ?
sobj_tiny <- NormalizeData(sobj_tiny)
# The normalized values are stored in the slot 'data' :
sobj_tiny@assays$RNA$data[1:4, 1:10]
## 4 x 10 sparse Matrix of class "dgCMatrix"
##
## Mrpl15 . 2.365796 . . . . . 0.6103322
## Lypla1 . . . 1.844929 . 1.627274 1.442936 0.6103322
## Gm37988 . . . . . . . .
## Tcea1 . . 1.781829 . 1.708152 1.627274 2.010388 .
##
## Mrpl15 . .
## Lypla1 . 1.052803
## Gm37988 . .
## Tcea1 1.701157 1.052803
# You can compare it to the slot 'count' (raw count) :
sobj_tiny@assays$RNA$count[1:4, 1:10]
## 4 x 10 sparse Matrix of class "dgCMatrix"
##
## Mrpl15 . 2 . . . . . 1 . .
## Lypla1 . . . 1 . 1 1 1 . 1
## Gm37988 . . . . . . . . . .
## Tcea1 . . 1 . 1 1 2 . 1 1
LogNormalize
LogNormalize
method or “scaling normalization” is the
simplest and most commonly used class of normalization approaches. This
involves dividing all counts for each cell by a cell-specific scaling
factor, often called a “size factor” (Anders and Huber 2010). The
assumption here is that any cell-specific bias (e.g., in capture or
amplification efficiency) affects all genes equally via scaling of the
expected mean count for that cell. The size factor for each cell
represents the estimate of the relative bias in that cell, so division
of its counts by its size factor should remove that bias. The resulting
“normalized expression values” can then be used for downstream analyses
such as clustering and dimensionality reduction. (From https://bioconductor.org/books/3.14/OSCA.basic/normalization.html)
# Apply LogNormalize
sobj_tiny <- NormalizeData(sobj_tiny,
normalization.method = "LogNormalize",
scale.factor = 10000)
We can now visualize the effect of LogNormalize
with
some violin plots :
VlnPlot(sobj_tiny,
features = c("Srm", "Apoe", "Top2a"),
layer = "data")
Violin plots are sometimes hard to read. Here is a function to build box plots out of Seurat objects :
BoxPlotFromSeurat <- function(seurat_obj, features, log = TRUE, layer = "data", plot.title = "Box Plot") {
# Extract the data for the features of interest
df <- Seurat::FetchData(seurat_obj, vars = features, layer = layer)
# Add metadata for grouping if group.by is specified
df <- cbind(df, seurat_obj[[]])
# Reshape the data for ggplot2
df_long <- tidyr::gather(df, key = "feature", value = "expression", group)
df_long <- df %>% group_by_(group.by) %>%
tidyr::pivot_longer(cols = features)
# Create box plots using ggplot2
if (log == TRUE) {
p <- ggplot(df_long, ggplot2::aes(x = name, y = log10(value))) +
geom_boxplot(outlier.shape = NA) +
theme_minimal() +
labs(x = "Feature", y = "Expression Level (log10)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
} else {
p <- ggplot(df_long, ggplot2::aes(x = name, y = value)) +
geom_boxplot(outlier.shape = NA) +
theme_minimal() +
labs(x = "Feature", y = "Expression Level") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
return(p)
}
# Example usage
# BoxPlotFromSeurat(sobj_tiny, features = c("Srm", "Apoe", "Top2a"))
You can try also with the boxplots.
VlnPlot(sobj_tiny,
features = c("Srm", "Apoe", "Top2a"),
layer = "count")