This practical session is designed to help you understand the importance of normalization, scaling, and regression in single-cell RNA-seq analysis. You will apply these steps using different methods and visualize the effects on the data.
Input data: Seurat object containing the filtered
count matrix from the previous class. It’s called
05_TD3A_S5_Doublets.Filtered_12508.4035.RDS
.
Output data: Seurat object after the normalization,
scaling and regression. It’s called
sobj_TD3A_normalized.rds
.
Up to this point, we have filtered out low-quality cells, ambient RNA contamination, and doublets. The data is now a count matrix with cells as columns and genes as rows. These counts reflect the capture, reverse transcription, and sequencing steps of the scRNA-seq experiment, each introducing variability. This means the observed differences in gene expression may be influenced by sampling noise rather than true biological differences, resulting in uneven variance across the dataset.
Normalization addresses this by adjusting raw counts to minimize the effects of variable sampling, making the data more suitable for analysis, which often assumes uniform variance. Various normalization techniques exist, each tailored to support different downstream analyses.
A recent study by Ahlmann-Eltze and Huber (2023) benchmarked 22 normalization methods. It’s essential to select the normalization method based on the specific analysis goals. However, for this practical we will just explore the most well-known way to doing single-cell normalization.
We first load the packages that are needed :
library(Seurat)
library(ggplot2)
library(patchwork)
library(scater)
library(dplyr)
We then import the data that have been previously filtered (reminder : we removed low quality cells, removed ambient RNA and scored doublets) :
sobj <- readRDS("input/05_TD3A_S5_Doublets.Filtered_12508.4035.RDS")
sobj
## An object of class Seurat
## 12508 features across 4035 samples within 1 assay
## Active assay: RNA (12508 features, 0 variable features)
## 2 layers present: counts, data
For the sake of this practical session, we will use a smaller dataset
which contains only 1000 cells. In order to do this, we can use the
Seurat objet as a dataframe, where rows are genes, and columns are
cells. In R language, this corresponds to :
sobj[genes,cells]
.
We will take the first 1000 cells, by using the range
1:1000
, like this :
# Select the cells to keep
sobj_tiny <- sobj[,1:1000]
# Check that the object contains 1000 cells
sobj_tiny
## An object of class Seurat
## 12508 features across 1000 samples within 1 assay
## Active assay: RNA (12508 features, 0 variable features)
## 2 layers present: counts, data
NormalizeData
functionLet’s start by exploring the NormalizeData
function.
The input is your Seurat object (for us : sobj_tiny
)
Can you spot which is the default one ?
There are 3 normalization approaches included in Seurat : LogNormalize, CLR and RC. The default one is LogNormalize.
NormalizeData
function on the
sobj_tiny
You can now explore your new object : where are the normalized values stored ? Can you print the values of the first 4 genes in the 10 first cells ?
sobj_tiny <- NormalizeData(sobj_tiny)
# The normalized values are stored in the slot 'data' :
sobj_tiny@assays$RNA$data[1:4, 1:10]
## 4 x 10 sparse Matrix of class "dgCMatrix"
##
## Mrpl15 . 2.365796 . . . . . 0.6103322
## Lypla1 . . . 1.844929 . 1.627274 1.442936 0.6103322
## Gm37988 . . . . . . . .
## Tcea1 . . 1.781829 . 1.708152 1.627274 2.010388 .
##
## Mrpl15 . .
## Lypla1 . 1.052803
## Gm37988 . .
## Tcea1 1.701157 1.052803
# You can compare it to the slot 'count' (raw count) :
sobj_tiny@assays$RNA$count[1:4, 1:10]
## 4 x 10 sparse Matrix of class "dgCMatrix"
##
## Mrpl15 . 2 . . . . . 1 . .
## Lypla1 . . . 1 . 1 1 1 . 1
## Gm37988 . . . . . . . . . .
## Tcea1 . . 1 . 1 1 2 . 1 1
LogNormalize
LogNormalize
method or “scaling normalization” is the
simplest and most commonly used class of normalization approaches. This
involves dividing all counts for each cell by a cell-specific scaling
factor, often called a “size factor” (Anders and Huber 2010). The
assumption here is that any cell-specific bias (e.g., in capture or
amplification efficiency) affects all genes equally via scaling of the
expected mean count for that cell. The size factor for each cell
represents the estimate of the relative bias in that cell, so division
of its counts by its size factor should remove that bias. The resulting
“normalized expression values” can then be used for downstream analyses
such as clustering and dimensionality reduction. (From https://bioconductor.org/books/3.14/OSCA.basic/normalization.html)
# Apply LogNormalize
sobj_tiny <- NormalizeData(sobj_tiny,
normalization.method = "LogNormalize",
scale.factor = 10000)
We can now visualize the effect of LogNormalize
with
some violin plots :
VlnPlot(sobj_tiny,
features = c("Srm", "Apoe", "Top2a"),
layer = "data")
Violin plots are sometimes hard to read. Here is a function to build box plots out of Seurat objects :
BoxPlotFromSeurat <- function(seurat_obj, features, log = TRUE, layer = "data", plot.title = "Box Plot") {
# Extract the data for the features of interest
df <- Seurat::FetchData(seurat_obj, vars = features, layer = layer)
# Add metadata for grouping if group.by is specified
df <- cbind(df, seurat_obj[[]])
# Reshape the data for ggplot2
df_long <- tidyr::gather(df, key = "feature", value = "expression", group)
df_long <- df %>% group_by_(group.by) %>%
tidyr::pivot_longer(cols = features)
# Create box plots using ggplot2
if (log == TRUE) {
p <- ggplot(df_long, ggplot2::aes(x = name, y = log10(value))) +
geom_boxplot(outlier.shape = NA) +
theme_minimal() +
labs(x = "Feature", y = "Expression Level (log10)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
} else {
p <- ggplot(df_long, ggplot2::aes(x = name, y = value)) +
geom_boxplot(outlier.shape = NA) +
theme_minimal() +
labs(x = "Feature", y = "Expression Level") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
return(p)
}
# Example usage
# BoxPlotFromSeurat(sobj_tiny, features = c("Srm", "Apoe", "Top2a"))
You can try also with the boxplots.
VlnPlot(sobj_tiny,
features = c("Srm", "Apoe", "Top2a"),
layer = "count")
Try to do an histogram, with 1) the raw counts values and 2) the log
normalized values. You can use the R hist()
function. What
do you observe ? You will need to extract your data with a command such
as :
raw_counts <- as.vector(as.matrix(sobj_tiny@assays$RNA$count))
# Raw values
raw_counts <- as.vector(as.matrix(sobj_tiny@assays$RNA$count))
hist(raw_counts)
# Log Norm values
logNorm_counts <- as.vector(as.matrix(sobj_tiny@assays$RNA$data))
hist(logNorm_counts)
# If you want to visualize the counts without the zero values, add these lines :
raw_counts <- raw_counts[raw_counts > 0]
logNorm_counts <- logNorm_counts[logNorm_counts > 0]
# And then again :
hist(raw_counts)
hist(logNorm_counts)
# You can play with the "breaks" parameters to have a better visualization :
hist(raw_counts, breaks = 100)
hist(logNorm_counts, breaks = 100)
We can also plot the distribution of all genes in our dataset, and visualize it with some boxplots.
# Prepare the data
raw_df <- data.frame(Expression = raw_counts, Type = "Raw Counts")
norm_df <- data.frame(Expression = logNorm_counts, Type = "Normalized Counts")
plot_data <- rbind(raw_df, norm_df)
# Create the boxplot
boxplot_plot <- ggplot(plot_data, aes(x = Type, y = Expression, fill = Type)) +
geom_boxplot(outlier.shape = NA) +
scale_y_log10() +
ggtitle("Boxplots of Gene Expression Across All Genes (Before and After Normalization)") +
xlab("Normalization Status") +
ylab("Expression (Log10 Scale)") +
theme_minimal()
# Print the boxplot
print(boxplot_plot)
How do we know that we effectively normalized the data by reading the boxplots ?
Until now, we explored the variability among the genes count values. We can do the same with the cells :
# Prepare the data
df <- as.data.frame(sobj_tiny[[]])
df$cell_index <- rownames(df)
# Plots before and after normalization
ggplot(df, aes(x = cell_index, y = nCount_RNA)) + geom_bar(stat="identity")
ggplot(df, aes(x = cell_index, y = log10_nCount_RNA)) + geom_bar(stat="identity")
Finally, we want to see the impact of the normalization on a highly variable gene. We chose to focus on “Tmsb10”.
You can use the FeatureScatter
function for this.
FeatureScatter(sobj_tiny,
feature1 = "nCount_RNA",
feature2 = "Tmsb10",
slot = "count")
What can you observe ?
FeatureScatter(sobj_tiny,
feature1 = "nCount_RNA",
feature2 = "Tmsb10",
slot = "data")
Discussion : LogNormalize reduces the impact of sequencing depth differences but may not handle highly variable genes effectively.
SCTransform
SCTransform
is a normalization method designed for
single-cell RNA-seq data. It uses a variance-stabilizing transformation
(VST) to model and remove technical noise, making it particularly
effective for datasets with high variability and technical noise, such
as dropouts and differences in sequencing depth.
SCTransform
is available in the Seurat package and is often
used as an alternative to traditional log normalization.
Key Features of SCTransform :
Variance stabilization: SCTransform aims to stabilize the variance across genes, making it less affected by highly variable or highly expressed genes.
Data modeling: it uses a generalized linear model (GLM) to model the counts as a function of sequencing depth, accounting for technical noise.
No manual scaling needed: SCTransform automatically standardizes the data, so additional scaling is not required before PCA.
# Apply SCTransform
sobj_tiny <- SCTransform(sobj_tiny,
verbose = FALSE)
# Visualize the effect of SCTransform
VlnPlot(sobj_tiny,
features = c("Srm", "Apoe", "Top2a"),
log = TRUE)
Here is an example of the results on the Violin Plots :
# Normalize data using LogNormalize (for comparison)
sobj_tiny_log <- NormalizeData(sobj_tiny, normalization.method = "LogNormalize")
sobj_tiny_log <- ScaleData(sobj_tiny_log)
# Normalize data using SCTransform
sobj_tiny_sct <- SCTransform(sobj_tiny, verbose = FALSE)
# Compare the distributions of gene expression using violin plots
vln_log <- VlnPlot(sobj_tiny_log, features = c("Srm", "Apoe", "Top2a"), log = TRUE)
vln_sct <- VlnPlot(sobj_tiny_sct, features = c("Srm", "Apoe", "Top2a"))
# Display the plots side by side
vln_log
vln_sct
Discussion: SCTransform handles variability better, making it suitable for noisy datasets.
Until now, we have talked a lot about the normalization but what about the scaling ??? Well, it is just an additional step that we use to standardize the expression values across all cells.
In single-cell RNA-seq, scaling typically refers to standardizing the expression of each gene across all cells (not across genes !). The goal is to make the expression of each gene comparable across all cells, ensuring that highly expressed genes with large variance don’t dominate the analysis.
It typically involves transforming each gene’s expression values so that they have a mean of 0 and a standard deviation of 1 (z-score transformation). This ensures that each gene contributes equally to downstream analyses, such as PCA, clustering, and UMAP.
If you want a comparison, it’s like converting temperatures from Celsius and Fahrenheit to a common scale so we can compare them easily. Moreover, we scale to ensure that highly expressed genes don’t dominate the analysis (e.g., in PCA or clustering). Without scaling, a small set of genes with very high variance can bias the results.
Please note that not all genes need to be scaled (e.g., marker genes of interest). Be careful also that scaling should be done after normalization, not before.
How it works ?
For each gene, we take the expression values across all cells and :
1- Center them by subtracting the mean expression of the gene
2- Scale them by dividing by the standard deviation of the gene
\[z = \frac{x - \text{mean}}{\text{standard deviation}}\]
Where, \(x\) is the expression value for a particular gene in a specific cell. Mean and standard deviation are calculated across all cells for that gene.
What it is not ? Scaling is not about making the expression of each cell comparable across all genes. That’s what normalization (e.g., LogNormalize, SCTransform) does : adjusting for sequencing depth differences across cells.
How to scale the data in Seurat ? With the function
ScaleData
.
ScaleData
function.What are the arguments you think will be useful for us ?
sobj_tiny@active.assay <- "RNA"
sobj_tiny <- ScaleData(sobj_tiny)
scaled_data
scaled_data <- as.matrix(sobj_tiny@assays$RNA@layers$scale.data)
After scaling, the expression values for each gene will have a mean of 0 and a standard deviation of 1.
You will probably need the R functions rowMeans
and
rowSds
. If you don’t know them, have a look at their help
page.
# Calculate the mean and standard deviation for each gene without using apply()
gene_means <- rowMeans(scaled_data)
gene_sds <- rowSds(scaled_data)
# Check the results
summary(gene_means)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.02466 -0.00290 0.00000 -0.00453 0.00000 0.00000
summary(gene_sds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.9327 1.0000 0.8859 1.0000 1.0000
Let’s see what happens when we use scaled data for DE analysis. You
can run the analyses with the functions FindMarkers
or
FindAllMarkers
. For this exercise, it will be more useful
if you first run a clustering analyses in order to get identified groups
of cells.
Example :
sobj_tiny <- NormalizeData(sobj_tiny)
sobj_tiny <- FindVariableFeatures(sobj_tiny)
sobj_tiny <- ScaleData(sobj_tiny)
sobj_tiny <- RunPCA(sobj_tiny, dims = 1:20)
sobj_tiny <- FindNeighbors(sobj_tiny, dims = 1:20)
sobj_tiny <- FindClusters(sobj_tiny, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1000
## Number of edges: 41202
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7202
## Number of communities: 7
## Elapsed time: 0 seconds
sobj_tiny <- RunUMAP(sobj_tiny, dims = 1:20)
DimPlot(sobj_tiny, reduction = "umap")
You can also run a first differential expression analyses to be able to compare after your results :
# Perform DE analyses
sobj_tiny.markers <- FindAllMarkers(sobj_tiny, only.pos = TRUE)
# Extract top 10 genes per cluster
top10 <- sobj_tiny.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 10) %>%
ungroup()
# Display heatmap of DE results
DoHeatmap(sobj_tiny, features = top10$gene) + NoLegend() + theme(text = element_text(size = 7))
## Warning in DoHeatmap(sobj_tiny, features = top10$gene): The following features
## were omitted as they were not found in the scale.data slot for the RNA assay:
## Gm11725, Nln, Notch3, Dennd3, Pdk1, Slc43a1, Tle2, Smoc1, Cep85, Tdrd5,
## BB031773, Plxdc1, Patj
# Display top DE results
head(top10, n = 30)
## # A tibble: 30 × 7
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 0.00000000185 1.68 0.242 0.114 0.0000231 0 Patj
## 2 0.00000000657 1.05 0.374 0.247 0.0000821 0 Plxdc1
## 3 0.0000000760 1.60 0.206 0.096 0.000951 0 Slc6a19
## 4 0.000000922 1.48 0.16 0.068 0.0115 0 BB031773
## 5 0.00000100 1.41 0.204 0.1 0.0125 0 Tdrd5
## 6 0.00000290 1.04 0.364 0.278 0.0362 0 Cep85
## 7 0.00000474 2.02 0.107 0.036 0.0593 0 Smoc1
## 8 0.00000877 1.76 0.115 0.043 0.110 0 Tle2
## 9 0.0000318 1.06 0.318 0.236 0.398 0 Slc43a1
## 10 0.0000360 1.23 0.262 0.181 0.451 0 Pdk1
## # ℹ 20 more rows
# Attempt DE analysis using scaled data
sobj_tiny_scaled.markers <- FindAllMarkers(sobj_tiny, only.pos = TRUE, assay = "RNA", slot = "scale.data")
## Warning: No DE genes identified
## Warning: The following tests were not performed:
## Warning: When testing 0 versus all:
## subscript out of bounds
## Warning: When testing 1 versus all:
## subscript out of bounds
## Warning: When testing 2 versus all:
## subscript out of bounds
## Warning: When testing 3 versus all:
## subscript out of bounds
## Warning: When testing 4 versus all:
## subscript out of bounds
## Warning: When testing 5 versus all:
## subscript out of bounds
## Warning: When testing 6 versus all:
## subscript out of bounds
# Not working !! Scaled data is designed for clustering, dimensionality reduction, and visualizations.
Can you try also with raw counts (not normalized) ? What do you observe ?
markers <- FindAllMarkers(sobj_tiny, only.pos = TRUE, slot = "counts")
top10 <- markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 10) %>%
ungroup()
# Display heatmap of DE results
DoHeatmap(sobj_tiny, features = top10$gene) + NoLegend() + theme(text = element_text(size = 7))
# Display top DE results
head(markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## mt-Co2 1.493196e-21 0.1506044 0.997 0.995 1.867690e-17 0 mt-Co2
## Eif1 4.006820e-18 0.1422588 0.977 0.992 5.011731e-14 0 Eif1
## Cd3d 1.441059e-17 0.1042215 0.936 0.979 1.802476e-13 0 Cd3d
## Serf2 1.783625e-15 0.1231241 0.952 0.980 2.230958e-11 0 Serf2
## Laptm5 2.012218e-15 0.1146634 0.944 0.959 2.516882e-11 0 Laptm5
## mt-Atp8 3.073331e-15 0.1966736 1.000 0.995 3.844122e-11 0 mt-Atp8
Discussion : * Why might the DE results be non-significant or unexpected ? * What information is lost during scaling that is critical for DE analysis ?
features
optionThis value is set by default to 2000. It represents the number of
highly variable genes to take into account. You can explore the impact
it has on the PCA. Hint : use the RunPCA
function from
Seurat.
To answer this question, you will need a function to reinitialize your Seurat object as follows :
# Function to reinitialize the Seurat object
reinitialize_sobj <- function(seurat_obj) {
# Copy the object
seurat_copy <- seurat_obj
# Remove variable features
seurat_copy@assays$RNA@meta.data <- data.frame()
# Remove normalized and scales data
seurat_copy@assays$RNA@layers$data <- NULL
seurat_copy@assays$RNA@layers$scale.data <- NULL
# Remove PCA and UMAP
seurat_copy@reductions <- list()
# Remove SCT
seurat_copy@assays$SCT <- NULL
# Set active assay to RNA
seurat_copy@active.assay <- "RNA"
return(seurat_copy)
}
# Function to perform analysis with a specified number of variable features
perform_analysis <- function(seurat_obj, num_features) {
# Create a new Seurat object from raw counts
seurat_copy <- reinitialize_sobj(seurat_obj)
# Normalize the data
seurat_copy <- NormalizeData(seurat_copy)
# Set the variable features (up to the specified number)
seurat_copy <- FindVariableFeatures(seurat_copy,
selection.method = "vst",
nfeatures = num_features)
# Identify the 10 most highly variable genes
top_features <- head(VariableFeatures(seurat_copy), 20)
print(top_features)
# Scale the data, and run PCA
seurat_copy <- ScaleData(seurat_copy)
seurat_copy <- RunPCA(seurat_copy,
features = VariableFeatures(seurat_copy),
verbose = FALSE)
# Generate PCA plot
pca_plot <- DimPlot(seurat_copy, reduction = "pca") + ggtitle(paste(num_features, "Variable Features"))
# Generate DimHeatmap for the first two principal components
DimHeatmap(seurat_copy,
dims = 1:2,
balanced = TRUE,
reduction = "pca")
# Cluster the cells and run UMAP
seurat_copy <- FindNeighbors(seurat_copy, dims = 1:20)
seurat_copy <- FindClusters(seurat_copy, resolution = 0.5)
seurat_copy <- RunUMAP(seurat_copy, dims = 1:20)
umap_plot <- DimPlot(seurat_copy, reduction = "umap")
# Find markers
seurat_copy.markers <- FindAllMarkers(seurat_copy, only.pos = TRUE)
top10 <- seurat_copy.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 10) %>%
ungroup()
heatmap_markers_plot <- DoHeatmap(seurat_copy, features = top10$gene) + NoLegend()
# Generate Variable Feature Plot with labeled top 20 features
vf_plot <- VariableFeaturePlot(seurat_copy)
vf_plot <- LabelPoints(plot = vf_plot, points = top_features, repel = TRUE) +
ggtitle(paste("Top Variable Features:", num_features))
# Print your new Seurat object
print(seurat_copy)
# Return a list of plots
return(list(
SeuratObject = seurat_copy,
PCA = pca_plot,
VariableFeatures = vf_plot,
UMAP_plot = umap_plot,
Heatmap_markers = heatmap_markers_plot))
}
# Analyze with different numbers of variable features
results_100 <- perform_analysis(sobj_tiny, 100)
## [1] "Hist1h2ao" "Pclaf" "Hist1h2ap" "Rrm2" "Top2a" "Hist1h2ae"
## [7] "Hmgb2" "Ms4a4b" "Tuba1b" "Itm2a" "Birc5" "Trbv16"
## [13] "Ifi27l2a" "Trbv29" "Trbv15" "Trbv5" "H2-K1" "Nkg7"
## [19] "Ube2c" "Hist1h1b"
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1000
## Number of edges: 34398
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8030
## Number of communities: 14
## Elapsed time: 0 seconds
## Warning in DoHeatmap(seurat_copy, features = top10$gene): The following
## features were omitted as they were not found in the scale.data slot for the RNA
## assay: Tyrobp, Vcam1, Ear2, C1qb, Lgals3, C1qc, Cd74, Fcer1g, C1qa, Rflnb, Cd7,
## Gimap4, H2-Q7, Gm2682, S1pr1, Itgae, Tmc4, Gm20517, Trdj1, Ephb6, Trbj2-1,
## Cyp11a1, Trbj1-5, Traj2, Gm49355, Sdr42e1, Comp, Gabrd, Sdcbp2, Paqr8,
## C920008G01Rik, E2f8, Clspn, Rad51, Tmie, Pcsk1, Grik1, Gm30373, Zc2hc1a, Fbln1,
## Trbv13-2, Lypd1, Cd69, Cd40lg, Tigit, Nab2, Egr2, 0610005C13Rik, Sbsn, Uckl1os,
## 1700029I15Rik, Catspere2, 4732491K20Rik, Gm29019, Trav3-4, Gm44731, Zfp316,
## Gm43062, Gm15417, Tyro3, Gm38037, Myo5b, Slc9a3r2, Slit3, Trbj2-2, Gm17349,
## Gm34552, Gm30906, Trav6-2, Ubap1l, A830035O19Rik, Gm12840, Isl2, Star, Eif2s3y,
## Gm26526, Trav6-4, Mgam, Trim80, Cabyr, Tes, Casp9, Pam, Gm45623, Gm13920,
## 2810414N06Rik, Gm13919, Tpbgl, Trbj2-4, Fbln2, Ddx3y, 5830405F06Rik, Gm33104,
## Trbv24, Pde4dip, Tle2, Trbv23, Trbv21
## An object of class Seurat
## 12508 features across 1000 samples within 1 assay
## Active assay: RNA (12508 features, 100 variable features)
## 3 layers present: counts, data, scale.data
## 2 dimensional reductions calculated: pca, umap
results_2000 <- perform_analysis(sobj_tiny, 2000)
## [1] "Hist1h2ao" "Pclaf" "Hist1h2ap" "Rrm2" "Top2a" "Hist1h2ae"
## [7] "Hmgb2" "Ms4a4b" "Tuba1b" "Itm2a" "Birc5" "Trbv16"
## [13] "Ifi27l2a" "Trbv29" "Trbv15" "Trbv5" "H2-K1" "Nkg7"
## [19] "Ube2c" "Hist1h1b"
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1000
## Number of edges: 41202
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7202
## Number of communities: 7
## Elapsed time: 0 seconds
## Warning in DoHeatmap(seurat_copy, features = top10$gene): The following
## features were omitted as they were not found in the scale.data slot for the RNA
## assay: Gm11725, Nln, Notch3, Dennd3, Pdk1, Slc43a1, Tle2, Smoc1, Cep85, Tdrd5,
## BB031773, Plxdc1, Patj
## An object of class Seurat
## 12508 features across 1000 samples within 1 assay
## Active assay: RNA (12508 features, 2000 variable features)
## 3 layers present: counts, data, scale.data
## 2 dimensional reductions calculated: pca, umap
results_5000 <- perform_analysis(sobj_tiny, 5000)
## [1] "Hist1h2ao" "Pclaf" "Hist1h2ap" "Rrm2" "Top2a" "Hist1h2ae"
## [7] "Hmgb2" "Ms4a4b" "Tuba1b" "Itm2a" "Birc5" "Trbv16"
## [13] "Ifi27l2a" "Trbv29" "Trbv15" "Trbv5" "H2-K1" "Nkg7"
## [19] "Ube2c" "Hist1h1b"
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1000
## Number of edges: 40708
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7142
## Number of communities: 6
## Elapsed time: 0 seconds
## Warning in DoHeatmap(seurat_copy, features = top10$gene): The following
## features were omitted as they were not found in the scale.data slot for the RNA
## assay: St8sia1, Ccnd3, Tiparp, Gm44175, Cep85, Plxdc1, Gtf2h4, Arl5c
## An object of class Seurat
## 12508 features across 1000 samples within 1 assay
## Active assay: RNA (12508 features, 5000 variable features)
## 3 layers present: counts, data, scale.data
## 2 dimensional reductions calculated: pca, umap
# Display PCA plots for comparison
print(results_100$PCA)
print(results_2000$PCA)
print(results_5000$PCA)
# Display Variable Feature Plots
print(results_100$VariableFeatures)
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
print(results_2000$VariableFeatures)
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## ggrepel: 4 unlabeled data points (too many overlaps). Consider increasing max.overlaps
print(results_5000$VariableFeatures)
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# Display UMAPs
print(results_100$UMAP_plot)
print(results_2000$UMAP_plot)
print(results_5000$UMAP_plot)
# Display UMAPs
print(results_100$Heatmap_markers)
print(results_2000$Heatmap_markers)
print(results_5000$Heatmap_markers)
# Set all genes as variable features for consistency
#VariableFeatures(sobj_tiny) <- rownames(sobj_tiny)
# Apply SCTransform using all genes
sobj_tiny_sct <- SCTransform(sobj_tiny,
verbose = FALSE,
variable.features.n = 2000)
# Normalize and scale using LogNormalize + ScaleData for all genes
sobj_tiny@active.assay <- "RNA"
sobj_tiny_log <- NormalizeData(sobj_tiny, normalization.method = "LogNormalize", scale.factor = 10000)
sobj_tiny_log <- FindVariableFeatures(sobj_tiny_log)
sobj_tiny_log <- ScaleData(sobj_tiny_log, features = VariableFeatures(sobj_tiny_log))
# Extract the scaled data matrices for comparison
sct_data <- as.matrix(sobj_tiny_sct@assays$SCT@scale.data)
scaled_data <- as.matrix(sobj_tiny_log@assays$RNA@layers$scale.data)
# Check if the dimensions are now consistent
dim(sct_data)
## [1] 2000 1000
dim(scaled_data)
## [1] 2000 1000
# Calculate mean and standard deviation for SCTransform
sct_means <- rowMeans(sct_data)
sct_sds <- rowSds(sct_data)
# Calculate mean and standard deviation for ScaleData
scaled_means <- rowMeans(scaled_data)
scaled_sds <- rowSds(scaled_data)
# Create a data frame for comparison
comparison_df <- data.frame(
Gene = rownames(sct_data),
SCT_Mean = sct_means,
SCT_SD = sct_sds,
Scaled_Mean = scaled_means,
Scaled_SD = scaled_sds
)
summary(comparison_df$SCT_SD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3781 1.0342 1.0687 1.0704 1.1125 2.4202
summary(comparison_df$Scaled_SD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3563 0.9531 1.0000 0.9340 1.0000 1.0000
# Plot histograms of the standard deviations for both methods
hist_plot <- ggplot(comparison_df, aes(x = SCT_SD)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.5) +
geom_histogram(aes(x = Scaled_SD), bins = 30, fill = "red", alpha = 0.5) +
ggtitle("Comparison of Standard Deviations: SCTransform vs. ScaleData") +
xlab("Standard Deviation") +
ylab("Frequency") +
theme_minimal()
summary(comparison_df$Scaled_Mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.024654 -0.004104 0.000000 -0.003902 0.000000 0.000000
summary(comparison_df$SCT_Mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.015e-17 -8.882e-18 1.874e-19 7.455e-19 9.770e-18 3.025e-16
# Scatter plot comparing mean expression values
scatter_plot <- ggplot(comparison_df, aes(x = Scaled_Mean, y = SCT_Mean)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
ggtitle("Scatter Plot of Mean Expression: ScaleData vs. SCTransform") +
xlab("Mean Expression (ScaleData)") +
ylab("Mean Expression (SCTransform)") +
theme_minimal()
# Display the plots
hist_plot
scatter_plot
In single-cell RNA-seq, regressing out variables (e.g., mitochondrial content, cell cycle effects) removes the influence of those variables on gene expression. The goal is to mitigate technical or unwanted biological variability, leaving a dataset that better reflects the biological phenomena of interest.
In Seurat, you can use the ScaleData
function to remove
unwanted sources of variation. For example, we could ‘regress out’
heterogeneity associated with cell cycle stage (“CC_Seurat_Phase”), or
mitochondrial contamination (“percent.mt”). This adjusts each gene’s
expression by fitting a linear model for the specified variables and
replacing the values with the residuals. The command to regress out
mitochondrial contamination would be :
sobj_tiny <- ScaleData(sobj_tiny, vars.to.regress = "percent.mt")
SCTransform
also allows to do a regression. If you run
SCTransform
, you don’t have to then run
ScaleData
as it’s all included in the
SCTransform
function. We can specify the variable to
regress out like this :
sobj_tiny <- SCTransform(sobj_tiny, vars.to.regress = "percent.mt")
You can take these 3 genes that we used before (“Srm”, “Apoe”,
“Top2a”), or any other of your choice. In the correction, we use “Cdk1”.
Use the function FeatureScatter
to easily visualize the
effect.
feature <- "Cdk1"
sobj_tiny@active.assay <- "RNA"
sobj_tiny <- NormalizeData(sobj_tiny)
sobj_tiny <- FindVariableFeatures(sobj_tiny)
# Scatter plot before regression
sobj_tiny <- ScaleData(sobj_tiny)
sobj_tiny$feature_value <- as.vector(LayerData(sobj_tiny, assay = "RNA", layer = "scale.data")[feature,]) # copy the values of the feature in the metadata
FeatureScatter(sobj_tiny, feature1 = "CC_Seurat_G2M.Score", feature2 = "feature_value", slot = "scale.data")
FeatureScatter(sobj_tiny, feature1 = "CC_Seurat_S.Score", feature2 = "feature_value", slot = "scale.data")
# Regress out cell cycle phase
sobj_tiny_cc_reg <- ScaleData(sobj_tiny, vars.to.regress = "CC_Seurat_Phase")
sobj_tiny_cc_reg$feature_value <- as.vector(LayerData(sobj_tiny_cc_reg, assay = "RNA", layer = "scale.data")[feature,]) # copy the values of the feature in the metadata
FeatureScatter(sobj_tiny_cc_reg, feature1 = "CC_Seurat_G2M.Score", feature2 = "feature_value", slot = "scale.data")
FeatureScatter(sobj_tiny_cc_reg, feature1 = "CC_Seurat_S.Score", feature2 = "feature_value", slot = "scale.data")
# Before regression
sobj_tiny <- RunPCA(sobj_tiny, verbose = FALSE)
DimHeatmap(sobj_tiny, dims = 1:2, balanced = TRUE, reduction = "pca")
# After regression
sobj_tiny <- RunPCA(sobj_tiny_cc_reg, verbose = FALSE)
DimHeatmap(sobj_tiny_cc_reg, dims = 1:2, balanced = TRUE, reduction = "pca")
Try to find genes that are highly affected by the regression, or other that expression do not move after regression.
Take several pipelines that we have seen, and evaluate how it impacts the results of the differentially expressed genes.
For example, you could compare :
Congrats ! You’ve made it through your first experience with normalization, scaling and regression. Here are a few takeaway messages :
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Paris
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.1.4 scater_1.30.1
## [3] scuttle_1.12.0 SingleCellExperiment_1.24.0
## [5] SummarizedExperiment_1.32.0 Biobase_2.62.0
## [7] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
## [9] IRanges_2.36.0 S4Vectors_0.40.2
## [11] BiocGenerics_0.48.1 MatrixGenerics_1.14.0
## [13] matrixStats_1.4.0 patchwork_1.2.0
## [15] ggplot2_3.5.1 Seurat_5.1.0
## [17] SeuratObject_5.0.2 sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.3.1
## [3] later_1.3.2 bitops_1.0-8
## [5] tibble_3.2.1 polyclip_1.10-7
## [7] fastDummies_1.7.4 lifecycle_1.0.4
## [9] globals_0.16.3 lattice_0.22-5
## [11] MASS_7.3-60 magrittr_2.0.3
## [13] limma_3.58.1 plotly_4.10.4
## [15] sass_0.4.9 rmarkdown_2.28
## [17] jquerylib_0.1.4 yaml_2.3.10
## [19] httpuv_1.6.15 sctransform_0.4.1
## [21] spam_2.10-0 spatstat.sparse_3.1-0
## [23] reticulate_1.39.0 cowplot_1.1.3
## [25] pbapply_1.7-2 RColorBrewer_1.1-3
## [27] abind_1.4-5 zlibbioc_1.48.2
## [29] Rtsne_0.17 purrr_1.0.2
## [31] RCurl_1.98-1.16 GenomeInfoDbData_1.2.11
## [33] ggrepel_0.9.5 irlba_2.3.5.1
## [35] listenv_0.9.1 spatstat.utils_3.1-0
## [37] goftest_1.2-3 RSpectra_0.16-2
## [39] spatstat.random_3.3-1 fitdistrplus_1.2-1
## [41] parallelly_1.38.0 DelayedMatrixStats_1.24.0
## [43] leiden_0.4.3.1 codetools_0.2-19
## [45] DelayedArray_0.28.0 tidyselect_1.2.1
## [47] farver_2.1.2 ScaledMatrix_1.10.0
## [49] viridis_0.6.5 spatstat.explore_3.3-2
## [51] jsonlite_1.8.8 BiocNeighbors_1.20.1
## [53] progressr_0.14.0 ggridges_0.5.6
## [55] survival_3.5-7 tools_4.3.1
## [57] ica_1.0-3 Rcpp_1.0.13
## [59] glue_1.7.0 gridExtra_2.3
## [61] SparseArray_1.2.2 xfun_0.49
## [63] withr_3.0.1 fastmap_1.2.0
## [65] fansi_1.0.6 digest_0.6.37
## [67] rsvd_1.0.5 R6_2.5.1
## [69] mime_0.12 colorspace_2.1-1
## [71] scattermore_1.2 tensor_1.5
## [73] spatstat.data_3.1-2 utf8_1.2.4
## [75] tidyr_1.3.1 generics_0.1.3
## [77] data.table_1.16.0 httr_1.4.7
## [79] htmlwidgets_1.6.4 S4Arrays_1.2.0
## [81] uwot_0.1.16 pkgconfig_2.0.3
## [83] gtable_0.3.5 lmtest_0.9-40
## [85] XVector_0.42.0 htmltools_0.5.8.1
## [87] dotCall64_1.1-1 scales_1.3.0
## [89] png_0.1-8 spatstat.univar_3.0-1
## [91] knitr_1.48 rstudioapi_0.15.0
## [93] reshape2_1.4.4 nlme_3.1-164
## [95] cachem_1.1.0 zoo_1.8-12
## [97] stringr_1.5.1 KernSmooth_2.23-22
## [99] parallel_4.3.1 miniUI_0.1.1.1
## [101] vipor_0.4.7 ggrastr_1.0.2
## [103] pillar_1.9.0 grid_4.3.1
## [105] vctrs_0.6.5 RANN_2.6.2
## [107] promises_1.3.0 BiocSingular_1.18.0
## [109] beachmat_2.18.0 xtable_1.8-4
## [111] cluster_2.1.6 beeswarm_0.4.0
## [113] evaluate_0.24.0 cli_3.6.3
## [115] compiler_4.3.1 rlang_1.1.4
## [117] crayon_1.5.3 future.apply_1.11.2
## [119] labeling_0.4.3 plyr_1.8.9
## [121] ggbeeswarm_0.7.2 stringi_1.8.4
## [123] viridisLite_0.4.2 deldir_2.0-4
## [125] BiocParallel_1.36.0 munsell_0.5.1
## [127] lazyeval_0.2.2 spatstat.geom_3.3-2
## [129] Matrix_1.6-4 RcppHNSW_0.6.0
## [131] sparseMatrixStats_1.14.0 future_1.34.0
## [133] statmod_1.5.0 shiny_1.9.1
## [135] highr_0.11 ROCR_1.0-11
## [137] igraph_2.0.3 bslib_0.8.0