Introduction

This practical session is designed to help you understand the importance of normalization, scaling, and regression in single-cell RNA-seq analysis. You will apply these steps using different methods and visualize the effects on the data.

Input data: Seurat object containing the filtered count matrix from the previous class. It’s called 05_TD3A_S5_Doublets.Filtered_12508.4035.RDS.

Output data: Seurat object after the normalization, scaling and regression. It’s called sobj_TD3A_normalized.rds.

Learning Objectives

  • Discuss why normalizing counts is necessary to compare cells
  • Be able to understand different normalization approaches
  • Be able to decide when to regress out a given variable, by evaluating the effects from any unwanted sources of variation

Motivation

Up to this point, we have filtered out low-quality cells, ambient RNA contamination, and doublets. The data is now a count matrix with cells as columns and genes as rows. These counts reflect the capture, reverse transcription, and sequencing steps of the scRNA-seq experiment, each introducing variability. This means the observed differences in gene expression may be influenced by sampling noise rather than true biological differences, resulting in uneven variance across the dataset.

Normalization addresses this by adjusting raw counts to minimize the effects of variable sampling, making the data more suitable for analysis, which often assumes uniform variance. Various normalization techniques exist, each tailored to support different downstream analyses.

A recent study by Ahlmann-Eltze and Huber (2023) benchmarked 22 normalization methods. It’s essential to select the normalization method based on the specific analysis goals. However, for this practical we will just explore the most well-known way to doing single-cell normalization.

Setup

We first load the packages that are needed :

library(Seurat)
library(ggplot2)
library(patchwork)
library(scater)
library(dplyr)

We then import the data that have been previously filtered (reminder : we removed low quality cells, removed ambient RNA and scored doublets) :

sobj <- readRDS("input/05_TD3A_S5_Doublets.Filtered_12508.4035.RDS")
sobj
## An object of class Seurat 
## 12508 features across 4035 samples within 1 assay 
## Active assay: RNA (12508 features, 0 variable features)
##  2 layers present: counts, data

For the sake of this practical session, we will use a smaller dataset which contains only 1000 cells. In order to do this, we can use the Seurat objet as a dataframe, where rows are genes, and columns are cells. In R language, this corresponds to : sobj[genes,cells].

We will take the first 1000 cells, by using the range 1:1000, like this :

# Select the cells to keep 
sobj_tiny <- sobj[,1:1000]
# Check that the object contains 1000 cells
sobj_tiny
## An object of class Seurat 
## 12508 features across 1000 samples within 1 assay 
## Active assay: RNA (12508 features, 0 variable features)
##  2 layers present: counts, data

Exercices 1 : explore NormalizeData function

Let’s start by exploring the NormalizeData function.

What is the input of this function ?

Answers

The input is your Seurat object (for us : sobj_tiny)

How many normalization methods are they ?

Can you spot which is the default one ?

Answers

There are 3 normalization approaches included in Seurat : LogNormalize, CLR and RC. The default one is LogNormalize.

Run the NormalizeData function on the sobj_tiny

You can now explore your new object : where are the normalized values stored ? Can you print the values of the first 4 genes in the 10 first cells ?

Answers
sobj_tiny <- NormalizeData(sobj_tiny)

# The normalized values are stored in the slot 'data' : 
sobj_tiny@assays$RNA$data[1:4, 1:10]
## 4 x 10 sparse Matrix of class "dgCMatrix"
##                                                                          
## Mrpl15  . 2.365796 .        .        .        .        .        0.6103322
## Lypla1  . .        .        1.844929 .        1.627274 1.442936 0.6103322
## Gm37988 . .        .        .        .        .        .        .        
## Tcea1   . .        1.781829 .        1.708152 1.627274 2.010388 .        
##                          
## Mrpl15  .        .       
## Lypla1  .        1.052803
## Gm37988 .        .       
## Tcea1   1.701157 1.052803
# You can compare it to the slot 'count' (raw count) :
sobj_tiny@assays$RNA$count[1:4, 1:10]
## 4 x 10 sparse Matrix of class "dgCMatrix"
##                            
## Mrpl15  . 2 . . . . . 1 . .
## Lypla1  . . . 1 . 1 1 1 . 1
## Gm37988 . . . . . . . . . .
## Tcea1   . . 1 . 1 1 2 . 1 1

Exercises 2 : let’s test some different normalization strategies

Method 1: LogNormalize

LogNormalize method or “scaling normalization” is the simplest and most commonly used class of normalization approaches. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction. (From https://bioconductor.org/books/3.14/OSCA.basic/normalization.html)

# Apply LogNormalize
sobj_tiny <- NormalizeData(sobj_tiny,
                          normalization.method = "LogNormalize",
                          scale.factor = 10000)

We can now visualize the effect of LogNormalize with some violin plots :

VlnPlot(sobj_tiny,
        features = c("Srm", "Apoe", "Top2a"),
        layer = "data")

Can you explain to your neighbor what does this plot mean ?

Violin plots are sometimes hard to read. Here is a function to build box plots out of Seurat objects :

BoxPlotFromSeurat <- function(seurat_obj, features, log = TRUE, layer = "data", plot.title = "Box Plot") {
  # Extract the data for the features of interest
  df <- Seurat::FetchData(seurat_obj, vars = features, layer = layer)
  
  # Add metadata for grouping if group.by is specified
  df <- cbind(df, seurat_obj[[]])
  
  # Reshape the data for ggplot2
  df_long <- tidyr::gather(df, key = "feature", value = "expression", group)
  df_long <- df %>% group_by_(group.by) %>%
    tidyr::pivot_longer(cols = features)

  # Create box plots using ggplot2
  if (log == TRUE) {
    p <- ggplot(df_long, ggplot2::aes(x = name, y = log10(value))) +
    geom_boxplot(outlier.shape = NA) +
    theme_minimal() +
    labs(x = "Feature", y = "Expression Level (log10)") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  } else {
    p <- ggplot(df_long, ggplot2::aes(x = name, y = value)) +
    geom_boxplot(outlier.shape = NA) +
    theme_minimal() +
    labs(x = "Feature", y = "Expression Level") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  }
  return(p)
}

# Example usage
# BoxPlotFromSeurat(sobj_tiny, features = c("Srm", "Apoe", "Top2a"))

Can you do the same Violin plots but with the raw counts ?

You can try also with the boxplots.

Answers
VlnPlot(sobj_tiny,
        features = c("Srm", "Apoe", "Top2a"), 
        layer = "count")