Forewords

I need three teams for this session: team Wilcoxon, team Student, and team ROC.

I will also need your help because I can’t make DESeq2 work correctly. But I’m sure, that we will solve my issue: you’re in the best session here at EBAII.

TLDR: R command lines

In this presentation, there will be screen captures for you to follow the lesson. There will also be every single R command lines. Do not take care of the command lines if you find them too challenging. Our goal here, is to understand the main mechanism of Differential Expression Analysis. R is just a tool.

Below are the libraries we need to perform this whole session:

base::library(package = "BiocParallel")    # Optionally multithread some steps
base::library(package = "DT")              # Display nice table in HTML
base::library(package = "ggplot2")         # Draw graphs and plots
base::library(package = "ggpubr")          # Draw nicer graphs
base::library(package = "rstatix")         # Base R statistics
base::library(package = "knitr")           # Build this presentation
base::library(package = "dplyr")           # Handle big tables
base::library(package = "Seurat")          # Handle SingleCell analyses
base::library(package = "SeuratObject")    # Handle SingleCell objects
base::library(package = "SingleCellExperiment") # Handle SingleCell file formats
base::library(package = "UpSetR")          # Nice venn-like graphs
base::library(package = "EnhancedVolcano") # Draw Volcano plot

First, we load Seurat object:

sobj <- base::readRDS(
  # Path to the RDS file
  file = "Scaled_Normalized_Seurat_Object.RDS"
)

Then we perform differential expression analysis:

sobj_de <- Seurat::FindMarkers(
  # Object on which to perform DEA
  object = sobj,
  # Name of factor in condition 1
  ident.1 = "TD3A",
  # Name of factor in condition 2
  ident.2 = "TDCT"
)

And that’s all. Our goal is to understand these lines, being able to write them is a bonus.

Purpose of this session

Up to now, we have:

  1. Identified to which cell each sequenced reads come from
  2. Identified to which gene each read come from
  3. Identified possible bias in gene expression for each cell
  4. Filtered and corrected these bias as well as we can

We would like to identify the list of genes that characterize differences between cell cycle phases G1 and S groups.

At the end of this session you will know:

  1. how to select a differential analysis method
  2. how to select the correct normalization (if any?) that must be provided to your differential analysis method
  3. How to read differential expression results

Load RDS dataset

You already have a dataset loaded ? Good. Keep on going with it ! You don’t have one ? Use mine:

# We store in the variable `sobj`, the result of
# the function `readRDS` from package `base`.
sobj <- base::readRDS(
  # Path to the RDS file
  file = "Scaled_Normalized_Seurat_Object.RDS"
)

The command above, uses the function readRDS from base R package. and git it the path to the RDS file we built in previous session.

Insight

We are wondering what’s in our dataset. Let’s have a look, with the function print form the package base.

base::print(
  # Object to display
  x = sobj
)
An object of class Seurat 
6941 features across 3984 samples within 1 assay 
Active assay: RNA (6941 features, 2000 variable features)

We have 6941 features (aka genes), accross 3984 samples (aka cells).

Let us have a look at the RNA counts for 10 cells and their annotation, with the function head from the package utils.

utils::head(
  # Object to visualize
  x = sobj,
  # Number of lines to display
  n = 10
)

There is no counts, normalized or not. Where are they ?

In order to explore the content of sobj, use the function str from the package utils:

utils::str(
  # Object to explore
  object = sobj
)

Alternatively, in RStudio, you can click on the object pane and explore manually the content of the object. If we explore the slot assays, then we find the counts.

You can access them with:

utils::head(
  # Object to visualize
  x = sobj@assays[["RNA"]]@counts,
  # Number of rows to display
  n = 10
)

We have one gene per line, one cell per column, and RNA counts in each row. Are these counts normalized ? Are they scaled ? Are they filtered ? Are they corrected ?

Answer

These counts are normalized, scaled, filtered. This information is available in the seurat object itself, within the slot commands. See an example below:

names(sobj@commands)
[1] "NormalizeData.RNA"        "FindVariableFeatures.RNA"
[3] "ScaleData.RNA"           

However, please be aware that counts in the slot count are raw counts. Normalized counts are in the slot data and scaled data are in the slot scaled.data. And it you do not find that clear, I totally agree with you.


Is it normal that we have so many zeros ? And what about surch low counts, is it because we downsampled the sequenced reads for this session ?

Answer

The large number of null counts is completely normal. In maths/stats we talk about matrix sparcity, i.e a table with lots of zeros. If the data were to be downsampled, we had had done this by cropping over a small chromosome, and not reducing the general amount of reads.


Select a DE method

Available methods

Seurat let us use multiple differential analysis methods with its function FindMarkers.

  1. wilcox: The wilcoxon test tests the mean of expression and looks for a difference in these means.
  2. MAST: This tool has been built for Single Cell. It is based on a statistical model called “Hurdle Model”, which excells with data that contains lots of zeros (which is our case in Single Cell RNA-Seq: most of the genes are not expressed.)
  3. DESeq2: This tool has originally been built for bulk RNA-Seq but now includes specific funcitons for Single Cell. It performs well when counts are highly variable or when you wand to compare a handful of cells.
  4. t-test: The t-test uses a comparison of means to assess differential expression probability.
  5. roc: An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups.

The main question now is how to choose the right test: spoilers, there are no option better than another in all ways.

From Soneson & Robinson (2018) Nature Methods: