I need three teams for this session: team Wilcoxon
,
team Student
,
and team ROC
.
I will also need your help because I can’t make DESeq2 work correctly. But I’m sure, that we will solve my issue: you’re in the best session here at EBAII.
In this presentation, there will be screen captures for you to follow the lesson. There will also be every single R command lines. Do not take care of the command lines if you find them too challenging. Our goal here, is to understand the main mechanism of Differential Expression Analysis. R is just a tool.
Below are the libraries we need to perform this whole session:
base::library(package = "BiocParallel") # Optionally multithread some steps
base::library(package = "DT") # Display nice table in HTML
base::library(package = "ggplot2") # Draw graphs and plots
base::library(package = "ggpubr") # Draw nicer graphs
base::library(package = "rstatix") # Base R statistics
base::library(package = "knitr") # Build this presentation
base::library(package = "dplyr") # Handle big tables
base::library(package = "Seurat") # Handle SingleCell analyses
base::library(package = "SeuratObject") # Handle SingleCell objects
base::library(package = "SingleCellExperiment") # Handle SingleCell file formats
base::library(package = "UpSetR") # Nice venn-like graphs
base::library(package = "EnhancedVolcano") # Draw Volcano plot
First, we load Seurat object:
Then we perform differential expression analysis:
sobj_de <- Seurat::FindMarkers(
# Object on which to perform DEA
object = sobj,
# Name of factor in condition 1
ident.1 = "TD3A",
# Name of factor in condition 2
ident.2 = "TDCT"
)
And that’s all. Our goal is to understand these lines, being able to write them is a bonus.
Up to now, we have:
We would like to identify the list of genes that characterize differences between cell cycle phases G1 and S groups.
At the end of this session you will know:
You already have a dataset loaded ? Good. Keep on going with it ! You don’t have one ? Use mine:
# We store in the variable `sobj`, the result of
# the function `readRDS` from package `base`.
sobj <- base::readRDS(
# Path to the RDS file
file = "Scaled_Normalized_Seurat_Object.RDS"
)
The command above, uses the function readRDS
from base
R package. and git it the path to the RDS
file we built in previous session.
We are wondering what’s in our dataset. Let’s have a look, with the
function print
form the package base
.
An object of class Seurat
6941 features across 3984 samples within 1 assay
Active assay: RNA (6941 features, 2000 variable features)
We have 6941 features (aka genes), accross 3984 samples (aka cells).
Let us have a look at the RNA counts for 10 cells and their
annotation, with the function head
from the package utils
.
There is no counts, normalized or not. Where are they ?
In order to explore the content of sobj
, use the
function str
from the package utils
:
Alternatively, in RStudio, you can click on the object pane and
explore manually the content of the object. If we explore the slot
assays
, then we find the counts.
You can access them with:
utils::head(
# Object to visualize
x = sobj@assays[["RNA"]]@counts,
# Number of rows to display
n = 10
)
We have one gene per line, one cell per column, and RNA counts in each row. Are these counts normalized ? Are they scaled ? Are they filtered ? Are they corrected ?
These counts are normalized, scaled, filtered. This information is
available in the seurat object itself, within the slot
commands
. See an example below:
[1] "NormalizeData.RNA" "FindVariableFeatures.RNA"
[3] "ScaleData.RNA"
However, please be aware that counts in the slot
count
are raw counts. Normalized counts are in the slot
data
and scaled data are in the slot
scaled.data
. And it you do not find that clear, I totally
agree with you.
Is it normal that we have so many zeros ? And what about surch low counts, is it because we downsampled the sequenced reads for this session ?
The large number of null counts is completely normal. In maths/stats we talk about matrix sparcity, i.e a table with lots of zeros. If the data were to be downsampled, we had had done this by cropping over a small chromosome, and not reducing the general amount of reads.
Seurat let
us use multiple differential analysis methods with its function FindMarkers
.
The main question now is how to choose the right test: spoilers, there are no option better than another in all ways.
From Soneson & Robinson (2018) Nature Methods: