---
title: "
EB3I n1 2025 scRNAseq
-
Differential expression analysis"
date: "2025-16-21.22"
author:
- name: "EBAII n1 scRNAseq Team"
- name: "Thibault DAYRIS"
email: "thibault.dayris@gustaveroussy.fr"
- name: "Bastien JOB"
email: "bastien.job@gustaveroussy.fr"
- name: "William JARASSIER"
email: "w.jarassier@gmail.com"
output:
rmdformats::readthedown:
fig_width: 8
fig_height: 6
highlight: tango ## Theme for the code chunks
embed_fonts: TRUE
number_sections: true ## Adds number to headers (sections)
theme: flatly ## CSS theme for the HTML page
collapsed: true ## By default, the TOC is folded
toc_depth: 3
smooth_scroll: true ## Smooth scroll of the HTML page
self_contained: true ## Includes all plots/images within the HTML
code_download: true ## Adds a button to download the Rmd
code_folding: show
thumbnails: false
lightbox: true
fig_caption: false
gallery: true
use_bookdown: true
always_allow_html: true ## Allow plain HTML code in the Rmd
editor_options:
markdown:
wrap: 72
---
```{r knit_setup, echo = FALSE}
knitr::opts_chunk$set(
echo = TRUE, # Print the code
eval = TRUE, # Run command lines
message = FALSE, # Print messages
prompt = FALSE, # Do not display prompt
comment = NA, # No comments on this section
warning = FALSE, # Display warnings
tidy = FALSE,
fig.align="center",
# results = 'hide',
width = 100 # Number of characters per line
)
```
```{css, echo=FALSE}
.notrun {
background-color: lightgrey !important;
border: 3px solid black !important;
}
.notruno {
background-color: lightgrey !important;
color : black !important;
}
.question {
background-color: aquamarine !important;
color : black !important;
border: 3px solid limegreen !important;
}
.questiono {
background-color: aquamarine !important;
color : black !important;
}
.answer {
background-color: navajowhite !important;
border: 3px solid brown !important;
}
.answero {
background-color: navajowhite !important;
color : black !important;
}
.beyond {
background-color: violet !important;
border: 3px solid purple !important;
}
.beyondo {
background-color: violet !important;
color : black !important;
}
```
```{r knit_hook, echo = FALSE}
hooks = knitr::knit_hooks$get()
hook_foldable = function(type) {
force(type)
function(x, options) {
res = hooks[[type]](x, options)
if (isFALSE(options[[paste0("fold.", type)]])) return(res)
paste0(
"Show ", type, "
\n\n",
res,
"\n\n "
)
}
}
knitr::knit_hooks$set(
output = hook_foldable("output"),
plot = hook_foldable("plot")
)
```
------------------------------------------------------------------------
------------------------------------------------------------------------

------------------------------------------------------------------------
------------------------------------------------------------------------
# Start Rstudio
- Using the [OpenOnDemand/Rstudio cheat
sheet](https://moodle.france-bioinformatique.fr/pluginfile.php/1475/mod_folder/content/0/OoD_R_Rstudio.html),
connect to the [OpenOnDemand
portal](https://ondemand.cluster.france-bioinformatique.fr) and
create a Rstudio session with the right resource requirements.
# Warm-up
- We set common parameters we will use throughout this session :
```{r setparam}
# setparam
## Set your project name
# WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !!
project_name <- "ebaii_sc_teachers"
## Control if the project_name exists on the cluster
cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name)))
## Seed for the RNG
my_seed <- 1337L
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Prepare the data structure
We will do the same as for former steps, just changing the session name
:
## Main directory
```{r maindir}
#maindir
## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")
## Creating the root directory
# dir.create(path = TD_dir, recursive = TRUE)
## Print the root directory on-screen
print(TD_dir)
```
## Current session
```{r sessiondir}
# sessiondir
## Creating the session (Preproc.2) directory
session_dir <- paste0(TD_dir, "/05_Proc.2")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)
```
## Input directory
```{r indir}
#indir
## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)
## Print the input directory on-screen
print(input_dir)
```
## Output directory
```{r outdir}
#outdir
## Creating the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)
## Print the output directory on-screen
print(output_dir)
```
------------------------------------------------------------------------
------------------------------------------------------------------------
# Reload the Seurat Object
- We can reload the object we saved at the former step
```{r dataload}
## dataload
## This is the path to the current EB3I backup
sessionid <- '2538_eb3i_n1_2025'
## The latest Seurat object saved as RDS (name)
sobj_file <- "12_TD3A.TDCT_S5_Integrated_12926.3886.RDS"
## The latest Seurat object saved as RDS (full path)
sobj_path <- paste0(TD_dir,
"/05_DEA/RESULTS/",
sobj_file)
force <- TRUE ## To force a re-download of a Zenodo-hosted backup
local <- FALSE ## To force a loading from a local backup
## In case of error/lost data : force a reload from a Zenodo backup repository
if(force) {
zen_id <- "14035293"
zen_backup_file <- paste0("https://zenodo.org/records/",
zen_id,
"/files/",
sobj_file)
## Recreate the expected path if it does not exist
dir.create(path = dirname(sobj_path), recursive = TRUE)
## Download the file
download.file(url = zen_backup_file,
destfile = sobj_path)
}
## In case of error/lost data : force a reload from a local backup repository
if(local) {
sobj_path <- paste0(
"/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/RDS/",
sobj_file)
}
## Load the object
sobj <- readRDS(file = sobj_path)
```
# Forewords
I need three teams for this session:
- Team [`Wilcoxon`](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test)
- Team [`Student`](https://en.wikipedia.org/wiki/Student%27s_t-test)
- Team [`MAST`]
I will also need your help because I can't make [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
work correctly.
But I'm sure, that we will solve my issue: you're in the best
session here at [EB3I](https://github.com/IFB-ElixirFr/EBAII).
## TLDR: R command lines
In this presentation, there will be screen captures for you to follow the
lesson. There will also be every single R command lines.
Do not take care of the command lines if you find them too challenging.
Our goal here, is to **understand** the main mechanism of Differential
Expression Analysis. R is _just a tool_.
Below are the libraries we need to perform this whole session:
```{r load_libraries, eval=TRUE, echo=TRUE, class.source="notrun", class.output="notruno"}
base::library(package = "BiocParallel") # Optionally multithread some steps
base::library(package = "DT") # Display nice tables in HTML
base::library(package = "ggplot2") # Draw nice graphs and plots
base::library(package = "ggpubr") # Draw even nicer graphs
base::library(package = "rstatix") # Base R statistics
base::library(package = "knitr") # Build this presentation
base::library(package = "dplyr") # Handle big tables
base::library(package = "Seurat") # Handle SingleCell analyses
base::library(package = "SeuratObject") # Handle SingleCell objects for Seurat
base::library(package = "SingleCellExperiment") # Handle SingleCell file formats
base::library(package = "UpSetR") # Nice venn-like graphs
base::library(package = "EnhancedVolcano") # Draw Volcano plot
```
Then we join layers:
```{r join_layers_tldr, eval=FALSE, echo=TRUE, class.source="notrun", class.output="notruno"}
# join_layers_tldr
Seurat::Idents(sobj) <- sobj$orig.ident
sobj <- SeuratObject::JoinLayers(sobj)
```
Then we perform differential expression analysis:
```{r run_dea, eval=FALSE, echo=TRUE, class.source="notrun", class.output="notruno"}
# run_dea
sobj_de <- Seurat::FindMarkers(
# Object on which to perform DEA
object = sobj,
# Name of factor in condition 1
ident.1 = "TD3A",
# Name of factor in condition 2
ident.2 = "TDCT"
)
```
And that's all ! Our goal is to understand these lines,
being able to write them is a bonus.
## Purpose of this session
Up to now, we have:
1. Identified to which cell each sequenced reads come from
2. Identified to which gene each read come from
3. Identified possible bias in gene expression for each cell
4. Filtered and corrected these bias as well as we can
We would like to identify the list of genes that characterize differences
between cell cycle phases G1 and S groups.
At the end of this session you will know:
1. how to select a differential analysis method
2. how to select the correct normalization (if any?) that must be provided to
your differential analysis method
3. How to read differential expression results
## Insight
We are wondering what's in our dataset. Let's have a look,
with the function [`print`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/print)
form the package `base`.
```{r print_seurat, eval=TRUE, echo=TRUE}
# print_seurat
base::print(
# Object to display
x = sobj
)
```
We have `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TD3A"))[1]` features (_aka_ genes),
across `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TD3A"))[2]` samples (_aka_ cells) in the TD3A condition. We have `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TDCT"))[1]` features (_aka_ genes),
across `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TDCT"))[2]` samples (_aka_ cells) in the TD3A condition.
Let us have a look at the RNA counts for 10 cells and their annotation,
with the function [`head`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/head) from the package `utils`.
```{r head_seurat_counts_phase, eval=FALSE, echo=TRUE}
# head_seurat_counts_phase
utils::head(
# Object to visualize
x = sobj,
# Number of lines to display
n = 10
)
```
Prettier with a table in HTML :
```{r head_seurat_counts_phase_dt, eval=TRUE, echo=FALSE}
# head_seurat_counts_phase_dt
tmp <- utils::head(x = sobj, n = 10)
DT::datatable(data = tmp)
```
There are no counts, normalized or not !!
Where are they ?
In order to explore the content of `sobj`, use the function `str` from the package `utils`:
```{r str_seurat_object, eval=FALSE, echo=TRUE}
# str_seurat_object
utils::str(
# Object to explore
object = sobj@assays
)
```
Alternatively, in RStudio, you can click on the object pane and explore manually
the content of the object. If we explore the slot `assays`, then we find the counts.
You can access them with:
```{r head_seurat_count_table, eval=FALSE, echo=TRUE}
# head_seurat_count_table
utils::head(
# Object to visualize
x = SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "counts"),
# Number of rows to display
n = 10
)
```
Prettier with a table in HTML :
```{r head_seurat_count_table_dt, eval=TRUE, echo=FALSE}
# head_seurat_count_table_dt
tmp <- utils::head(
x = SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "counts"),
10
)
DT::datatable(
data = base::as.data.frame(tmp)[, 1:5]
)
```
We have one gene per line, one cell per column, and RNA counts in each row.
For the sake of this session, we won't compare the whole dataset. It would take
up to 15 minutes to complete. During the rest of this session, we will compare
the Louvain clusters 8 and 10, from our integration with Harmony.
We need to re-annotate layers to do so. This is done in two steps:
1. Redefine [`Idents`](https://www.rdocumentation.org/packages/Seurat/versions/3.1.4/topics/Idents)
in order to be able to call cells by their cluster names.
2. [`JoinLayers`](https://www.rdocumentation.org/packages/SeuratObject/versions/5.0.2/topics/JoinLayers) to have a single count table with both TD3A and TDCT cells from clusters 8 and 10 together.
```{r ident_and_join_layers, eval=TRUE, echo=TRUE}
# ident_and_join_layers
Seurat::Idents(sobj) <- sobj$HarmonyStandalone_clusters
sobj <- SeuratObject::JoinLayers(sobj)
```
Let's check if our cells are joint:
```{r check_joint_layers}
# check_joint_layers
base::print(sobj)
```
**Question** :
- We have one gene per line, one cell per column, and RNA counts in each row.
```{r q_normscalfilt, class.source="question", eval = FALSE}
# q_normscalfilt
## Are these counts normalized ? scaled ? filtered ?
```
```{r a_normscalfilt, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_normscalfilt
## . These counts are normalized, scaled, filtered for sure.
##
## . This information is available in the seurat object itself,
## within the slot `@commands`.
```
See an example below:
```{r a_normscalfilt2, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_normscalfilt
## . These counts are normalized, scaled, filtered for sure.
##
## . Total counts, max value, counts per cell and number of
## expressed features per cell, all have decreased.
##
## . This information is available in the seurat object itself,
## within the slot `@commands`. See an example below:
```
```{r seurat_history, eval=TRUE, echo=TRUE, class.source = c("fold-hide", "answer")}
# seurat_history
names(sobj@commands)
```
**However**, please remember that :
- Counts in the slot `count` are **raw counts**
- **Normalized counts** are in the slot `data`
- **scaled** data are in the slot `scaled.data` (this one was damn obvious).
And it you do not find that crystal-clear, I totally agree with you.
- Raw counts have many, so many zeros ...
```{r q_sozero, class.source="question", eval = FALSE}
# q_sozero
## Is it normal to you ?
```
```{r a_sozero, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_sozero
## . The large number of null counts is completely normal for this
## "droplet" technology. In maths/stats, we talk about "matrix sparcity",
## i.e a table with lots (lots!) of zeros.
```
- And count values are so low ...
```{r q_lowcounts, class.source="question"}
# q_lowcounts
## Why ? Is it because we downsampled the read depth for this training ?
```
```{r a_lowcounts, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_lowcounts
## . No. Once again, this is intrinsic to the technology used.
##
## . If the data were to be downsampled, we would had done this by cropping
## the count matrix over a small chromosome, or a reduced gene est, not
## by lowering the global amount of reads (depth).
```
# Select a DE method
## Available methods
[Seurat](https://satijalab.org/seurat/articles/de_vignette) let us use multiple
differential analysis methods with its function [`FindMarkers`](https://satijalab.org/seurat/reference/findmarkers).
1. [wilcox](https://www.rdocumentation.org/packages/rstatix/versions/0.7.1): The wilcoxon test tests the mean of expression and looks for a difference in these means.
1. [MAST](https://www.bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAST-Intro.html): This tool has been built for Single Cell. It is based on a statistical model called ["Hurdle Model"](https://en.wikipedia.org/wiki/Hurdle_model), which excells with data that contains lots of zeros (which is our case in Single Cell RNA-Seq: most of the genes are *not* expressed.)
1. [DESeq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#recommendations-for-single-cell-analysis): This tool has originally been built for bulk RNA-Seq but now includes specific functions for Single Cell. It performs well when counts are highly variable or when you wand to compare a handful of cells.
1. [t-test](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test): The t-test uses a comparison of means to assess differential expression probability.
1. [roc](https://en.wikipedia.org/wiki/Receiver_operating_characteristic): In this method, an AUC value of `1` means that expression values for this gene alone can perfectly classify the two groupings, i.e : each of the cells in "cells.1" (condition 1) exhibit a higher level than each of the cells in "cells.2" (condition 2). An AUC value of `0` also means there is perfect classification, but in the opposite direction (higher in condition 2 than in 1). A value of `0.5` implies that the gene has no predictive power to classify the two groups.
The main question now is **how to choose the right test ??**
**Spoilers** : there are no option better than another in all ways ! Thank you for your attention, you can go back home now.
From Soneson & Robinson (2018) Nature Methods:

Here, researchers have designed an **artificial dataset** where they knew in advance the list of differentially expressed genes. They have used all these algorithms and consigned the results.
1. `DESeq2`, `Limma` seem to have a higher number of false positives (genes identified as differentially expressed while they were actually not.)
1. `Wilcoxon` seems to be better in general.
1. `MAST` performs well in absolute
`ANOVA` was not present in this study (but probably would have performed similarly to `t-test`.
## A case: the `Plac8` gene
The question is now to guess whether this gene is differentially expressed, or not.
### Cell observations
Let's have a look at the gene named ['Plac8'](https://www.genecards.org/cgi-bin/carddisp.pl?gene=PLAC8&keywords=plac8)
It is known to be involved in positive regulation of cold-induced thermogenesis,
and positive regulation of transcription by RNA polymerase II.
In order to plot its expression across all cells, we are going to use the
function [`VlnPlot`](https://satijalab.org/seurat/reference/vlnplot)
from the `Seurat` package.
The input object is obviously contained in the `sobj` variable, since it is
our only Seurat object. In addition, we are going to select the feature `Plac8`,
and split the graph according to the clusters we annotated earlier.
```{r seurat_vlnplot_Plac8_demo, eval=TRUE, echo=TRUE}
# seurat_vlnplot_Plac8_demo
Seurat::VlnPlot(
# A subset of the Seurat object
# limited to clusters 8 and 10,
# or else we will plot all the clusters
object = subset(sobj, HarmonyStandalone_clusters %in% c("8", "10")),
# The name of the gene of interest (feature = gene)
features = "Plac8",
# The name of the Seurat cell annotation
split.by = "HarmonyStandalone_clusters",
# Change color for presentation
cols = c("darkslategray3", "olivedrab")
)
```
```{r q_isplac8de, class.source="question", eval = FALSE}
# q_isplac8de
## Using your intuition, is this gene differentially expressed
## between clusters 8 and 10 ?
```
```{r a_isplac8de, class.source = c("fold-hide", "answer"), eval = FALSE}
# q_isplac8de
## . In cluster 10, the violin plot highlights almost no cells with low or
## zero `Plac8` expression. The highest density of cells has a `Plac8`
## normalized expression aroung ~1.5.
##
## . In cluster 8, cells seem to have no expression of `Plac8` at all.
##
## . IMHO (feel free to disagree), the expression of the gene `Plac8` differs
## between our clusters 8 and 10. This is purely intuitive.
```
Now, let's check the same gene expression, but comparing cells tagged as in
the G1 and S phase.
```{r vlnplot_seurat_group_phase_code, echo=TRUE, eval=TRUE}
# vlnplot_seurat_group_phase_code
Seurat::VlnPlot(
# The Seurat object
object = sobj,
# The name of the gene of interest (feature = gene)
features = "Plac8",
# The name of the Seurat cell-cycle annotation
group.by = "CC_Seurat_Phase",
# Change color for presentation
cols = c("darkslategray3", "olivedrab", "orangered3")
)
```
```{r q_isplac8deG1S, class.source="question", eval = FALSE}
# q_isplac8deG1S
## Using your intuition, is this gene differentially expressed
## between the G1 and S cell cycle phases ?
```
```{r a_isplac8deG1S, class.source = c("fold-hide", "answer"), eval = FALSE}
# a_isplac8deG1S
## . This one seems more tricky... One can observe that most of the expression
## is null, some cells express the gene.
##
## . We need to investigate a bit deeper !
```
Okay, let's get some information about these expression distributions.
```{r general_count_table, eval=TRUE, echo=TRUE}
# general_count_table
## Store counts in a variable
counts <- base::as.data.frame(
# The matrix to reformat into a dataframe
x = SeuratObject::GetAssayData(
object = sobj,
assay = "RNA",
layer = "data")
)
## Rename cells with cell harmony cluster
base::colnames(counts) <- paste(
# The names of the cell cluster for each cell
sobj$CC_Seurat_Phase,
# The names of the cells themselves
colnames(sobj),
sep = "_"
)
```
Let's check our `counts` variable :
```{r general_count_table_display, eval=TRUE, echo=FALSE}
# general_count_table_display
DT::datatable(head(counts))
```
We have `r length(colnames(sobj[, sobj$CC_Seurat_Phase == "G1"]))` cells within the `G1 group` :
```{r Plac8_summaries_G1, eval=TRUE}
# Plac8_summaries_G1
countsG1 <- select(counts, matches("^G1."))
Plac8G1 <- base::as.data.frame(base::t(countsG1["Plac8", ]))
base::summary(Plac8G1)
```
We have `r length(colnames(sobj[, sobj$CC_Seurat_Phase == "S"]))` cells withing the `S group` :
```{r Plac8_summaries_S, eval=TRUE}
# Plac8_summaries_S
countsS <- select(counts, matches("^S."))
Plac8S <- base::as.data.frame(base::t(countsS["Plac8", ]))
base::summary(Plac8S)
```
### From biology to statistics
Okay, let us resort on statistics to evaluate our chances to be guess
correctly, or our risks to guess wrong.
We have lots of observations :
- `r base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "G1"]))`
cells within the G1 phase
- `r base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "S"]))`
cells within the S phase
Statisticians really like to have **a lot of observations** !
Ideally, statisticians never have enough observations, but at least they prefer
when they have **more than tests**.
Here, we have a total of `r base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "G1"])) + base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "S"]))`
observations and we are testing a single gene.
We live in a statistician's wet dream !
Now, if we think like a statistician, we can ask ourselves :
- Are our cells supposed to be interacting with each others ?
- Or are they independent from each others ?
- Does the expression in our cells follow a normal distribution ?
All of this is very important, and usually, it requires a discussion.
For the normal distribution, there is an easy check.
Let's draw the expression like we did above.
First, we use the function [`rbind`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind)
from `base` package.
Be careful, the function `rbind` also exists
in `DelayedArray`, `data.table`, and `BiocGenerics` packages !
We want to use the "basic" one.
```{r distribution_Plac8_table_build, eval=TRUE, echo=TRUE}
# distribution_Plac8_table_build
## Add column idientifiers in each count dataframes
Plac8G1$phase <- "G1"
Plac8S$phase <- "S"
## Paste the rows beneith each other
Plac8 <- base::rbind(
## variable pointing to G1 counts
Plac8G1,
## variable pointing to S counts
Plac8S,
## A Boolean, requesting that strings/characters
## should not be casted as `logical`. It breaks graphs.
stringsAsFactors = FALSE
)
```
Secondly, we use the function [`gghistogram`](https://www.rdocumentation.org/packages/ggpubr/versions/0.6.0/topics/gghistogram) from the package `ggpubr` in order to display relative abundance of gene expression :
```{r distribution_Plac8_display, eval=TRUE, echo=TRUE}
# distribution_Plac8_display
ggpubr::gghistogram(
Plac8,
x = "Plac8",
y = '..density..',
fill = "steelblue",
bins = 15,
add_density = TRUE
)
```
Our distribution doesn't seem to be normal, nor binomial ... we will have to
rely on **non-parametric** tests.
Let's run a non-parametric test based on the mean of distributions,
since it's the clothest to our 'intuitive' approach. Let there be
[Wilcoxon](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) test.
In R, it's quite straightforward : we have the function
[`wilcoxon_test`](https://www.rdocumentation.org/packages/rstatix/versions/0.7.1)
to perform the test, then we can plot the result.
```{r wilcoxon_Plac8, eval = TRUE, echo=TRUE}
# wilcoxon_Plac8
## On the expression table stored in the varialbe `Plac8`,
## first apply the function `wilcox_test` from package `rstatix`,
## then we apply the function `add_significance` from package `rstatix`
stat.test <- Plac8 %>% rstatix::wilcox_test(Plac8 ~ phase) %>% rstatix::add_significance()
# While doing so, we usually also compute effect size
eff.size <- Plac8 %>% rstatix::wilcox_effsize(Plac8 ~ phase)
```
```{r display_wilcoxon_Plac8_result, echo = FALSE, eval = TRUE}
# display_wilcoxon_Plac8_result
DT::datatable(stat.test, caption = "Wilcoxon test result")
stat.test <- Plac8 %>% rstatix::t_test(Plac8 ~ phase) %>% rstatix::add_significance()
```
Wilcoxon test says: the distributions are different, with a `r stat.test$p * 100`
% of risks of being wrong. The gene `Plac8` can safely be said **differentially
expressed**.
We can compute a fold change and conclude.
**NOTE** : Just out of curiosity, be aware that [`t_test`](https://www.rdocumentation.org/packages/rstatix/versions/0.7.2/topics/t_test)
from `rstatix` package, gives the same answer (with different significance).
However, [`DESeq`](https://satijalab.org/seurat/reference/findmarkers) gives
a p-value of 0.05 and an adjusted p-value equal to 1 !
Depending on the test used, this gene is, or is not differentially expressed.
## Conclusion
In conclusion, to select your DEA method, use the following rules of thumb :
1. If you have already done a study with one of these methods, keep using
the same. This is crutial if you ever want to compare your new result with
the old ones.
2. If you want to compare your study with a published one, use the same method.
3. If you have no idea, use Wilcoxon.
4. If you have **bulk** data analyzed with `DESeq2`/`Limma`, use
`DESeq2`/`Limma`. It will be easier to take both works in consideration.
**Please, never ever use a simple Wilcoxon on bulk RNA-Seq data !!**
## NB MAST DE
MAST uses a hurdle model divided into two parts:
- Discrete – models the probability that a feature is detected in a cell (present or not).
- Continuous – models the level of expression for the feature in cells where it is detected.
MAST also uses the Cellular Detection Rate (CDR) as a covariate to adjust for sequencing effects, calculated as the fraction of detected genes per cell
(nFeature barcode >0 / nHVG). It is also possible to include additional covariates, such as batch or experimental metrics, when analyzing more than two samples.
# Select a dataset
## Dataset depends on selected method
There it is quite easier:
```{r choose_counts, eval=TRUE, results='asis', echo=FALSE}
choose_counts <- as.data.frame(t(data.frame(
wilcox = c("Normalized counts", "sojb@assays[['RNA']]@data"),
t_test = c("Normalized counts", "sojb@assays[['RNA']]@data"),
ROC = c("Normalized counts", "sojb@assays[['RNA']]@data"),
ANOVA = c("Normalized counts", "sojb@assays[['RNA']]@data"),
MAST = c("Raw counts", "sojb@assays[['RNA']]@counts"),
DESeq2 = c("Raw counts", "sojb@assays[['RNA']]@counts"),
Limma = c("Raw counts", "sojb@assays[['RNA']]@counts")
)))
colnames(choose_counts) <- c("Counts", "slot name")
DT::datatable(choose_counts, caption = "How to select your counts")
```
## FindMarkers
With the function [`FindMarkers`](https://satijalab.org/seurat/reference/findmarkers)
from package `Seurat`, we want to make three groups:
1. One using `wilcoxon` to perform DEA between clusters "8" and "10".
1. One using `t`-test to perform DEA between clusters "8" and "10".
1. One using `MAST` to perform DEA between "8" and "10".
We will observe the results and compare our gene lists.
Hey, why are you looking at me? It's your turn to work! Use the all the
notions seen above to select the right counts (`slot`), the right input
object, and the right arguments.
10 minutes practice !
Answers
Here are the code for each team:
```{r findmarkers_all_de, echo=TRUE, eval=TRUE}
sobj_wilcox <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "wilcox"
)
sobj_t <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "t"
)
sobj_mast <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "MAST"
)
```
```{r load_dea, eval=TRUE, echo=FALSE}
base::saveRDS(file="sobj_wilcox.RDS", object=sobj_wilcox)
```
In the function argument, there is a FoldChange threshold. Should we
filter gene expression based on FoldChange? In case of positive answer,
how much should that threshold be?
Answer
About thresholds on FDR (False Discovery Rate) and Log2(FC) (Log of the Fold Change), there are many discussions.
Remember, here the threshold on Fold Change is Logged. A `log2(1) = ``r log2(1)`. And keep in mind the following:
1. If one selects a fold change threshold above 1.7, then their study concludes that smoking is not related to lung cancer.
1. If one selects a fold change threshold above 1, then their study concludes that a fast-food based diet does not lead to weight gain.
1. If one selects a fold change threshold above 1/25 000 000, then their study concludes: it is a complete hazard that mice have fetal malformation when in presence of Bisphanol.
In conclusion, there one, and only one reason to filter on fold change: in my experience, a fold change below 0.7 will be hard to see/verify on wet-lab (qRT).
If you need to reduce a too large number of differentially expressed genes, then reduce the FDR to 0.01, or even better, to 0.001. With that, you reduce your number of false claims.
Can you help me with `DEseq2`?
When I run the following command line, I have an error :
```{r seurat_run_deseq_with_error, eval=FALSE, echo=TRUE}
sobj_deseq2 <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = 8,
# Name of condition 2
ident.2 = 10,
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "deseq2",
# Use non-normalized data with DESeq2
slot = "counts"
)
```
> Error in PerformDE(object = object, cells.1 = cells.1, cells.2 = cells.2, :
> Unknown test: deseq2
Answer
Oh! My fault, it was a typo in my command! Thank you all for your help!
```{r build_count_p1_matrix, eval=TRUE, echo=TRUE}
sobj_deseq2 <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "DESeq2",
# Use non-normalized data with DESeq2
slot = "counts"
)
```
Remark: by doing surch modification, some fold changes have been modified:
remember the gene Atad2 with a mean expression of 0.08 in G1, and 0.2 in S
phases? Mean expressions are now around 1.08 for G1, and 1.2 for S phases.
This may be the reason why it was not differentially expressed in DESeq2,
while Wilcoxon and T-test returned adjusted pvalues far below 0.05.
```{r save_de_results, eval=TRUE, echo=FALSE}
base::saveRDS(sobj_wilcox, "sobj_wilcox.RDS")
base::saveRDS(sobj_t, "sobj_t.RDS")
base::saveRDS(sobj_roc, "sobj_roc.RDS")
base::saveRDS(sobj_deseq2, "sobj_deseq2.RDS")
```
# Explore results
## Big tables
Let us have a look at the results:
```{r sobj_w_res_display, eval=TRUE, echo=FALSE}
DT::datatable(
head(sobj_wilcox, n = 10),
caption = "Wilcoxon test results"
)
```
We have the following columns:
1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.
1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition.
1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8.
1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10.
1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
```{r sobj_t_res_display, eval=TRUE, echo=FALSE}
DT::datatable(
head(sobj_t, n = 10),
caption = "T-test results"
)
```
We have the following columns:
1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.
1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition.
1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8.
1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10.
1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
```{r sobj_roc_res_display, eval=TRUE, echo=FALSE}
DT::datatable(
head(sobj_mast, n = 10),
caption = "MAST test results"
)
```
We have the following columns:
1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.
1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition.
1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8.
1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10.
1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
```{r sobj_deseq_res_display, eval=TRUE, echo=FALSE}
DT::datatable(
head(sobj_t, n = 10),
caption = "T-test results"
)
```
We have the following columns:
1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.
1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in each condition.
1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8.
1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10.
1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
## Filter DEA results
What kind of threshold should be used to filter each results?
```{r extract_de_genes, eval=TRUE, echo=FALSE}
# We store a `list` in a variable called `data`
# The function `list` comes from `base` and not `biocGenerics`.
data <- base::list(
# We use a threshold of 5% on adjusted p-values
wilcox = base::rownames(sobj_wilcox[sobj_wilcox$p_val_adj <= 0.05, ]),
# We use a threshold of 5% on adjusted p-values
t_test = base::rownames(sobj_t[sobj_t$p_val_adj <= 0.05, ]),
# We use a threshold of 0.2 in AUC power
mast = base::rownames(sobj_mast[sobj_roc$power >= 0.2, ]),
# We use a threshold of 5% on adjusted p-values
deseq2 = base::rownames(sobj_deseq2[sobj_deseq2$p_val_adj <= 0.05, ])
)
```
> If we must label certain scores as good or bad, we can reference the
following rule of thumb:
>
> 0.5 = No discrimination
> 0.5-0.7 = Poor discrimination
> 0.7-0.8 = Acceptable discrimination
> 0.8-0.9= Excellent discrimination
> 0.9 = Outstanding discrimination
Hosmer and Lemeshow in Applied Logistic Regression (p. 177)
## Add results to Seurat objects
We'd like to store the results of differential expression analysis in
the `Seurat` object.
```{r add_results_seurat, echo=TRUE, eval=TRUE}
sobj@misc$wilcox <- sobj_wilcox
```
```{r save_sobjw, eval=TRUE, echo=FALSE}
base::saveRDS(sobj, "DEA_Scaled_Normalized_Filtered.RDS")
```
## Common results
Now we can plot intersections in an up-set graph. It is like a Venn diagram:
```{r upset_seurat_de_methods, eval=TRUE, echo=TRUE}
UpSetR::upset(
data = UpSetR::fromList(data),
order.by = "freq"
)
```
## Others functions in Seurat for DEA
There are two other functions in Seurat package to do DEA in different context. For example, you will see the difference between one cluster versus many of them. You could be use `FindAllMarkers`. Usage is equivalent to `FindMarkers` but it need to indicate which variable cluster variable.
```{r findallmarkers_dea}
sobj_findallmarkers_mast <- FindAllMarkers(sobj,
assay="RNA",
group.by="HarmonyStandalone_clusters",
test.use = "MAST",
only.pos = T,
logfc.threshold = 0.9,
min.pct = 0.1,
random.seed = my_seed)
DT::datatable(
head(sobj_findallmarkers_mast, n = 10),
caption = "MAST results"
)
p <- EnhancedVolcano(
toptable = sobj_findallmarkers_mast,
lab = rownames(sobj_findallmarkers_mast),
x = "avg_log2FC",
y = "p_val_adj",
FCcutoff = 0.2,
drawConnectors = FALSE,
title = 'Volcano by cluster'
)
p + facet_wrap(~ cluster)
```
Another function is `FindConservedMarkers`. It the same function of `FindMarkers` but this usage is to identify the same features expression between clusters
```{r findconservedmarkers_dea}
sobj_conservedmarkers_mast <- Seurat::FindConservedMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
grouping.var = "orig.ident",
# Differential analysis method
assay="RNA",
slot="counts"
)
DT::datatable(
head(sobj_conservedmarkers_mast, n = 10),
caption = "MAST results"
)
```
1. `TDCT_p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.
1. `TDCT_avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in TDCT condition.
1. `TDCT_pct.1`: Percent of cells with gene expression in condition one, here in cluster 8 in TDCT condition.
1. `TDCT_pct.2`: Percent of cells with gene expression in condition two, here in cluster 10 in TDCT condition.
1. `TDCT_p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error in TDCT condition.
1. `TD3A_p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.
1. `TD3A_avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in TD3A condition.
1. `TD3A_pct.1`: Percent of cells with gene expression in condition one, here in cluster 8 in TD3A condition.
1. `TD3A_pct.2`: Percent of cells with gene expression in condition two, here in cluster 10 in TD3A condition.
1. `TD3A_p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error in TD3A condition.
1. `max_pval`: Maximum of p_val between comparison TDCT vs TD3A. It must be low to indicate a conserved feature
1. `minimump_p_val`: Minimum of p_val between comparison TDCT vs TD3A.
```{r volcano_plac8 _by_sample}
Seurat::VlnPlot(
# A subset of the Seurat object
# limited to clusters 8 and 10,
# or else we will plot all the clusters
object = subset(sobj, HarmonyStandalone_clusters %in% c("8", "10")),
# The name of the gene of interest (feature = gene)
features = "Plac8",
# The name of the Seurat cell annotation
split.by = "orig.ident",
# Change color for presentation
cols = c("blue", "red")
)
sobj_conservedmarkers_mast$direction = case_when(
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == sign(sobj_conservedmarkers_mast$TD3A_avg_log2FC) &
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == -1 ~ "negative",
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == sign(sobj_conservedmarkers_mast$TD3A_avg_log2FC) &
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == 1 ~ "positive",
.default = "opposite")
t_dataframe <- table(sobj_conservedmarkers_mast$direction,sobj_conservedmarkers_mast$max_pval<0.05)[,2]
levels_direction <- paste0(names(t_dataframe)," (nFeature=",t_dataframe,")")
sobj_conservedmarkers_mast_filtered = sobj_conservedmarkers_mast[sobj_conservedmarkers_mast$minimump_p_val<0.05,]
sobj_conservedmarkers_mast_filtered$direction = factor(sobj_conservedmarkers_mast_filtered$direction,
levels=c("negative","opposite","positive"),
labels=levels_direction)
ggplot() +
geom_point(data=sobj_conservedmarkers_mast_filtered,
aes(x = TDCT_avg_log2FC, y = TD3A_avg_log2FC, color = direction)) +
scale_color_manual(values=c("lightblue","grey70","red")) +
theme_bw() +
labs(x = "TDCT Average Log2 FoldChange cluster 8 vs cluster 12",
y = "TD3A Average Log2 FoldChange cluster 8 vs cluster 12",
title = "Result of Wilcoxon Test between cluster 8 vs cluster 12 between condition TDCT vs TD3A",
color = "Direction Log2FoldChange")
```
## Heatmap
We'd like to display the expression of genes identified by FindMarkers.
Then we use the function [`DoHeatmap`](https://satijalab.org/seurat/reference/doheatmap)
from the package `Seurat`.
In order to limit the graph to differentially expressed reads, we use the
function [`rownames`](https://rdocumentation.org/packages/base/versions/3.6.2/topics/row.names)
from R `base` package on the DEA result table. In this example, I use
the results of wilcoxon, but you shall use any of the results you previously
obtained.
```{r seurat_heatmap, eval=TRUE, echo=TRUE}
Seurat::DoHeatmap(
# variable pointing to seurat object
object = sobj,
# name of DE genes
features = base::rownames(sobj_wilcox),
# Cluster annotation
group.by = "HarmonyStandalone_clusters",
)
```
## Volcano plot
A Volcano plot is usefull to identify differnetial expression
analysis bias.
The package `EnhancedVolcano` has an [eponym](https://bioconductor.org/packages/3.17/bioc/html/EnhancedVolcano.html)
function for that:
```{r enhanced_volcanoplot, eval=TRUE, echo=TRUE}
EnhancedVolcano::EnhancedVolcano(
# variable pointing to the DEA results
toptable = sobj_wilcox,
# Gene names
lab = rownames(sobj_wilcox),
# Column in which to find Fold Change
x = "avg_log2FC",
# Column in which to find confidence interval
y = "p_val_adj",
# Lower fold-change cut-off
FCcutoff = 0.2
)
```
## Session Info
This list of all packages used while you work should be included
in each en every R presentation:
```{r session_info, eval=TRUE, echo=TRUE}
utils::sessionInfo()
```