--- title: "EB3I n1 2025 scRNAseq
-
Differential expression analysis" date: "2025-16-21.22" author: - name: "EBAII n1 scRNAseq Team" - name: "Thibault DAYRIS" email: "thibault.dayris@gustaveroussy.fr" - name: "Bastien JOB" email: "bastien.job@gustaveroussy.fr" - name: "William JARASSIER" email: "w.jarassier@gmail.com" output: rmdformats::readthedown: fig_width: 8 fig_height: 6 highlight: tango ## Theme for the code chunks embed_fonts: TRUE number_sections: true ## Adds number to headers (sections) theme: flatly ## CSS theme for the HTML page collapsed: true ## By default, the TOC is folded toc_depth: 3 smooth_scroll: true ## Smooth scroll of the HTML page self_contained: true ## Includes all plots/images within the HTML code_download: true ## Adds a button to download the Rmd code_folding: show thumbnails: false lightbox: true fig_caption: false gallery: true use_bookdown: true always_allow_html: true ## Allow plain HTML code in the Rmd editor_options: markdown: wrap: 72 --- ```{r knit_setup, echo = FALSE} knitr::opts_chunk$set( echo = TRUE, # Print the code eval = TRUE, # Run command lines message = FALSE, # Print messages prompt = FALSE, # Do not display prompt comment = NA, # No comments on this section warning = FALSE, # Display warnings tidy = FALSE, fig.align="center", # results = 'hide', width = 100 # Number of characters per line ) ``` ```{css, echo=FALSE} .notrun { background-color: lightgrey !important; border: 3px solid black !important; } .notruno { background-color: lightgrey !important; color : black !important; } .question { background-color: aquamarine !important; color : black !important; border: 3px solid limegreen !important; } .questiono { background-color: aquamarine !important; color : black !important; } .answer { background-color: navajowhite !important; border: 3px solid brown !important; } .answero { background-color: navajowhite !important; color : black !important; } .beyond { background-color: violet !important; border: 3px solid purple !important; } .beyondo { background-color: violet !important; color : black !important; } ``` ```{r knit_hook, echo = FALSE} hooks = knitr::knit_hooks$get() hook_foldable = function(type) { force(type) function(x, options) { res = hooks[[type]](x, options) if (isFALSE(options[[paste0("fold.", type)]])) return(res) paste0( "

Show ", type, "

\n\n", res, "\n\n

" ) } } knitr::knit_hooks$set( output = hook_foldable("output"), plot = hook_foldable("plot") ) ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ ![](eb3i_banner.png) ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Start Rstudio - Using the [OpenOnDemand/Rstudio cheat sheet](https://moodle.france-bioinformatique.fr/pluginfile.php/1475/mod_folder/content/0/OoD_R_Rstudio.html), connect to the [OpenOnDemand portal](https://ondemand.cluster.france-bioinformatique.fr) and create a Rstudio session with the right resource requirements. # Warm-up - We set common parameters we will use throughout this session : ```{r setparam} # setparam ## Set your project name # WARNING : Do not just copy-paste this ! It's MY project name ! Put YOURS !! project_name <- "ebaii_sc_teachers" ## Control if the project_name exists on the cluster cat('PATH CHECK : ', dir.exists(paste0('/shared/projects/', project_name))) ## Seed for the RNG my_seed <- 1337L ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Prepare the data structure We will do the same as for former steps, just changing the session name : ## Main directory ```{r maindir} #maindir ## Preparing the path TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD") ## Creating the root directory # dir.create(path = TD_dir, recursive = TRUE) ## Print the root directory on-screen print(TD_dir) ``` ## Current session ```{r sessiondir} # sessiondir ## Creating the session (Preproc.2) directory session_dir <- paste0(TD_dir, "/05_Proc.2") dir.create(path = session_dir, recursive = TRUE) ## Print the session directory on-screen print(session_dir) ``` ## Input directory ```{r indir} #indir ## Creating the INPUT data directory input_dir <- paste0(session_dir, "/DATA") dir.create(path = input_dir, recursive = TRUE) ## Print the input directory on-screen print(input_dir) ``` ## Output directory ```{r outdir} #outdir ## Creating the OUTPUT data directory output_dir <- paste0(session_dir, "/RESULTS") dir.create(path = output_dir, recursive = TRUE) ## Print the output directory on-screen print(output_dir) ``` ------------------------------------------------------------------------ ------------------------------------------------------------------------ # Reload the Seurat Object - We can reload the object we saved at the former step ```{r dataload} ## dataload ## This is the path to the current EB3I backup sessionid <- '2538_eb3i_n1_2025' ## The latest Seurat object saved as RDS (name) sobj_file <- "12_TD3A.TDCT_S5_Integrated_12926.3886.RDS" ## The latest Seurat object saved as RDS (full path) sobj_path <- paste0(TD_dir, "/05_DEA/RESULTS/", sobj_file) force <- TRUE ## To force a re-download of a Zenodo-hosted backup local <- FALSE ## To force a loading from a local backup ## In case of error/lost data : force a reload from a Zenodo backup repository if(force) { zen_id <- "14035293" zen_backup_file <- paste0("https://zenodo.org/records/", zen_id, "/files/", sobj_file) ## Recreate the expected path if it does not exist dir.create(path = dirname(sobj_path), recursive = TRUE) ## Download the file download.file(url = zen_backup_file, destfile = sobj_path) } ## In case of error/lost data : force a reload from a local backup repository if(local) { sobj_path <- paste0( "/shared/projects/", sessionid, "/atelier_scrnaseq/TD/BACKUP/RDS/", sobj_file) } ## Load the object sobj <- readRDS(file = sobj_path) ``` # Forewords I need three teams for this session: - Team [`Wilcoxon`](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) - Team [`Student`](https://en.wikipedia.org/wiki/Student%27s_t-test) - Team [`MAST`] I will also need your help because I can't make [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) work correctly. But I'm sure, that we will solve my issue: you're in the best session here at [EB3I](https://github.com/IFB-ElixirFr/EBAII). ## TLDR: R command lines In this presentation, there will be screen captures for you to follow the lesson. There will also be every single R command lines. Do not take care of the command lines if you find them too challenging. Our goal here, is to **understand** the main mechanism of Differential Expression Analysis. R is _just a tool_. Below are the libraries we need to perform this whole session: ```{r load_libraries, eval=TRUE, echo=TRUE, class.source="notrun", class.output="notruno"} base::library(package = "BiocParallel") # Optionally multithread some steps base::library(package = "DT") # Display nice tables in HTML base::library(package = "ggplot2") # Draw nice graphs and plots base::library(package = "ggpubr") # Draw even nicer graphs base::library(package = "rstatix") # Base R statistics base::library(package = "knitr") # Build this presentation base::library(package = "dplyr") # Handle big tables base::library(package = "Seurat") # Handle SingleCell analyses base::library(package = "SeuratObject") # Handle SingleCell objects for Seurat base::library(package = "SingleCellExperiment") # Handle SingleCell file formats base::library(package = "UpSetR") # Nice venn-like graphs base::library(package = "EnhancedVolcano") # Draw Volcano plot ``` Then we join layers: ```{r join_layers_tldr, eval=FALSE, echo=TRUE, class.source="notrun", class.output="notruno"} # join_layers_tldr Seurat::Idents(sobj) <- sobj$orig.ident sobj <- SeuratObject::JoinLayers(sobj) ``` Then we perform differential expression analysis: ```{r run_dea, eval=FALSE, echo=TRUE, class.source="notrun", class.output="notruno"} # run_dea sobj_de <- Seurat::FindMarkers( # Object on which to perform DEA object = sobj, # Name of factor in condition 1 ident.1 = "TD3A", # Name of factor in condition 2 ident.2 = "TDCT" ) ``` And that's all ! Our goal is to understand these lines, being able to write them is a bonus. ## Purpose of this session Up to now, we have: 1. Identified to which cell each sequenced reads come from 2. Identified to which gene each read come from 3. Identified possible bias in gene expression for each cell 4. Filtered and corrected these bias as well as we can We would like to identify the list of genes that characterize differences between cell cycle phases G1 and S groups. At the end of this session you will know: 1. how to select a differential analysis method 2. how to select the correct normalization (if any?) that must be provided to your differential analysis method 3. How to read differential expression results ## Insight We are wondering what's in our dataset. Let's have a look, with the function [`print`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/print) form the package `base`. ```{r print_seurat, eval=TRUE, echo=TRUE} # print_seurat base::print( # Object to display x = sobj ) ``` We have `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TD3A"))[1]` features (_aka_ genes), across `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TD3A"))[2]` samples (_aka_ cells) in the TD3A condition. We have `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TDCT"))[1]` features (_aka_ genes), across `r base::dim(SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "data.TDCT"))[2]` samples (_aka_ cells) in the TD3A condition. Let us have a look at the RNA counts for 10 cells and their annotation, with the function [`head`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/head) from the package `utils`. ```{r head_seurat_counts_phase, eval=FALSE, echo=TRUE} # head_seurat_counts_phase utils::head( # Object to visualize x = sobj, # Number of lines to display n = 10 ) ``` Prettier with a table in HTML : ```{r head_seurat_counts_phase_dt, eval=TRUE, echo=FALSE} # head_seurat_counts_phase_dt tmp <- utils::head(x = sobj, n = 10) DT::datatable(data = tmp) ``` There are no counts, normalized or not !! Where are they ? In order to explore the content of `sobj`, use the function `str` from the package `utils`: ```{r str_seurat_object, eval=FALSE, echo=TRUE} # str_seurat_object utils::str( # Object to explore object = sobj@assays ) ``` Alternatively, in RStudio, you can click on the object pane and explore manually the content of the object. If we explore the slot `assays`, then we find the counts. You can access them with: ```{r head_seurat_count_table, eval=FALSE, echo=TRUE} # head_seurat_count_table utils::head( # Object to visualize x = SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "counts"), # Number of rows to display n = 10 ) ``` Prettier with a table in HTML : ```{r head_seurat_count_table_dt, eval=TRUE, echo=FALSE} # head_seurat_count_table_dt tmp <- utils::head( x = SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "counts"), 10 ) DT::datatable( data = base::as.data.frame(tmp)[, 1:5] ) ``` We have one gene per line, one cell per column, and RNA counts in each row. For the sake of this session, we won't compare the whole dataset. It would take up to 15 minutes to complete. During the rest of this session, we will compare the Louvain clusters 8 and 10, from our integration with Harmony. We need to re-annotate layers to do so. This is done in two steps: 1. Redefine [`Idents`](https://www.rdocumentation.org/packages/Seurat/versions/3.1.4/topics/Idents) in order to be able to call cells by their cluster names. 2. [`JoinLayers`](https://www.rdocumentation.org/packages/SeuratObject/versions/5.0.2/topics/JoinLayers) to have a single count table with both TD3A and TDCT cells from clusters 8 and 10 together. ```{r ident_and_join_layers, eval=TRUE, echo=TRUE} # ident_and_join_layers Seurat::Idents(sobj) <- sobj$HarmonyStandalone_clusters sobj <- SeuratObject::JoinLayers(sobj) ``` Let's check if our cells are joint: ```{r check_joint_layers} # check_joint_layers base::print(sobj) ``` **Question** : - We have one gene per line, one cell per column, and RNA counts in each row. ```{r q_normscalfilt, class.source="question", eval = FALSE} # q_normscalfilt ## Are these counts normalized ? scaled ? filtered ? ```
```{r a_normscalfilt, class.source = c("fold-hide", "answer"), eval = FALSE} # a_normscalfilt ## . These counts are normalized, scaled, filtered for sure. ## ## . This information is available in the seurat object itself, ## within the slot `@commands`. ``` See an example below: ```{r a_normscalfilt2, class.source = c("fold-hide", "answer"), eval = FALSE} # a_normscalfilt ## . These counts are normalized, scaled, filtered for sure. ## ## . Total counts, max value, counts per cell and number of ## expressed features per cell, all have decreased. ## ## . This information is available in the seurat object itself, ## within the slot `@commands`. See an example below: ``` ```{r seurat_history, eval=TRUE, echo=TRUE, class.source = c("fold-hide", "answer")} # seurat_history names(sobj@commands) ``` **However**, please remember that : - Counts in the slot `count` are **raw counts** - **Normalized counts** are in the slot `data` - **scaled** data are in the slot `scaled.data` (this one was damn obvious). And it you do not find that crystal-clear, I totally agree with you.
- Raw counts have many, so many zeros ... ```{r q_sozero, class.source="question", eval = FALSE} # q_sozero ## Is it normal to you ? ```
```{r a_sozero, class.source = c("fold-hide", "answer"), eval = FALSE} # a_sozero ## . The large number of null counts is completely normal for this ## "droplet" technology. In maths/stats, we talk about "matrix sparcity", ## i.e a table with lots (lots!) of zeros. ```
- And count values are so low ... ```{r q_lowcounts, class.source="question"} # q_lowcounts ## Why ? Is it because we downsampled the read depth for this training ? ```
```{r a_lowcounts, class.source = c("fold-hide", "answer"), eval = FALSE} # a_lowcounts ## . No. Once again, this is intrinsic to the technology used. ## ## . If the data were to be downsampled, we would had done this by cropping ## the count matrix over a small chromosome, or a reduced gene est, not ## by lowering the global amount of reads (depth). ``` # Select a DE method ## Available methods [Seurat](https://satijalab.org/seurat/articles/de_vignette) let us use multiple differential analysis methods with its function [`FindMarkers`](https://satijalab.org/seurat/reference/findmarkers). 1. [wilcox](https://www.rdocumentation.org/packages/rstatix/versions/0.7.1): The wilcoxon test tests the mean of expression and looks for a difference in these means. 1. [MAST](https://www.bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAST-Intro.html): This tool has been built for Single Cell. It is based on a statistical model called ["Hurdle Model"](https://en.wikipedia.org/wiki/Hurdle_model), which excells with data that contains lots of zeros (which is our case in Single Cell RNA-Seq: most of the genes are *not* expressed.) 1. [DESeq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#recommendations-for-single-cell-analysis): This tool has originally been built for bulk RNA-Seq but now includes specific functions for Single Cell. It performs well when counts are highly variable or when you wand to compare a handful of cells. 1. [t-test](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test): The t-test uses a comparison of means to assess differential expression probability. 1. [roc](https://en.wikipedia.org/wiki/Receiver_operating_characteristic): In this method, an AUC value of `1` means that expression values for this gene alone can perfectly classify the two groupings, i.e : each of the cells in "cells.1" (condition 1) exhibit a higher level than each of the cells in "cells.2" (condition 2). An AUC value of `0` also means there is perfect classification, but in the opposite direction (higher in condition 2 than in 1). A value of `0.5` implies that the gene has no predictive power to classify the two groups. The main question now is **how to choose the right test ??** **Spoilers** : there are no option better than another in all ways ! Thank you for your attention, you can go back home now.
From Soneson & Robinson (2018) Nature Methods: ![de_tools](images/DE_tools.png) Here, researchers have designed an **artificial dataset** where they knew in advance the list of differentially expressed genes. They have used all these algorithms and consigned the results. 1. `DESeq2`, `Limma` seem to have a higher number of false positives (genes identified as differentially expressed while they were actually not.) 1. `Wilcoxon` seems to be better in general. 1. `MAST` performs well in absolute `ANOVA` was not present in this study (but probably would have performed similarly to `t-test`. ## A case: the `Plac8` gene The question is now to guess whether this gene is differentially expressed, or not. ### Cell observations Let's have a look at the gene named ['Plac8'](https://www.genecards.org/cgi-bin/carddisp.pl?gene=PLAC8&keywords=plac8) It is known to be involved in positive regulation of cold-induced thermogenesis, and positive regulation of transcription by RNA polymerase II. In order to plot its expression across all cells, we are going to use the function [`VlnPlot`](https://satijalab.org/seurat/reference/vlnplot) from the `Seurat` package. The input object is obviously contained in the `sobj` variable, since it is our only Seurat object. In addition, we are going to select the feature `Plac8`, and split the graph according to the clusters we annotated earlier. ```{r seurat_vlnplot_Plac8_demo, eval=TRUE, echo=TRUE} # seurat_vlnplot_Plac8_demo Seurat::VlnPlot( # A subset of the Seurat object # limited to clusters 8 and 10, # or else we will plot all the clusters object = subset(sobj, HarmonyStandalone_clusters %in% c("8", "10")), # The name of the gene of interest (feature = gene) features = "Plac8", # The name of the Seurat cell annotation split.by = "HarmonyStandalone_clusters", # Change color for presentation cols = c("darkslategray3", "olivedrab") ) ``` ```{r q_isplac8de, class.source="question", eval = FALSE} # q_isplac8de ## Using your intuition, is this gene differentially expressed ## between clusters 8 and 10 ? ``` ```{r a_isplac8de, class.source = c("fold-hide", "answer"), eval = FALSE} # q_isplac8de ## . In cluster 10, the violin plot highlights almost no cells with low or ## zero `Plac8` expression. The highest density of cells has a `Plac8` ## normalized expression aroung ~1.5. ## ## . In cluster 8, cells seem to have no expression of `Plac8` at all. ## ## . IMHO (feel free to disagree), the expression of the gene `Plac8` differs ## between our clusters 8 and 10. This is purely intuitive. ```
Now, let's check the same gene expression, but comparing cells tagged as in the G1 and S phase. ```{r vlnplot_seurat_group_phase_code, echo=TRUE, eval=TRUE} # vlnplot_seurat_group_phase_code Seurat::VlnPlot( # The Seurat object object = sobj, # The name of the gene of interest (feature = gene) features = "Plac8", # The name of the Seurat cell-cycle annotation group.by = "CC_Seurat_Phase", # Change color for presentation cols = c("darkslategray3", "olivedrab", "orangered3") ) ``` ```{r q_isplac8deG1S, class.source="question", eval = FALSE} # q_isplac8deG1S ## Using your intuition, is this gene differentially expressed ## between the G1 and S cell cycle phases ? ``` ```{r a_isplac8deG1S, class.source = c("fold-hide", "answer"), eval = FALSE} # a_isplac8deG1S ## . This one seems more tricky... One can observe that most of the expression ## is null, some cells express the gene. ## ## . We need to investigate a bit deeper ! ``` Okay, let's get some information about these expression distributions. ```{r general_count_table, eval=TRUE, echo=TRUE} # general_count_table ## Store counts in a variable counts <- base::as.data.frame( # The matrix to reformat into a dataframe x = SeuratObject::GetAssayData( object = sobj, assay = "RNA", layer = "data") ) ## Rename cells with cell harmony cluster base::colnames(counts) <- paste( # The names of the cell cluster for each cell sobj$CC_Seurat_Phase, # The names of the cells themselves colnames(sobj), sep = "_" ) ``` Let's check our `counts` variable : ```{r general_count_table_display, eval=TRUE, echo=FALSE} # general_count_table_display DT::datatable(head(counts)) ```

We have `r length(colnames(sobj[, sobj$CC_Seurat_Phase == "G1"]))` cells within the `G1 group` : ```{r Plac8_summaries_G1, eval=TRUE} # Plac8_summaries_G1 countsG1 <- select(counts, matches("^G1.")) Plac8G1 <- base::as.data.frame(base::t(countsG1["Plac8", ])) base::summary(Plac8G1) ```

We have `r length(colnames(sobj[, sobj$CC_Seurat_Phase == "S"]))` cells withing the `S group` : ```{r Plac8_summaries_S, eval=TRUE} # Plac8_summaries_S countsS <- select(counts, matches("^S.")) Plac8S <- base::as.data.frame(base::t(countsS["Plac8", ])) base::summary(Plac8S) ```

### From biology to statistics Okay, let us resort on statistics to evaluate our chances to be guess correctly, or our risks to guess wrong. We have lots of observations : - `r base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "G1"]))` cells within the G1 phase - `r base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "S"]))` cells within the S phase Statisticians really like to have **a lot of observations** ! Ideally, statisticians never have enough observations, but at least they prefer when they have **more than tests**. Here, we have a total of `r base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "G1"])) + base::length(base::colnames(sobj[, sobj$CC_Seurat_Phase == "S"]))` observations and we are testing a single gene. We live in a statistician's wet dream ! Now, if we think like a statistician, we can ask ourselves : - Are our cells supposed to be interacting with each others ? - Or are they independent from each others ? - Does the expression in our cells follow a normal distribution ? All of this is very important, and usually, it requires a discussion. For the normal distribution, there is an easy check. Let's draw the expression like we did above. First, we use the function [`rbind`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind) from `base` package. Be careful, the function `rbind` also exists in `DelayedArray`, `data.table`, and `BiocGenerics` packages ! We want to use the "basic" one. ```{r distribution_Plac8_table_build, eval=TRUE, echo=TRUE} # distribution_Plac8_table_build ## Add column idientifiers in each count dataframes Plac8G1$phase <- "G1" Plac8S$phase <- "S" ## Paste the rows beneith each other Plac8 <- base::rbind( ## variable pointing to G1 counts Plac8G1, ## variable pointing to S counts Plac8S, ## A Boolean, requesting that strings/characters ## should not be casted as `logical`. It breaks graphs. stringsAsFactors = FALSE ) ``` Secondly, we use the function [`gghistogram`](https://www.rdocumentation.org/packages/ggpubr/versions/0.6.0/topics/gghistogram) from the package `ggpubr` in order to display relative abundance of gene expression : ```{r distribution_Plac8_display, eval=TRUE, echo=TRUE} # distribution_Plac8_display ggpubr::gghistogram( Plac8, x = "Plac8", y = '..density..', fill = "steelblue", bins = 15, add_density = TRUE ) ``` Our distribution doesn't seem to be normal, nor binomial ... we will have to rely on **non-parametric** tests. Let's run a non-parametric test based on the mean of distributions, since it's the clothest to our 'intuitive' approach. Let there be [Wilcoxon](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) test. In R, it's quite straightforward : we have the function [`wilcoxon_test`](https://www.rdocumentation.org/packages/rstatix/versions/0.7.1) to perform the test, then we can plot the result. ```{r wilcoxon_Plac8, eval = TRUE, echo=TRUE} # wilcoxon_Plac8 ## On the expression table stored in the varialbe `Plac8`, ## first apply the function `wilcox_test` from package `rstatix`, ## then we apply the function `add_significance` from package `rstatix` stat.test <- Plac8 %>% rstatix::wilcox_test(Plac8 ~ phase) %>% rstatix::add_significance() # While doing so, we usually also compute effect size eff.size <- Plac8 %>% rstatix::wilcox_effsize(Plac8 ~ phase) ``` ```{r display_wilcoxon_Plac8_result, echo = FALSE, eval = TRUE} # display_wilcoxon_Plac8_result DT::datatable(stat.test, caption = "Wilcoxon test result") stat.test <- Plac8 %>% rstatix::t_test(Plac8 ~ phase) %>% rstatix::add_significance() ``` Wilcoxon test says: the distributions are different, with a `r stat.test$p * 100` % of risks of being wrong. The gene `Plac8` can safely be said **differentially expressed**. We can compute a fold change and conclude. **NOTE** : Just out of curiosity, be aware that [`t_test`](https://www.rdocumentation.org/packages/rstatix/versions/0.7.2/topics/t_test) from `rstatix` package, gives the same answer (with different significance). However, [`DESeq`](https://satijalab.org/seurat/reference/findmarkers) gives a p-value of 0.05 and an adjusted p-value equal to 1 !
Depending on the test used, this gene is, or is not differentially expressed. ## Conclusion In conclusion, to select your DEA method, use the following rules of thumb : 1. If you have already done a study with one of these methods, keep using the same. This is crutial if you ever want to compare your new result with the old ones. 2. If you want to compare your study with a published one, use the same method. 3. If you have no idea, use Wilcoxon. 4. If you have **bulk** data analyzed with `DESeq2`/`Limma`, use `DESeq2`/`Limma`. It will be easier to take both works in consideration. **Please, never ever use a simple Wilcoxon on bulk RNA-Seq data !!** ## NB MAST DE MAST uses a hurdle model divided into two parts: - Discrete – models the probability that a feature is detected in a cell (present or not). - Continuous – models the level of expression for the feature in cells where it is detected. MAST also uses the Cellular Detection Rate (CDR) as a covariate to adjust for sequencing effects, calculated as the fraction of detected genes per cell (nFeature barcode >0 / nHVG). It is also possible to include additional covariates, such as batch or experimental metrics, when analyzing more than two samples. # Select a dataset ## Dataset depends on selected method There it is quite easier: ```{r choose_counts, eval=TRUE, results='asis', echo=FALSE} choose_counts <- as.data.frame(t(data.frame( wilcox = c("Normalized counts", "sojb@assays[['RNA']]@data"), t_test = c("Normalized counts", "sojb@assays[['RNA']]@data"), ROC = c("Normalized counts", "sojb@assays[['RNA']]@data"), ANOVA = c("Normalized counts", "sojb@assays[['RNA']]@data"), MAST = c("Raw counts", "sojb@assays[['RNA']]@counts"), DESeq2 = c("Raw counts", "sojb@assays[['RNA']]@counts"), Limma = c("Raw counts", "sojb@assays[['RNA']]@counts") ))) colnames(choose_counts) <- c("Counts", "slot name") DT::datatable(choose_counts, caption = "How to select your counts") ``` ## FindMarkers With the function [`FindMarkers`](https://satijalab.org/seurat/reference/findmarkers) from package `Seurat`, we want to make three groups: 1. One using `wilcoxon` to perform DEA between clusters "8" and "10". 1. One using `t`-test to perform DEA between clusters "8" and "10". 1. One using `MAST` to perform DEA between "8" and "10". We will observe the results and compare our gene lists. Hey, why are you looking at me? It's your turn to work! Use the all the notions seen above to select the right counts (`slot`), the right input object, and the right arguments. 10 minutes practice !

Answers

Here are the code for each team: ```{r findmarkers_all_de, echo=TRUE, eval=TRUE} sobj_wilcox <- Seurat::FindMarkers( # The variable that contains Seurat Object object = sobj, # Name of condition 1 ident.1 = "8", # Name of condition 2 ident.2 = "10", # Factor name in the Seurat Object group.by = "HarmonyStandalone_clusters", # Differential analysis method test.use = "wilcox" ) sobj_t <- Seurat::FindMarkers( # The variable that contains Seurat Object object = sobj, # Name of condition 1 ident.1 = "8", # Name of condition 2 ident.2 = "10", # Factor name in the Seurat Object group.by = "HarmonyStandalone_clusters", # Differential analysis method test.use = "t" ) sobj_mast <- Seurat::FindMarkers( # The variable that contains Seurat Object object = sobj, # Name of condition 1 ident.1 = "8", # Name of condition 2 ident.2 = "10", # Factor name in the Seurat Object group.by = "HarmonyStandalone_clusters", # Differential analysis method test.use = "MAST" ) ``` ```{r load_dea, eval=TRUE, echo=FALSE} base::saveRDS(file="sobj_wilcox.RDS", object=sobj_wilcox) ```

In the function argument, there is a FoldChange threshold. Should we filter gene expression based on FoldChange? In case of positive answer, how much should that threshold be?

Answer

About thresholds on FDR (False Discovery Rate) and Log2(FC) (Log of the Fold Change), there are many discussions. Remember, here the threshold on Fold Change is Logged. A `log2(1) = ``r log2(1)`. And keep in mind the following: 1. If one selects a fold change threshold above 1.7, then their study concludes that smoking is not related to lung cancer. 1. If one selects a fold change threshold above 1, then their study concludes that a fast-food based diet does not lead to weight gain. 1. If one selects a fold change threshold above 1/25 000 000, then their study concludes: it is a complete hazard that mice have fetal malformation when in presence of Bisphanol. In conclusion, there one, and only one reason to filter on fold change: in my experience, a fold change below 0.7 will be hard to see/verify on wet-lab (qRT). If you need to reduce a too large number of differentially expressed genes, then reduce the FDR to 0.01, or even better, to 0.001. With that, you reduce your number of false claims.

Can you help me with `DEseq2`? When I run the following command line, I have an error : ```{r seurat_run_deseq_with_error, eval=FALSE, echo=TRUE} sobj_deseq2 <- Seurat::FindMarkers( # The variable that contains Seurat Object object = sobj, # Name of condition 1 ident.1 = 8, # Name of condition 2 ident.2 = 10, # Factor name in the Seurat Object group.by = "HarmonyStandalone_clusters", # Differential analysis method test.use = "deseq2", # Use non-normalized data with DESeq2 slot = "counts" ) ``` > Error in PerformDE(object = object, cells.1 = cells.1, cells.2 = cells.2, : > Unknown test: deseq2

Answer

Oh! My fault, it was a typo in my command! Thank you all for your help! ```{r build_count_p1_matrix, eval=TRUE, echo=TRUE} sobj_deseq2 <- Seurat::FindMarkers( # The variable that contains Seurat Object object = sobj, # Name of condition 1 ident.1 = "8", # Name of condition 2 ident.2 = "10", # Factor name in the Seurat Object group.by = "HarmonyStandalone_clusters", # Differential analysis method test.use = "DESeq2", # Use non-normalized data with DESeq2 slot = "counts" ) ``` Remark: by doing surch modification, some fold changes have been modified: remember the gene Atad2 with a mean expression of 0.08 in G1, and 0.2 in S phases? Mean expressions are now around 1.08 for G1, and 1.2 for S phases. This may be the reason why it was not differentially expressed in DESeq2, while Wilcoxon and T-test returned adjusted pvalues far below 0.05.

```{r save_de_results, eval=TRUE, echo=FALSE} base::saveRDS(sobj_wilcox, "sobj_wilcox.RDS") base::saveRDS(sobj_t, "sobj_t.RDS") base::saveRDS(sobj_roc, "sobj_roc.RDS") base::saveRDS(sobj_deseq2, "sobj_deseq2.RDS") ``` # Explore results ## Big tables Let us have a look at the results: ```{r sobj_w_res_display, eval=TRUE, echo=FALSE} DT::datatable( head(sobj_wilcox, n = 10), caption = "Wilcoxon test results" ) ``` We have the following columns: 1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them. 1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition. 1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8. 1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10. 1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error. ```{r sobj_t_res_display, eval=TRUE, echo=FALSE} DT::datatable( head(sobj_t, n = 10), caption = "T-test results" ) ``` We have the following columns: 1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them. 1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition. 1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8. 1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10. 1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error. ```{r sobj_roc_res_display, eval=TRUE, echo=FALSE} DT::datatable( head(sobj_mast, n = 10), caption = "MAST test results" ) ``` We have the following columns: 1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them. 1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition. 1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8. 1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10. 1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error. ```{r sobj_deseq_res_display, eval=TRUE, echo=FALSE} DT::datatable( head(sobj_t, n = 10), caption = "T-test results" ) ``` We have the following columns: 1. `p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them. 1. `avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in each condition. 1. `pct.1`: Percent of cells with gene expression in condition one, here in cluster 8. 1. `pct.2`: Percent of cells with gene expression in condition two, here in cluster 10. 1. `p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error. ## Filter DEA results What kind of threshold should be used to filter each results? ```{r extract_de_genes, eval=TRUE, echo=FALSE} # We store a `list` in a variable called `data` # The function `list` comes from `base` and not `biocGenerics`. data <- base::list( # We use a threshold of 5% on adjusted p-values wilcox = base::rownames(sobj_wilcox[sobj_wilcox$p_val_adj <= 0.05, ]), # We use a threshold of 5% on adjusted p-values t_test = base::rownames(sobj_t[sobj_t$p_val_adj <= 0.05, ]), # We use a threshold of 0.2 in AUC power mast = base::rownames(sobj_mast[sobj_roc$power >= 0.2, ]), # We use a threshold of 5% on adjusted p-values deseq2 = base::rownames(sobj_deseq2[sobj_deseq2$p_val_adj <= 0.05, ]) ) ``` > If we must label certain scores as good or bad, we can reference the following rule of thumb: > > 0.5 = No discrimination > 0.5-0.7 = Poor discrimination > 0.7-0.8 = Acceptable discrimination > 0.8-0.9= Excellent discrimination > 0.9 = Outstanding discrimination Hosmer and Lemeshow in Applied Logistic Regression (p. 177) ## Add results to Seurat objects We'd like to store the results of differential expression analysis in the `Seurat` object. ```{r add_results_seurat, echo=TRUE, eval=TRUE} sobj@misc$wilcox <- sobj_wilcox ``` ```{r save_sobjw, eval=TRUE, echo=FALSE} base::saveRDS(sobj, "DEA_Scaled_Normalized_Filtered.RDS") ``` ## Common results Now we can plot intersections in an up-set graph. It is like a Venn diagram: ```{r upset_seurat_de_methods, eval=TRUE, echo=TRUE} UpSetR::upset( data = UpSetR::fromList(data), order.by = "freq" ) ``` ## Others functions in Seurat for DEA There are two other functions in Seurat package to do DEA in different context. For example, you will see the difference between one cluster versus many of them. You could be use `FindAllMarkers`. Usage is equivalent to `FindMarkers` but it need to indicate which variable cluster variable. ```{r findallmarkers_dea} sobj_findallmarkers_mast <- FindAllMarkers(sobj, assay="RNA", group.by="HarmonyStandalone_clusters", test.use = "MAST", only.pos = T, logfc.threshold = 0.9, min.pct = 0.1, random.seed = my_seed) DT::datatable( head(sobj_findallmarkers_mast, n = 10), caption = "MAST results" ) p <- EnhancedVolcano( toptable = sobj_findallmarkers_mast, lab = rownames(sobj_findallmarkers_mast), x = "avg_log2FC", y = "p_val_adj", FCcutoff = 0.2, drawConnectors = FALSE, title = 'Volcano by cluster' ) p + facet_wrap(~ cluster) ``` Another function is `FindConservedMarkers`. It the same function of `FindMarkers` but this usage is to identify the same features expression between clusters ```{r findconservedmarkers_dea} sobj_conservedmarkers_mast <- Seurat::FindConservedMarkers( # The variable that contains Seurat Object object = sobj, # Name of condition 1 ident.1 = "8", # Name of condition 2 ident.2 = "10", # Factor name in the Seurat Object grouping.var = "orig.ident", # Differential analysis method assay="RNA", slot="counts" ) DT::datatable( head(sobj_conservedmarkers_mast, n = 10), caption = "MAST results" ) ``` 1. `TDCT_p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them. 1. `TDCT_avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in TDCT condition. 1. `TDCT_pct.1`: Percent of cells with gene expression in condition one, here in cluster 8 in TDCT condition. 1. `TDCT_pct.2`: Percent of cells with gene expression in condition two, here in cluster 10 in TDCT condition. 1. `TDCT_p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error in TDCT condition. 1. `TD3A_p_val`: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them. 1. `TD3A_avg_log2FC`: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in TD3A condition. 1. `TD3A_pct.1`: Percent of cells with gene expression in condition one, here in cluster 8 in TD3A condition. 1. `TD3A_pct.2`: Percent of cells with gene expression in condition two, here in cluster 10 in TD3A condition. 1. `TD3A_p_val_adj`: The confidence we have in the result. The closer to 0, the lesser is the risk of error in TD3A condition. 1. `max_pval`: Maximum of p_val between comparison TDCT vs TD3A. It must be low to indicate a conserved feature 1. `minimump_p_val`: Minimum of p_val between comparison TDCT vs TD3A. ```{r volcano_plac8 _by_sample} Seurat::VlnPlot( # A subset of the Seurat object # limited to clusters 8 and 10, # or else we will plot all the clusters object = subset(sobj, HarmonyStandalone_clusters %in% c("8", "10")), # The name of the gene of interest (feature = gene) features = "Plac8", # The name of the Seurat cell annotation split.by = "orig.ident", # Change color for presentation cols = c("blue", "red") ) sobj_conservedmarkers_mast$direction = case_when( sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == sign(sobj_conservedmarkers_mast$TD3A_avg_log2FC) & sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == -1 ~ "negative", sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == sign(sobj_conservedmarkers_mast$TD3A_avg_log2FC) & sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == 1 ~ "positive", .default = "opposite") t_dataframe <- table(sobj_conservedmarkers_mast$direction,sobj_conservedmarkers_mast$max_pval<0.05)[,2] levels_direction <- paste0(names(t_dataframe)," (nFeature=",t_dataframe,")") sobj_conservedmarkers_mast_filtered = sobj_conservedmarkers_mast[sobj_conservedmarkers_mast$minimump_p_val<0.05,] sobj_conservedmarkers_mast_filtered$direction = factor(sobj_conservedmarkers_mast_filtered$direction, levels=c("negative","opposite","positive"), labels=levels_direction) ggplot() + geom_point(data=sobj_conservedmarkers_mast_filtered, aes(x = TDCT_avg_log2FC, y = TD3A_avg_log2FC, color = direction)) + scale_color_manual(values=c("lightblue","grey70","red")) + theme_bw() + labs(x = "TDCT Average Log2 FoldChange cluster 8 vs cluster 12", y = "TD3A Average Log2 FoldChange cluster 8 vs cluster 12", title = "Result of Wilcoxon Test between cluster 8 vs cluster 12 between condition TDCT vs TD3A", color = "Direction Log2FoldChange") ``` ## Heatmap We'd like to display the expression of genes identified by FindMarkers. Then we use the function [`DoHeatmap`](https://satijalab.org/seurat/reference/doheatmap) from the package `Seurat`. In order to limit the graph to differentially expressed reads, we use the function [`rownames`](https://rdocumentation.org/packages/base/versions/3.6.2/topics/row.names) from R `base` package on the DEA result table. In this example, I use the results of wilcoxon, but you shall use any of the results you previously obtained. ```{r seurat_heatmap, eval=TRUE, echo=TRUE} Seurat::DoHeatmap( # variable pointing to seurat object object = sobj, # name of DE genes features = base::rownames(sobj_wilcox), # Cluster annotation group.by = "HarmonyStandalone_clusters", ) ``` ## Volcano plot A Volcano plot is usefull to identify differnetial expression analysis bias. The package `EnhancedVolcano` has an [eponym](https://bioconductor.org/packages/3.17/bioc/html/EnhancedVolcano.html) function for that: ```{r enhanced_volcanoplot, eval=TRUE, echo=TRUE} EnhancedVolcano::EnhancedVolcano( # variable pointing to the DEA results toptable = sobj_wilcox, # Gene names lab = rownames(sobj_wilcox), # Column in which to find Fold Change x = "avg_log2FC", # Column in which to find confidence interval y = "p_val_adj", # Lower fold-change cut-off FCcutoff = 0.2 ) ``` ## Session Info This list of all packages used while you work should be included in each en every R presentation: ```{r session_info, eval=TRUE, echo=TRUE} utils::sessionInfo() ```