--- title: "Single-Cell Transcriptomics and Epigenetics: Theory and Practice (sincellTE 2022)" author: "Agnes Paquet, Nicolas Nottet, Syneos Health Translational Sciences" date: "2022-01-11" output: html_document: keep_md: yes number_sections: yes smart: no toc: yes toc_float: yes editor_options: chunk_output_type: console --- ![](baniere_titre.png) ```{js, echo=FALSE} $(document).ready(function() { $chunks = $('.fold'); $chunks.each(function () { // add button to source code chunks if ( $(this).hasClass('s') ) { $('pre.r', this).prepend("
Show Source

"); $('pre.r', this).children('code').attr('class', 'folded'); } // add button to output chunks if ( $(this).hasClass('o') ) { $('pre:not(.r)', this).has('code').prepend("
Show Output

"); $('pre:not(.r)', this).children('code:not(r)').addClass('folded'); // add button to plots $(this).find('img').wrap('
');
      $('pre.plot', this).prepend("
Show Plot

"); $('pre.plot', this).children('img').addClass('folded'); } }); // hide all chunks when document is loaded $('.folded').css('display', 'none') // function to toggle the visibility $('.showopt').click(function() { var label = $(this).html(); if (label.indexOf("Show") >= 0) { $(this).html(label.replace("Show", "Hide")); } else { $(this).html(label.replace("Hide", "Show")); } $(this).siblings('code, img').slideToggle('fast', 'swing'); }); }); ``` ```{css, echo=FALSE} .showopt { background-color: #004c93; color: #FFFFFF; width: 100px; height: 20px; text-align: center; vertical-align: middle !important; float: right; font-family: sans-serif; border-radius: 8px; } .showopt:hover { background-color: #dfe4f2; color: #004c93; } pre.plot { background-color: white !important; } ``` ```{r setup, cache = F, echo=FALSE, include=FALSE} knitr::opts_chunk$set(error = TRUE) Sys.setenv("LANGUAGE"="EN") require(tidyverse) ``` For this session, we will be using the R packages Seurat and SingleR to explore to scRNASeq data and use various reference databases for automated cell type annotations. We will use a subset of the lung samples from the Mouse Cell Atlas as example. The datasets are available in the folder: /shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation We will use a subset of 2 lung samples from the Mouse Cell Atlas (MCA). I have used the standard Seurat pipeline to process the samples. The R code used is presented below, so that you reproduce the workflow during open analysis. For this practical session, we will load the Seurat object mca from the the mca.rds file located here: /shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/mca.rds ```{r} library(Seurat) library(pheatmap) library(dplyr) #Change your working directory to the course directory user.id <- Sys.getenv('USER') setwd(paste0("/shared/projects/sincellte_2022/", user.id, "/Secondary_analyses")) ``` # Data analysis using R package Seurat The new version of Seurat allows to store data from multiple modalities, such as RNAseq, spatial transcriptomics, ATAC-seq...For our session today, we will focus on RNAseq only. The various assays are stored in the assays slot. ## Read in the data and create a Seurat object The data is stored in the MCA folder. We will first need to read in the count data for each sample, and format it for input into Seurat: - Read in the file. - Create a dataframe for the cell labels - Remove the cell labels from the count table - Transpose the data to have genes in row and cell barcodes in columns ```{r} lung1 = read.delim("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/MCA/MCA_lung_S1_filter.txt", stringsAsFactors = F, row.names = 1) lung1.t = t(lung1) head(lung1.t[,1:10]) head(lung1[,1:5]) labels.1 = lung1[,"labels"] names(labels.1) = rownames(lung1) table(labels.1) lung1.t = t(lung1[,-1]) ``` Now, we can create a new Seurat object with our lung count data, and add the cell labels in the meta.Data slot. Filtering step: keep cells with at least 100 genes, and genes expressed in at least 5 cells. Visually inspect your data. ```{r} l1 = CreateSeuratObject(count=lung1.t, project="MCA_lung1", min.cells=5, min.features =100) l1$sample <- "S1" l1$cell.label <- labels.1 head(l1@meta.data) VlnPlot(object = l1, features = "nCount_RNA") VlnPlot(object = l1, features = "nFeature_RNA") median(l1$nCount_RNA) ``` Repeat the same process for the second lung sample. ```{r} lung2 = read.delim("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/MCA/MCA_lung_S2_filter.txt", stringsAsFactors = F, row.names = 1) labels.2 = lung2[,"labels"] names(labels.2) = rownames(lung2) table(labels.2) lung2.t = t(lung2[,-1]) l2 = CreateSeuratObject(count=lung2.t, project="MCA_lung2", min.cells=5, min.features =100) l2$sample <- "S2" l2$cell.label <- labels.2 head(l2@meta.data) VlnPlot(object = l2, features = "nCount_RNA") VlnPlot(object = l2, features = "nFeature_RNA") median(l2$nCount_RNA) ``` ## Merge the samples into a single Seurat object Merge the 2 samples using the Merge function in Seurat, and visually inspect the data to look for sample bias. - How many cells to we have in each sample? ```{r} mca = merge(l1,l2) table(mca$sample) VlnPlot(mca, "nCount_RNA", group.by="sample") VlnPlot(mca, "nFeature_RNA", group.by="sample") ``` ## Normalization The median UMI count in this data is low. We will normalize the data to the median UMI counts. Look at the parameters to verify that the normalization was performed. ```{r} mca <- NormalizeData(mca, scale.factor=median(mca$nCount_RNA)) mca@commands ``` ## Downstream analysis Now, we will run the complete analysis using defaults parameters. The steps will be: 1. Feature reduction using the FindVariableFeature function 2. Center and scale the data using the ScaleData function 3. Run PCA, select the number of PCs to keep 4. Cluster the cells in the PCA space 5. Generate the tSNE or UMAP graph First, we select the most variable feature using the FindVariableFeatures. How many genes did we select? What are the top 10 most variable genes? Can you highlight them on a Variance vs Average expression plot? ```{r, fig.width=12} mca <- FindVariableFeatures(mca) length(VariableFeatures(mca)) # Identify the 10 most highly variable genes top10 <- head(VariableFeatures(mca), 10) # plot variable features with and without labels plot1 <- VariableFeaturePlot(mca) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) plot2 ``` Center and scale the data, then apply a PCA. Look at the data in the PCA space: do you see any sample bias? Here, we could also regress out confounding variables such as cell cycle score if needed. ```{r} mca <- ScaleData(mca) mca <- RunPCA(mca) DimPlot(mca, reduction="pca",group.by="sample") ``` Look at the Elbow plot to estimate the number of PCs to keep. ```{r} ElbowPlot(object = mca) ``` We will keep 6 PCs for downstream analysis. Finally, run the clustering algorithm, compute the tSNE/UMAP, and explore the results. Do you think we need to apply additional sample integration? You can use the following markers to explore the different clusters: Sftpb, Cav1, Pdpn, Foxj1, Scgb1a1, Cd74, Marco, Cd8b1 ```{r} mca <- FindNeighbors(object = mca, dims = 1:6,do.plot = TRUE) mca <- FindClusters(object = mca) mca <- RunUMAP(object = mca, dims = 1:6) DimPlot(mca) DimPlot(object = mca, group.by="sample") FeaturePlot(mca, features="nCount_RNA") FeaturePlot(mca, c("Sftpb","Cav1")) FeaturePlot(mca,c("Pdpn", "Scgb1a1")) FeaturePlot(mca,c("Cd79a", "Marco")) VlnPlot(mca, c("Foxj1","Scgb1a1")) ``` ## Differential Analysis Let's find differentially expressed genes between our clusters. We will use the FindAllMarkers function in Seurat. This function offers several statistical tests for this task. The default is a Wilcoxon Rank Sum test. The result tables contains statistical results such as p-values and logFC, but also the % of cells in each group expressing the markers. Checking the % of detection of the markers in a cluster can help you select genes specifically expressed in a cell type, and not in others. ```{r} markers <- FindAllMarkers(mca) head(markers) ``` Then, we can select the top markers for each cluster and generate a heatmap based on these genes. ```{r} top.markers <- markers %>% group_by(cluster) %>% top_n(n = 5, wt = avg_log2FC) DoHeatmap(mca, features = top.markers$gene) + NoLegend() ``` If you have very small clusters, it is informative to generate a heatmap based on the average profiles to look at the top markers. ```{r} ave.exp = AverageExpression(mca, return.seurat = T) DoHeatmap(ave.exp,features = top.markers$gene) + NoLegend() ``` What can you conclude on your clustering based on this heatmap? Let's compare your clustering results to the original cell labels. ```{r} table(mca$RNA_snn_res.0.8, mca$cell.label) ``` We can now either repeat the clustering steps with different parameters to avoid over clustering, or we could group the cluster manually and assign labels based on the known marker genes. We can also use automated annotation tools to help us for cell annotations if needed. Let's use the cell identity provided with the dataset as cluster labels and repeat the analysis. ```{r} Idents(mca) <- mca[["cell.label"]] table(Idents(mca)) cells.markers <- FindAllMarkers(mca) cells.top.markers <- cells.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC) DoHeatmap(mca, features = cells.top.markers$gene) + NoLegend() saveRDS(mca, file="mca.rds") ``` ## Exploratory data analysis Let's load our processed dataset and generate a couple of graphs to understand what Take some time to explore the differences between cell types using graphs. Take a close look at the number of genes and the number of molecule in each cell type. Do you see any difference in cell type proportions between the 2 samples? ```{r} mca <- readRDS(file="/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/mca.rds") head(mca@meta.data) table(mca$sample) DimPlot(mca, label = T) DimPlot(mca, group.by="sample") ## Calculate the % of each cell type by sample tt = table(Original.Label=mca$cell.label, Sample=mca$sample) round(sweep(tt*100,2,as.numeric(colSums(tt)), FUN="/"),1) DimPlot(mca, split.by = "sample") ## create a new variable combining sample and cell type for EDA ## Inspect known marker genes distribution to verify specificity VlnPlot(mca, "nFeature_RNA") VlnPlot(mca, "nCount_RNA") VlnPlot(mca, c("Sftpb", "Foxj1", "Cd79a"), group.by="cell.label") ``` # Automated cell type annotation using SingleR Determining cell types can be challenging. There are now several tools available to help you with this task. We will use the package SingleR to perform automated annotations of our cells. Several reference are provided in the package celldex. If these references do not correspond to your biological system, you can also create your own reference set to use in SingleR (from single cell or bulk RNAseq). ## ImmGenData Let's look at the Mouse reference dataset that can be of interest for us: ImmGenData This reference contains normalized genes expression profiles of murine cell populations that were sorted and analyzed using microarrays. First, we will load the ImmGenData. Several levels of annotations are provided: one broad level level.main, and one precise level label.fine. Cell ontology ids are also available. ```{r} library(SingleR) library(celldex) immgen.ref <- ImmGenData() immgen.ref table(immgen.ref$label.main) head(sort(table(immgen.ref$label.fine), decreasing = T),20) ``` This package provides a single function for cell type annotations. It takes as input a count matrix, or a SingleCellExperiment object. For test data, we can provide either raw or normalized count data, since singleR is based on Spearman correlation. If you are using a full-length protocol (such as Smart-Seq), and you want to annotate your data using a reference obtained using UMI-based protocol, it is recommended to adjust your count data for gene length and provide TPMs to SingleR. ```{r} immgen.pred = SingleR(GetAssayData(mca, slot="data"), ref=immgen.ref, label=immgen.ref$label.main) colnames(immgen.pred) sort(table(immgen.pred$labels), decreasing=TRUE) ``` The package provides nice graphical output to assess the quality of the classification. We can easily generate a heatmap of correlation scores for each cell against the reference, to look for ambiguous assignments. ```{r} plotScoreHeatmap(immgen.pred) ``` We can also display cell meta data at the top of the graph, to look for batch effects etc... ```{r} my.labels = data.frame(Sample=mca@meta.data[,c("sample")]) rownames(my.labels) = rownames(mca@meta.data) plotScoreHeatmap(immgen.pred, annotation_col=my.labels) ``` Let's check how this cell type annotation compares with the original labels. ```{r} res.tab = table(mca$cell.label,immgen.pred$labels) res.tab pheatmap(log10(res.tab+1)) ``` ## Using your own reference As we can see in the previous example, existing databases may not match your data very well. We will see in this part how to build you own reference, either from your own data that you have been able to manually annotate previously, or from a public repository. Let's start by testing if we can annotate Sample 2 against the other one. First, we will create the reference annotation. We need to extract the log normalized counts for Sample1 from the mca object, and we will use the original cell labels as cell type annotations. Once our reference ready, we annotate sample 2. Inspect quality metrics and graphs to check if the cell type predictions are good. ```{r} sample1 = subset(mca, subset = sample=="S1") Idents(sample1) = sample1$cell.label sample1.norm = GetAssayData(sample1, slot="data") sample2 = subset(mca, subset = sample=="S2") sample2.pred = SingleR(as.SingleCellExperiment(sample2), ref = sample1.norm, labels = sample1$cell.label) plotScoreHeatmap(sample2.pred) table(NewLabel=sample2.pred$labels, Original.Label=sample2$cell.label) ``` ## Using a references from a different technology Our dataset is a subset of the Mouse Cell Atlas, generated using MicroWell-Seq technology. In this exercise, we will try to annotate our MCA sample using the Tabula Muris cell atlas. I have extracted the lung data generated using the Smart-Seq2 technology from the Tabula Muris website. We can now read in the data, generate log normalized counts using the scuttle package, and use the normalized data as reference set. ```{r} tab.muris = read.csv("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/TabulaMuris/Lung-counts.csv", row.names = 1) head(tab.muris[,1:5]) TM.annot.cells = read.delim("/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/TabulaMuris/annotations_lung.txt") rownames(TM.annot.cells) = TM.annot.cells$cell head(TM.annot.cells) ## We will keep only cells for which we have an annotations, and align the tables tab.muris = tab.muris[, TM.annot.cells$cell] library(SingleCellExperiment) TM = SingleCellExperiment(list(counts=tab.muris), colData=data.frame(TabulaMuris.SmartSeq=TM.annot.cells[,2])) TM.norm = scuttle::logNormCounts(TM) TM.pred = SingleR(as.SingleCellExperiment(sample1), ref = TM.norm, labels = colData(TM.norm)[,1]) plotScoreHeatmap(TM.pred) TM.res = table(TM.label = TM.pred$labels, Original.label = sample1$cell.label) pheatmap(log10(TM.res+10)) ``` ## Using Azimuth https://azimuth.hubmapconsortium.org/ Azimuth is a web application that can be used to automate the processing, analysis, and interpretation of a new single-cell RNA-seq experiment.The analysis pipeline is based on Seurat, and several Human or Mouse reference dataset are provided. Azimuth analysis results can be downloaded and integrated in your Seurat analysis. I have uploaded our mca data into Azimuth, and used the Human lung atlas reference. The annotation results are available in the "azimuth_pred.tsv" file. You can read in the file and see how it performed, even though it's not the same model organism. ```{r} predictions <- read.delim('/shared/projects/sincellte_2022/Courses/Secondary_analyses/input/cell_annotation/azimuth_pred.tsv', row.names = 1) mca <- AddMetaData( object = mca, metadata = predictions) table(Original.Label = mca$cell.label, Azimuth = mca$predicted.annotation.l1) ``` # __Ressources and References__ Seurat tutorials: https://satijalab.org/seurat/ SingleR book: http://bioconductor.org/books/release/SingleRBook/ Pokedex for Cell Types: http://bioconductor.org/packages/release/data/experiment/vignettes/celldex/inst/doc/userguide.html Azimuth: https://azimuth.hubmapconsortium.org/
```{r} sessionInfo() ```
## Ressources - [Base R cheat sheet](http://github.com/rstudio/cheatsheets/raw/master/base-r.pdf) - [Advanced R cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/02/advancedR.pdf) - [Rstudio cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf) - [R markdown cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf) - [Regular Expression](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf)