--- title: "
EBAII n1 2023 : SINGLE CELL ANALYSIS TRAINING
PROCESSING
Data integration and bias correction
" date: "2023-11-08.10" author: - name: "Bastien JOB" - name: "Remi MONTAGNE" email: "bastien.job@gustaveroussy.fr" output: html_document: css: columns.css background: black fig_height: 6 fig_width: 8 highlight: tango ## Theme for the code chunks number_sections: false ## Adds number to headers (sections) theme: flatly ## CSS theme for the HTML page toc: true ## Adds a table of content toc_float: ## TOC options collapsed: true ## By default, the TOC is folded smooth_scroll: true ## Smooth scroll of the HTML page self_contained: true ## Includes all plots/images within the HTML code_download: true ## Adds a button to download the Rmd code_folding: show thumbnails: false lightbox: true fig_caption: false gallery: true use_bookdown: true always_allow_html: true ## Allow plain HTML code in the Rmd --- `r Hmisc::hidingTOC(buttonLabel = 'Show TOC', hidden = TRUE, tocSide = 'left', buttonSide='left', posCollapse = 'margin', levels = 3)` ```{r setup, include=FALSE} # options(width = 60); knitr::opts_chunk$set( echo = TRUE, # Print the code eval = TRUE, # Do not run command lines message = FALSE, # Print messages prompt = FALSE, # Do not display prompt comment = NA, # No comments on this section warning = FALSE, # Display warnings tidy = FALSE, # root.dir = file.path('/', 'shared', 'projects', '2325_ebaii', 'SingleCell'), # results = 'hide' width = 100 # Number of characters per line ) my_seed <- 1337 library(Seurat) ``` # Load the latest Seurat object Start a **Rstudio** session

![](Rstudio.png)

Load the latest Seurat objects saved as RDS. ```{r load data} data_dir <- '/shared/projects/2325_ebaii/SingleCell/' s1 <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/TD3A_LogNorm.RDS')) s2 <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/TDCT_LogNorm.RDS')) ``` Have a look at the description of the data ```{r data description} EBAII.n1.SC.helper::seurat4_descriptor(sobj = s1, describe = "assay") EBAII.n1.SC.helper::seurat4_descriptor(sobj = s2, describe = "assay") ``` # Evaluation of the batch effect First, we will merge (concatenate) the data. ```{r merge datasets} sobj <- merge(s1, s2, merge.data = TRUE) dim(s1) dim(s2) dim(sobj) rm(s1, s2) ``` We will have a look at their low dimensional representation. With the help of the quick visualization function, we will plot a UMAP and color the cells according to their sample of origin. ```{r plot merged datasets} VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'data', group_by = 'orig.ident', return_object = TRUE) VIZ ``` Cells are clearly separated depending on their sample of origin. There is a batch effect that needs to be corrected. # Integration with Harmony ## Scaling and bias regression Harmony takes as input the result of a first dimension reduction (PCA), thus after the normalization and scaling of the data. Our input Seurat object is already normalized. We will need to scale the new merged object. Scaling is also the good time to try to correct technical sources of biases (regression). Indeed, some technical factors can bias the data: they will affect the cell space structure (seen on the UMAP), the future clustering, cell type annotation by cluster, etc ... These technical factors can be the number of UMIs(`nCount_RNA`), the percent of mitochondrial RNA... All the factors that we see in the **metadata** table *could* be a source of bias. ```{r head metadata} head(sobj@meta.data) ``` We will check if this is the case with a helper function ```{r bias heatmap} EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20) ``` The heatmap shows the contribution of the technical factors to the PCA dimensions. The strength of this contribution is indicated by the intensity of the blue color. Ideally, if the factor is not a source of bias, its contribution to all dimensions is roughly the same. If the factor contributes a lot to a few dimensions (especially the very first ones), we should consider _testing_ regressing it out. Here, we will need to remove the influence of these technical factors. We say that we regress out some variables. This is done within the `Seurat::ScaleData()` function.
! technical issue: this step takes too much time, we will not run the following command. We will load a scaled object instead, that was prepped for you. ```{r scale and correct biases, eval=FALSE} ## Technical issue: on the cluster it take too much time, we will not run this command ## we will load a preconstructed file instead. vtr = c('nCount_RNA', 'percent_rb') sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr) sobj <- Seurat::FindVariableFeatures(sobj) saveRDS(sobj, paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/object_to_integrate.RDS')) ``` ```{r load scaled data} sobj <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/object_to_integrate.RDS')) ``` ```{r bias heatmap of corrected data} sobj <- Seurat::RunPCA(object = sobj, npcs = 50, seed.use = my_seed, verbose = FALSE) EBAII.n1.SC.helper::dimred_covar_cor(sobj = sobj, dimred = 'pca', ncomp = 20) ``` ## Dimension reduction Harmony works in low dimension. We have just applied PCA to the data. Remind that PCA replaces the thousands of genes by a small number of dimensions that summarize most of the information. For integration, we want to use only the **informative** dimensions (the earliest ones). We will choose them with a elbow plot. ```{r choose number of informative PC} ### Finding the number of dims to use ==== Seurat::ElbowPlot(object = sobj, reduction = 'pca', ndims = 50) ## Result : we will keep 20 dimensions for Harmony KDIM <- 20 ``` This dataset is quite simple in its complexity . Here we see that the 20 first dimensions summarize most of the information (most of the variability). So we will only need to make Harmony work on those 20 dimensions. For more complex datasets, the default value is generally 50, but it needs to be adjusted to each specific dataset. ## Harmony integration We are now ready to apply Harmony to integrated the different datasets. ```{r run Harmony} sobj <- harmony::RunHarmony(object = sobj, reduction.use = 'pca', dims.use = 1:KDIM, group.by.vars = 'orig.ident', max_iter = 100) ``` ## Post-integration control : evaluating the residual effect of covariates We removed the principal sources of biases and integrated the datasets. We can once again look at the heatmap of the effect of the technical factors to make sure that we remove the majority of the bias. **Though**, beware that you did not also remove some biological signal : you may check if your marker genes expression still behave as expected (on a UMAP, by example). ```{r plot harmony integrated data} VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'harmony', slot = NULL, dimred = 'harmony', ncomp = KDIM, group_by = 'orig.ident', return_object = TRUE) ``` ```{r} EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'harmony', ncomp = 20) ``` # Last thoughts: integration -VS- bias regression We saw earlierly that the columns of the metadata can be seen as potential technical biases to correct. ```{r print metadata} colnames(sobj@meta.data) ``` One of the metadata columns, `orig.ident` indicates the sample of origin, that is to say the information about the batch, here. This is precisely the "bias" that we try to remove when we perform integration. So why can't we just regress out `orig.ident` as any other bias ? Earlierly, we regressed out `nCount_RNA` and `percent_rb` with the command: ```r vtr = c('nCount_RNA', 'percent_rb') sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr) ``` Why do we need special integration methods, couldn't we just run the command ```r vtr = c('nCount_RNA', 'percent_rb', 'orig.ident') sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr) ``` Let's have a try. ```{r correct orig.ident} vtr <- 'orig.ident' sobj <- Seurat::FindVariableFeatures(object = sobj, assay = 'RNA', nfeatures = 2000) sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', vars.to.regress = vtr) VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'scale.data', group_by = 'orig.ident', return_object = TRUE) EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20) ``` Correcting only this variable is not enough. ```{r correct orig.ident and other biases} vtr <- c('orig.ident', 'nCount_RNA', 'percent_rb') sobj <- Seurat::FindVariableFeatures(object = sobj, assay = 'RNA', nfeatures = 2000) sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', vars.to.regress = vtr) VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'scale.data', group_by = 'orig.ident', return=TRUE) EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20) ``` This is better than a simple _merging_ of the databut not as good as Harmony. Actually, the batch effect has a very complex structure while the method used for bias correction is quite simple. It is not able to correct the batch effect in a proper manner. It might not align correctly the cell populations, or over-correct the data and align different cell populations ! This is especially true when integrating multiple samples. The integration methods were developed specifically to account for the complexity of the batch effect(s) and give better results. # Save results ```{r save, class.source = "fold-hide"} saveRDS(object = sobj, file = './TD3A_integrated.RDS', compress = 'bzip2') ```


_Rsession_ ```{r} utils::sessionInfo() ```