EB3I n1 2025 scRNAseq
-
INTEGRATION
-





1 PREAMBLE

1.1 Purpose of this session

This file describes the different steps to perform sixth part of data processing for the single cell RNAseq data analysis training course for the EBAII n1 2024, covering these steps :

  • Integration of a pair of samples

1.2 The integration data set

  • We will still use data from the Paiva et al. publication

  • While we focused on the KO sample TD3a until now, we will integrate it with its wild-type counterpart TDCT

  • TDCT was processed in a very similar manner (QC, doublers, filtering, LogNormalization, HVGs and scaling) to what you just performed for TD3A

  • Both samples were halved for their number of cells, for practical reasons in the context of this training (computation speed)

  • As the purpose is to integrate both samples into a single analytical entity, the objects we will use as input are Seurat objects up to the data scaling and regression (thus, there are no dimension reduction data in these)



2 Start Rstudio

3 Warm-up

  • We set common parameters we will use throughout this session :
## Seed for the RNG (our sole parameter for this session)
my_seed <- 1337L

## Required for our big objects
options(future.globals.maxSize = 1E+09)
library(Seurat)
library(ggplot2)
library(patchwork)
library(dplyr)


4 Prepare the data structure

We will do the same as for former steps, just changing the session name :

4.1 Main directory

## Setting the project name
project_name <- "ebaii_sc_teachers"  # Do not copy-paste this ! It's MY project !!

## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")

## Creating the root directory
dir.create(path = TD_dir, recursive = TRUE)

## Print the root directory on-screen
print(TD_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD"

4.2 Current session

## Creating the session (Preproc.1) directory
session_dir <- paste0(TD_dir, "/06_Integration")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/06_Integration"

4.3 Input directory

## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)

## Print the input directory on-screen
print(input_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/06_Integration/DATA"

4.4 Output directory

## Creating the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)

## Print the output directory on-screen
print(output_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/06_Integration/RESULTS"


5 Load input data

  • The data consists into a pair of Seurat objects (one per sample) as RDS, hosted in a Zenodo respository (Id : 14081572)

5.1 Download data

  • We will directly retrieve data from Zenodo to your input_dir :
### Named files (will be used later on !)
TD3A_rds <- "TD3A_Filtered_12508.2018.RDS"
TDCT_rds <- "TDCT_Filtered_12254.1868.RDS"

## Filename(s) to retrieve
toget_files <- c(TD3A_rds,
                 TDCT_rds)

## Folder to store retrieved files
local_folder <- input_dir

## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")

## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")

## Zenodo ID
zen_id <- '14094752'

### Define remote folder
remote_folder <- if (backup) "/shared/projects/2422_ebaii_n1/atelier_scrnaseq/TD/BACKUP/INTEGRATION/" else paste0("https://zenodo.org/records/", zen_id, "/files/")

### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)

### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)

## Retrieve files (if they don't exist), in loop
for (tg in seq_along(toget_files)) {
  ## If the file does not locally exist
  if (!file.exists(local_path[tg]) | force) {
    ## Retrieve data
    if(backup) {
      file.copy(from = remote_path[tg],
                to = local_path[tg])
    } else {
      download.file(url = remote_path[tg], 
                    destfile = local_path[tg])
    }
    ## Check if downloaded files exist locally
    if(file.exists(local_path[tg])) message("\tOK")
  } else message(paste0(toget_files[tg], " already downloaded !"))
}
TD3A_Filtered_12508.2018.RDS already downloaded !
TDCT_Filtered_12254.1868.RDS already downloaded !

5.2 Load into R

## Loading TD3A
TD3A_so <- readRDS(file = paste0(input_dir, 
                                   "/", 
                                   TD3A_rds))

## Loading TDCT
TDCT_so <- readRDS(file = paste0(input_dir, 
                                   "/",
                                   TDCT_rds))


6 Merge datasets

  • We simply merge the two objects into a single one :
TD3A_TDCT_merge <- merge(
  x = TD3A_so, ## First object to merge
  y = TDCT_so, ## Second object to merge
  add.cell.ids = c("TD3A", "TDCT"), ## Keep track of sample of origin for cells (optional)
)
  • We can take a look at its summary
## See the object summary
TD3A_TDCT_merge
Show output
An object of class Seurat 
12926 features across 3886 samples within 1 assay 
Active assay: RNA (12926 features, 0 variable features)
 4 layers present: counts.TD3A, counts.TDCT, data.TD3A, data.TDCT
## TD3A
dim(TD3A_so)
[1] 12508  2018
## TDCT
dim(TDCT_so)
[1] 12254  1868
## Merged
dim(TD3A_TDCT_merge)
[1] 12926  3886
## Clean unneeded objects
rm(TD3A_so, TDCT_so)

6.1 Process the merged dataset

In order to visualize the data in a 2-D space, one needs to process it the way one learnt :

## All steps
TD3A_TDCT_merge <- NormalizeData(object = TD3A_TDCT_merge, verbose = FALSE)
TD3A_TDCT_merge <- FindVariableFeatures(object = TD3A_TDCT_merge, nfeatures = 2000, verbose = FALSE)
TD3A_TDCT_merge <- ScaleData(object = TD3A_TDCT_merge, verbose = FALSE)
TD3A_TDCT_merge <- RunPCA(object = TD3A_TDCT_merge, npcs = 50, verbose = FALSE)
ElbowPlot(object = TD3A_TDCT_merge, ndims = 50, reduction = "pca")
Show plot

6.2 Visualization of batch effect

DimPlot(object = TD3A_TDCT_merge, 
        reduction = "pca",
        group.by = "orig.ident", 
        cols = sample_cols)
Show plot

TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, dims = 1:20, verbose = FALSE)
DimPlot(object = TD3A_TDCT_merge, 
        reduction = "umap",
        group.by = "orig.ident", 
        cols = sample_cols)
Show plot

CCL ??

7 Integration

7.1 Seurat : CCA

7.1.1 CCA integration

computing_time <- data.frame()
start <- Sys.time()
TD3A_TDCT_merge <- IntegrateLayers(object = TD3A_TDCT_merge, 
                                      orig.reduction = "pca",
                                      method = CCAIntegration, 
                                      new.reduction = "integrated.cca", 
                                      normalization.method = "LogNormalize")
end <- Sys.time()
computing_time["cca","comptime"] <- end-start

7.1.2 Seurat : RPCA

start <- Sys.time()
TD3A_TDCT_merge <- IntegrateLayers(object = TD3A_TDCT_merge, 
                                      orig.reduction = "pca",
                                      method = RPCAIntegration, 
                                      new.reduction = "integrated.rpca", 
                                      normalization.method = "LogNormalize")
end <- Sys.time()
computing_time["rPCA","comptime"] <- end-start

7.1.3 Seurat : Harmony

start <- Sys.time()
TD3A_TDCT_merge <- IntegrateLayers(object = TD3A_TDCT_merge, 
                                      orig.reduction = "pca",
                                      method = HarmonyIntegration, 
                                      new.reduction = "harmony",
                                      normalization.method = "LogNormalize")
end <- Sys.time()
computing_time["Harmony","comptime"] <- end-start
Show output
comptime
cca 6.603909 secs
rPCA 8.666372 secs
Harmony 2.732028 secs

Reductions(object = TD3A_TDCT_merge)

7.2 Computing umap from “corrected” spaces

TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, 
                           reduction = "integrated.cca", 
                           dims = 1:20, 
                           reduction.name = "umapcca")
TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, 
                           reduction = "integrated.rpca",
                           dims = 1:20, 
                           reduction.name = "umaprpca")
TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, 
                           reduction = "harmony", 
                           dims = 1:20, 
                           reduction.name = "umapharm")
Reductions(object = TD3A_TDCT_merge)
Show output
[1] "pca"             "umap"            "integrated.cca"  "integrated.rpca"
[5] "harmony"         "umapcca"         "umaprpca"        "umapharm"       

7.3 Effect of the correction/integration

7.4 Visualization of the different methods of integration: Projecting the sample id onto the different umap spaces obtained

Show plot

7.4.1 Clustering

7.4.1.0.1 Graph based on CCA Intgration
TD3A_TDCT_merge <- FindNeighbors(object = TD3A_TDCT_merge, 
                                 reduction = "integrated.cca", 
                                 dims = 1:20, 
                                 graph.name = "snn_cca")
TD3A_TDCT_merge <- FindClusters(object = TD3A_TDCT_merge,
                                graph.name = "snn_cca",
                                resolution = c(.5,.8,1), 
                                algorithm = 4)
7.4.1.0.2 Graph based on rPCA Intgration
TD3A_TDCT_merge <- FindNeighbors(object = TD3A_TDCT_merge, 
                                 reduction = "integrated.rpca", 
                                 dims = 1:20, 
                                 graph.name = "snn_rpca")
TD3A_TDCT_merge <- FindClusters(object = TD3A_TDCT_merge,
                                graph.name = "snn_rpca",
                                resolution = c(.5,.8,1), 
                                algorithm = 4)
7.4.1.0.3 Graph based on Harmony Intgration
TD3A_TDCT_merge <- FindNeighbors(object = TD3A_TDCT_merge, 
                                 reduction = "harmony", 
                                 dims = 1:20, 
                                 graph.name = "snn_harm")
TD3A_TDCT_merge <- FindClusters(object = TD3A_TDCT_merge,
                                graph.name = "snn_harm",
                                resolution = c(.5,.8,1), 
                                algorithm = 4)
Show plot

Show plot

Show plot

Show plot

Show plot

Show plot