EB3I n1 2025 scRNAseq
-
INTEGRATION
-

1 PREAMBLE

1.1 Purpose of this session

This file describes the different steps to perform sixth part of data processing for the single cell RNAseq data analysis training course for the EBAII n1 2024, covering these steps :

Integration of a pair of samples

1.2 The integration data set

We will still use data from the Paiva et al. publication
While we focused on the KO sample TD3a until now, we will integrate it with its wild-type counterpart TDCT
TDCT was processed in a very similar manner (QC, doublers, filtering, LogNormalization, HVGs and scaling) to what you just performed for TD3A
Both samples were halved for their number of cells, for practical reasons in the context of this training (computation speed)
As the purpose is to integrate both samples into a single analytical entity, the objects we will use as input are Seurat objects up to the data scaling and regression (thus, there are no dimension reduction data in these)

2 Start Rstudio

Using the OpenOnDemand cheat sheet, connect to the OpenOnDemand portal and create a Rstudio session with the right resource requirements.

3 Warm-up

We set common parameters we will use throughout this session :

## Seed for the RNG (our sole parameter for this session)
my_seed <- 1337L

## Required for our big objects
options(future.globals.maxSize = 1E+09)

library(Seurat)
library(ggplot2)
library(patchwork)
library(dplyr)

4 Prepare the data structure

We will do the same as for former steps, just changing the session name :

4.1 Main directory

## Setting the project name
project_name <- "ebaii_sc_teachers"  # Do not copy-paste this ! It's MY project !!

## Preparing the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")

## Creating the root directory
dir.create(path = TD_dir, recursive = TRUE)

## Print the root directory on-screen
print(TD_dir)

[1] "/shared/projects/ebaii_sc_teachers/SC_TD"

4.2 Current session

## Creating the session (Preproc.1) directory
session_dir <- paste0(TD_dir, "/06_Integration")
dir.create(path = session_dir, recursive = TRUE)
## Print the session directory on-screen
print(session_dir)

[1] "/shared/projects/ebaii_sc_teachers/SC_TD/06_Integration"

4.3 Input directory

## Creating the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)

## Print the input directory on-screen
print(input_dir)

[1] "/shared/projects/ebaii_sc_teachers/SC_TD/06_Integration/DATA"

4.4 Output directory

## Creating the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)

## Print the output directory on-screen
print(output_dir)

[1] "/shared/projects/ebaii_sc_teachers/SC_TD/06_Integration/RESULTS"

5 Load input data

The data consists into a pair of Seurat objects (one per sample) as RDS, hosted in a Zenodo respository (Id : 14081572)

5.1 Download data

We will directly retrieve data from Zenodo to your input_dir :

### Named files (will be used later on !)
TD3A_rds <- "TD3A_Filtered_12508.2018.RDS"
TDCT_rds <- "TDCT_Filtered_12254.1868.RDS"

## Filename(s) to retrieve
toget_files <- c(TD3A_rds,
                 TDCT_rds)

## Folder to store retrieved files
local_folder <- input_dir

## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")

## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")

## Zenodo ID
zen_id <- '14094752'

### Define remote folder
remote_folder <- if (backup) "/shared/projects/2422_ebaii_n1/atelier_scrnaseq/TD/BACKUP/INTEGRATION/" else paste0("https://zenodo.org/records/", zen_id, "/files/")

### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)

### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)

## Retrieve files (if they don't exist), in loop
for (tg in seq_along(toget_files)) {
  ## If the file does not locally exist
  if (!file.exists(local_path[tg]) | force) {
    ## Retrieve data
    if(backup) {
      file.copy(from = remote_path[tg],
                to = local_path[tg])
    } else {
      download.file(url = remote_path[tg], 
                    destfile = local_path[tg])
    }
    ## Check if downloaded files exist locally
    if(file.exists(local_path[tg])) message("\tOK")
  } else message(paste0(toget_files[tg], " already downloaded !"))
}

TD3A_Filtered_12508.2018.RDS already downloaded !

TDCT_Filtered_12254.1868.RDS already downloaded !

5.2 Load into R

## Loading TD3A
TD3A_so <- readRDS(file = paste0(input_dir, 
                                   "/", 
                                   TD3A_rds))

## Loading TDCT
TDCT_so <- readRDS(file = paste0(input_dir, 
                                   "/",
                                   TDCT_rds))

6 Merge datasets

We simply merge the two objects into a single one :

TD3A_TDCT_merge <- merge(
  x = TD3A_so, ## First object to merge
  y = TDCT_so, ## Second object to merge
  add.cell.ids = c("TD3A", "TDCT"), ## Keep track of sample of origin for cells (optional)
)

We can take a look at its summary

## See the object summary
TD3A_TDCT_merge

Show output

An object of class Seurat 
12926 features across 3886 samples within 1 assay 
Active assay: RNA (12926 features, 0 variable features)
 4 layers present: counts.TD3A, counts.TDCT, data.TD3A, data.TDCT

## TD3A
dim(TD3A_so)

[1] 12508  2018

## TDCT
dim(TDCT_so)

[1] 12254  1868

## Merged
dim(TD3A_TDCT_merge)

[1] 12926  3886

## Clean unneeded objects
rm(TD3A_so, TDCT_so)

6.1 Process the merged dataset

In order to visualize the data in a 2-D space, one needs to process it the way one learnt :

## All steps
TD3A_TDCT_merge <- NormalizeData(object = TD3A_TDCT_merge, verbose = FALSE)
TD3A_TDCT_merge <- FindVariableFeatures(object = TD3A_TDCT_merge, nfeatures = 2000, verbose = FALSE)
TD3A_TDCT_merge <- ScaleData(object = TD3A_TDCT_merge, verbose = FALSE)
TD3A_TDCT_merge <- RunPCA(object = TD3A_TDCT_merge, npcs = 50, verbose = FALSE)

ElbowPlot(object = TD3A_TDCT_merge, ndims = 50, reduction = "pca")

Show plot

6.2 Visualization of batch effect

DimPlot(object = TD3A_TDCT_merge, 
        reduction = "pca",
        group.by = "orig.ident", 
        cols = sample_cols)

Show plot

TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, dims = 1:20, verbose = FALSE)

DimPlot(object = TD3A_TDCT_merge, 
        reduction = "umap",
        group.by = "orig.ident", 
        cols = sample_cols)

Show plot

CCL ??

7 Integration

7.1 Seurat : CCA

7.1.1 CCA integration

computing_time <- data.frame()

start <- Sys.time()
TD3A_TDCT_merge <- IntegrateLayers(object = TD3A_TDCT_merge, 
                                      orig.reduction = "pca",
                                      method = CCAIntegration, 
                                      new.reduction = "integrated.cca", 
                                      normalization.method = "LogNormalize")
end <- Sys.time()
computing_time["cca","comptime"] <- end-start

7.1.2 Seurat : RPCA

start <- Sys.time()
TD3A_TDCT_merge <- IntegrateLayers(object = TD3A_TDCT_merge, 
                                      orig.reduction = "pca",
                                      method = RPCAIntegration, 
                                      new.reduction = "integrated.rpca", 
                                      normalization.method = "LogNormalize")
end <- Sys.time()
computing_time["rPCA","comptime"] <- end-start

7.1.3 Seurat : Harmony

start <- Sys.time()
TD3A_TDCT_merge <- IntegrateLayers(object = TD3A_TDCT_merge, 
                                      orig.reduction = "pca",
                                      method = HarmonyIntegration, 
                                      new.reduction = "harmony",
                                      normalization.method = "LogNormalize")
end <- Sys.time()
computing_time["Harmony","comptime"] <- end-start

Show output

	comptime
cca	6.603909 secs
rPCA	8.666372 secs
Harmony	2.732028 secs

Reductions(object = TD3A_TDCT_merge)

7.2 Computing umap from “corrected” spaces

TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, 
                           reduction = "integrated.cca", 
                           dims = 1:20, 
                           reduction.name = "umapcca")

TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, 
                           reduction = "integrated.rpca",
                           dims = 1:20, 
                           reduction.name = "umaprpca")

TD3A_TDCT_merge <- RunUMAP(object = TD3A_TDCT_merge, 
                           reduction = "harmony", 
                           dims = 1:20, 
                           reduction.name = "umapharm")

Reductions(object = TD3A_TDCT_merge)

Show output

[1] "pca"             "umap"            "integrated.cca"  "integrated.rpca"
[5] "harmony"         "umapcca"         "umaprpca"        "umapharm"

7.3 Effect of the correction/integration

7.4 Visualization of the different methods of integration: Projecting the sample id onto the different umap spaces obtained

Show plot

7.4.1 Clustering

7.4.1.0.1 Graph based on CCA Intgration

TD3A_TDCT_merge <- FindNeighbors(object = TD3A_TDCT_merge, 
                                 reduction = "integrated.cca", 
                                 dims = 1:20, 
                                 graph.name = "snn_cca")
TD3A_TDCT_merge <- FindClusters(object = TD3A_TDCT_merge,
                                graph.name = "snn_cca",
                                resolution = c(.5,.8,1), 
                                algorithm = 4)

7.4.1.0.2 Graph based on rPCA Intgration

TD3A_TDCT_merge <- FindNeighbors(object = TD3A_TDCT_merge, 
                                 reduction = "integrated.rpca", 
                                 dims = 1:20, 
                                 graph.name = "snn_rpca")
TD3A_TDCT_merge <- FindClusters(object = TD3A_TDCT_merge,
                                graph.name = "snn_rpca",
                                resolution = c(.5,.8,1), 
                                algorithm = 4)

7.4.1.0.3 Graph based on Harmony Intgration

TD3A_TDCT_merge <- FindNeighbors(object = TD3A_TDCT_merge, 
                                 reduction = "harmony", 
                                 dims = 1:20, 
                                 graph.name = "snn_harm")
TD3A_TDCT_merge <- FindClusters(object = TD3A_TDCT_merge,
                                graph.name = "snn_harm",
                                resolution = c(.5,.8,1), 
                                algorithm = 4)

Show plot

table("snn_harm_res.1" = TD3A_TDCT_merge$snn_harm_res.1,
      "snn_rpca_res.1" = TD3A_TDCT_merge$snn_rpca_res.1)

Show output

              snn_rpca_res.1
snn_harm_res.1   1   2   3   4   5   6   7   8   9  10  11  12  13
            1  448   1  18 124  50   1   0   3   4   0   0   0   0
            2   74   0  32  58  72   0   0 294   6   0   0   0   0
            3   21   0 454   0   0   0   0  13   4   0   0   0   0
            4    0 439   0   2   3   5   0   0   1   0   0   0  12
            5    3  60   0  41 303   6   0   1   1   0   0   0   0
            6   10   0   0   0   0 373   0   0   1   1   0   0   0
            7    5  20   3 224   4   0   0   2   3   0   0   0   0
            8    0   0   0   0   0   0 243   0   0   0   0   0   1
            9    8   5   6   5   6   1   0   2 116   0   0   0   0
            10   0   0   0   0   0   0  82   0   0   0   0   0   0
            11   0   0   0   3   0   0   0   2   0  69   0   0   0
            12   0   0   0   0   0   1   0   0   0   0  64   0   0
            13   0   0   0   0   0   1   0   0   0   1   1  47   0
            14   0   0   0   0   0   0   0   0   0   0   0   0  22

table(TD3A_TDCT_merge$orig.ident, TD3A_TDCT_merge$snn_cca_res.1)

Show output

      
         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  TD3A 422 184 226 193 199 239 144  75 113  69  24  59  54   7  10
  TDCT 286 326 229 207 197 114 136 168  31  32  58   8   9  43  24

table(TD3A_TDCT_merge$orig.ident, TD3A_TDCT_merge$snn_harm_res.1)

Show output

      
         1   2   3   4   5   6   7   8   9  10  11  12  13  14
  TD3A 423 218 303 144 142 190 231  77 130  24  67  56   6   7
  TDCT 226 318 189 318 273 195  30 167  19  58   7   9  44  15

TD3A_TDCT_merge <- JoinLayers(object = TD3A_TDCT_merge, assay = "RNA")

If we have extra-time

Idents(object = TD3A_TDCT_merge) <- "snn_harm_res.1"
harm_res1_markers <- FindAllMarkers(
  object = TD3A_TDCT_merge,
  only.pos = TRUE,
  logfc.threshold = 1
)

harm_res1_topmarkers <- harm_res1_markers %>%
  group_by(cluster) %>%
  dplyr::slice_head(n = 20)

DoHeatmap(object = TD3A_TDCT_merge,
          features = harm_res1_topmarkers$gene,
          group.by = "snn_harm_res.1")

Show plot

8 Save the Seurat object

## A name for our Seurat object file (rich naming)
out_name <- paste0(
          output_dir, "/", paste(
            c("Twelve", Seurat::Project(TD3A_TDCT_merge), "S5", 
              "Integrated", paste(
                dim(TD3A_TDCT_merge), collapse = '.'
              )
            ), collapse = "_"),
            ".RDS")

## Check (I like this)
print(out_name)

[1] "/shared/projects/ebaii_sc_teachers/SC_TD/06_Integration/RESULTS/Twelve_SeuratProject_S5_Integrated_12926.3886.RDS"

## Write on disk
saveRDS(object = TD3A_TDCT_merge, 
        file = out_name)

9 Rsession

utils::sessionInfo()

Show output

R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.29.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] future_1.49.0      dplyr_1.1.4        patchwork_1.3.0    ggplot2_3.5.2     
[5] Seurat_5.3.0       SeuratObject_5.1.0 sp_2.2-0          

loaded via a namespace (and not attached):
  [1] deldir_2.0-4           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.6            magrittr_2.0.3         RcppAnnoy_0.0.22      
  [7] spatstat.geom_3.4-1    matrixStats_1.5.0      ggridges_0.5.6        
 [10] compiler_4.4.1         png_0.1-8              vctrs_0.6.5           
 [13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3       
 [16] fastmap_1.2.0          labeling_0.4.3         rmdformats_1.0.4      
 [19] promises_1.3.2         rmarkdown_2.29         purrr_1.0.4           
 [22] xfun_0.52              cachem_1.1.0           jsonlite_2.0.0        
 [25] goftest_1.2-3          later_1.4.2            spatstat.utils_3.1-4  
 [28] irlba_2.3.5.1          parallel_4.4.1         cluster_2.1.6         
 [31] R6_2.6.1               ica_1.0-3              spatstat.data_3.1-6   
 [34] bslib_0.9.0            stringi_1.8.7          RColorBrewer_1.1-3    
 [37] limma_3.60.6           reticulate_1.42.0      spatstat.univar_3.1-3 
 [40] parallelly_1.45.0      lmtest_0.9-40          jquerylib_0.1.4       
 [43] scattermore_1.2        Rcpp_1.0.14            bookdown_0.39         
 [46] knitr_1.50             tensor_1.5             future.apply_1.11.3   
 [49] zoo_1.8-14             sctransform_0.4.2      httpuv_1.6.15         
 [52] Matrix_1.7-3           splines_4.4.1          igraph_2.1.4          
 [55] tidyselect_1.2.1       abind_1.4-8            rstudioapi_0.17.1     
 [58] dichromat_2.0-0.1      yaml_2.3.10            spatstat.random_3.4-1 
 [61] codetools_0.2-20       miniUI_0.1.2           spatstat.explore_3.4-3
 [64] listenv_0.9.1          lattice_0.22-6         tibble_3.2.1          
 [67] plyr_1.8.9             withr_3.0.2            shiny_1.10.0          
 [70] ROCR_1.0-11            evaluate_1.0.3         Rtsne_0.17            
 [73] fastDummies_1.7.5      survival_3.7-0         polyclip_1.10-7       
 [76] fitdistrplus_1.2-2     pillar_1.10.2          leidenbase_0.1.35     
 [79] KernSmooth_2.23-24     plotly_4.10.4          generics_0.1.4        
 [82] RcppHNSW_0.6.0         scales_1.4.0           globals_0.18.0        
 [85] xtable_1.8-4           RhpcBLASctl_0.23-42    glue_1.8.0            
 [88] lazyeval_0.2.2         tools_4.4.1            data.table_1.17.4     
 [91] RSpectra_0.16-2        RANN_2.6.2             dotCall64_1.2         
 [94] cowplot_1.1.3          grid_4.4.1             tidyr_1.3.1           
 [97] nlme_3.1-165           presto_1.0.0           cli_3.6.5             
[100] spatstat.sparse_3.1-0  spam_2.11-1            viridisLite_0.4.2     
[103] uwot_0.2.3             gtable_0.3.6           sass_0.4.10           
[106] digest_0.6.37          progressr_0.15.1       ggrepel_0.9.6         
[109] htmlwidgets_1.6.4      farver_2.1.2           htmltools_0.5.8.1     
[112] lifecycle_1.0.4        httr_1.4.7             statmod_1.5.0         
[115] mime_0.13              harmony_1.2.3          MASS_7.3-65

EB3I n1 2025 scRNAseq-INTEGRATION-