Allows to hide the TOC by default, display it with a button, move it to the right or left of the page

Load the latest Seurat object

Start a Rstudio session

Load the latest Seurat objects saved as RDS.

data_dir <- '/shared/projects/2325_ebaii/SingleCell/'
s1 <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/TD3A_LogNorm.RDS'))
s2 <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/TDCT_LogNorm.RDS'))

Have a look at the description of the data

EBAII.n1.SC.helper::seurat4_descriptor(sobj = s1, describe = "assay")

OBJECT VERSION :    5.0.0 
PROJECT :   [TD3A] 

[ASSAYS]
   ASSAY 1 :    [RNA] [ACTIVE] 
      SLOT 1 :  [counts]    Dims:[12508 x 4038]  Range:[0.00-1588.00] 
         Counts :   14263194 
         Sparsity : 86.87712% 
      SLOT 2 :  [data]  Dims:[12508 x 4038]  Range:[0.00-7.71] 
         Sparsity : 86.87712% 
      SLOT 3 :  [scale.data]    Dims:[0 x 0]

EBAII.n1.SC.helper::seurat4_descriptor(sobj = s2, describe = "assay")

OBJECT VERSION :    5.0.0 
PROJECT :   [TDCT] 

[ASSAYS]
   ASSAY 1 :    [RNA] [ACTIVE] 
      SLOT 1 :  [counts]    Dims:[12254 x 3671]  Range:[0.00-1010.00] 
         Counts :   15936670 
         Sparsity : 87.47972% 
      SLOT 2 :  [data]  Dims:[12254 x 3671]  Range:[0.00-7.20] 
         Sparsity : 87.47972% 
      SLOT 3 :  [scale.data]    Dims:[0 x 0]

Evaluation of batch effect

First, we will merge (concatenate) the data.

sobj <- merge(s1, s2, merge.data = TRUE)

dim(s1)

[1] 12508  4038

dim(s2)

[1] 12254  3671

dim(sobj)

[1] 12926  7709

rm(s1, s2)

We will have a look at their low dimension representation. With the help of the quick visualization function, we will plot UMAP and color the cells according to their sample of origin.

VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'data', group_by = 'orig.ident', return_object = TRUE)

VIZ

An object of class Seurat 
12926 features across 7709 samples within 1 assay 
Active assay: RNA (12926 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap

Cells are clearly separated depending on their sample of origin. There is a batch effect that needs to be corrected.

Integration with Harmony

Scaling and bias regression

Harmony works on normalized and scaled data. We will need to scale the new merged object. Scaling is also the good time to try to correct technical sources of biases.

Indeed, some technical factors can bias the data: they will affect the UMAP, the future clustering…

These technical factors can be the number of UMIs(nCount_RNA), the percent of mitochondrial RNA… All the factor that we see in the metadata table could be a source of bias.

head(sobj@meta.data)

                     orig.ident nCount_RNA nFeature_RNA log10_nCount_RNA
AAACCTGAGACGCTTT.1_1       TD3A       2807         1588         3.449324
AAACCTGAGGCATTGG.1_1       TD3A       2072         1365         3.316599
AAACCTGGTCAACATC.1_1       TD3A       2024         1340         3.306639
AAACCTGTCGAGGTAG.1_1       TD3A       1877         1241         3.273696
AAACCTGTCGATCCCT.1_1       TD3A       2213         1438         3.345766
AAACGGGCACTTACGA.1_1       TD3A       2445         1428         3.388456
                     percent_mt percent_rb percent_st CC_Seurat_S.Score
AAACCTGAGACGCTTT.1_1 0.01670814 0.12193388 0.02950587        -0.3206662
AAACCTGAGGCATTGG.1_1 0.03426641 0.05260618 0.02702703        -0.2089766
AAACCTGGTCAACATC.1_1 0.01975309 0.08592593 0.02419753        -0.2021102
AAACCTGTCGAGGTAG.1_1 0.02930208 0.08044752 0.03036761        -0.2546360
AAACCTGTCGATCCCT.1_1 0.03203971 0.07626354 0.02933213        -0.2495052
AAACGGGCACTTACGA.1_1 0.03312883 0.08957055 0.03353783        -0.3101915
                     CC_Seurat_G2M.Score CC_Seurat_Phase CC_Seurat_SmG2M.Score
AAACCTGAGACGCTTT.1_1         -0.14298427              G1           -0.17768197
AAACCTGAGGCATTGG.1_1         -0.15818260              G1           -0.05079398
AAACCTGGTCAACATC.1_1         -0.17575067              G1           -0.02635950
AAACCTGTCGAGGTAG.1_1         -0.01628821              G1           -0.23834777
AAACCTGTCGATCCCT.1_1         -0.13375529              G1           -0.11574990
AAACGGGCACTTACGA.1_1         -0.18062013              G1           -0.12957140
                     doublet_scds.hybrid doublet_scDblFinder doublet_union
AAACCTGAGACGCTTT.1_1               FALSE               FALSE         FALSE
AAACCTGAGGCATTGG.1_1               FALSE               FALSE         FALSE
AAACCTGGTCAACATC.1_1               FALSE               FALSE         FALSE
AAACCTGTCGAGGTAG.1_1               FALSE               FALSE         FALSE
AAACCTGTCGATCCCT.1_1               FALSE               FALSE         FALSE
AAACGGGCACTTACGA.1_1               FALSE               FALSE         FALSE
                     doublet_viz
AAACCTGAGACGCTTT.1_1     singlet
AAACCTGAGGCATTGG.1_1     singlet
AAACCTGGTCAACATC.1_1     singlet
AAACCTGTCGAGGTAG.1_1     singlet
AAACCTGTCGATCCCT.1_1     singlet
AAACGGGCACTTACGA.1_1     singlet

We will check if this is the case with a helper function

EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20)

The heatmap shows that the contribution of the technical factors to the PCA dimension. The strength of this contribution is indicated by the intensity of the blue color.

Ideally, if the factor is not a source of bias, its contribution to all dimensions is roughly the same. If the factor contributes a lot to a few dimensions, we should consider regressing it out.

Here we will need to remove the influence of these technical factor. We say that we regress out some variables. This is done within the ScaleData() function.

! technical issue: this step take time, we will not run the following command. We will load a scaled object instead.

## technical issue: on the cluster it take too much time, we will not run this command
## we will load a preconstructed file instead.
vtr = c('nCount_RNA', 'percent_rb')
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr)
sobj <- Seurat::FindVariableFeatures(sobj)

saveRDS(sobj, paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/object_to_integrate.RDS'))

sobj <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/object_to_integrate.RDS'))

sobj <- Seurat::RunPCA(object = sobj, npcs = 50, seed.use = my_seed, verbose = FALSE)
EBAII.n1.SC.helper::dimred_covar_cor(sobj = sobj, dimred = 'pca', ncomp = 20)

Dimension reduction

Harmony works in low dimension. We have just applied PCA to the data.

Remind that PCA replaces the thousand of genes by a small number of dimesions that summarize most of the information.

For integration, we want to use only the informative dimensions. We will choose them with a elbow plot.

### Finding the number of dims to use ====
Seurat::ElbowPlot(object = sobj, reduction = 'pca', ndims = 50)

## Result : we will keep 20 dimensions for Harmony
KDIM <- 20

The dataset is quite simple. Here we see that the 20 first dimensions summarize most of the information (most of the variability). So we will only need to make Harmony work on those 20 dimensions.

For more complex datasets, the default value is generally 50, but it needs to be adjusted to each specific dataset.

Harmony integration

We are now ready to apply Harmony to integrated the different datasets.

sobj <- harmony::RunHarmony(object = sobj, reduction.use = 'pca', dims.use = 1:KDIM, group.by.vars = 'orig.ident', max_iter = 100) ## or sigma .2

Control : evaluating the residual effect of covariables

We removed the principal sources of biases and integrated the datasets. We can once again look at the heatmap of the effect of the technical factors to make sure that we remove the majority of the

VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'harmony', slot = NULL, dimred = 'harmony', ncomp = KDIM, group_by = 'orig.ident', return_object = TRUE)

EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'harmony', ncomp = 20)

Last thoughts: integration vs bias regression

We saw earlier that the columns of the metadata can be seen as potential technical biases to correct.

colnames(sobj@meta.data)

 [1] "orig.ident"            "nCount_RNA"            "nFeature_RNA"         
 [4] "log10_nCount_RNA"      "percent_mt"            "percent_rb"           
 [7] "percent_st"            "CC_Seurat_S.Score"     "CC_Seurat_G2M.Score"  
[10] "CC_Seurat_Phase"       "CC_Seurat_SmG2M.Score" "doublet_scds.hybrid"  
[13] "doublet_scDblFinder"   "doublet_union"         "doublet_viz"

One of the metadata columns, orig.ident indicates the sample of origin, that is to say the information about the batch. This is precisely the “bias” that we try to remove when we perform integration.

So why can’t we just regress out orig.ident as any other column ?

Earlier, we regressed out nCount_RNA and percent_rb with the command:

vtr = c('nCount_RNA', 'percent_rb')
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr)

Why do we need special integration methods, couldn’t we just run the command

vtr = c('nCount_RNA', 'percent_rb', 'orig.ident')
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr)

Let’s have a try.

vtr <- 'orig.ident'
sobj <- Seurat::FindVariableFeatures(object = sobj, assay = 'RNA', nfeatures = 2000)
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', vars.to.regress = vtr)

VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'scale.data', group_by = 'orig.ident', return_object = TRUE)

EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20)

Correcting only this variable is not enough.

vtr <- c('orig.ident', 'nCount_RNA', 'percent_rb')
sobj <- Seurat::FindVariableFeatures(object = sobj, assay = 'RNA', nfeatures = 2000)
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', vars.to.regress = vtr)

VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'scale.data', group_by = 'orig.ident', return=TRUE)

EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20)

This is better but not as good as Harmony.

Actually, the batch effect has a very complex structure while the method used for bias correction is quite simple. It isl not able to correct the batch effect in a proper manner. It might not align correctly the cell populations or over-correct the data and align different cell populations !

The integration methods were developed specifically to account for the complexity of the batch effect and give better results.

Save results

saveRDS(object = sobj, file = './TD3A_integrated.RDS', 
        compress = 'bzip2')

Rsession

utils::sessionInfo()

R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biobase_2.58.0      BiocGenerics_0.44.0 SeuratObject_5.0.0 
[4] Seurat_4.4.0       

loaded via a namespace (and not attached):
  [1] rsvd_1.0.5                    Hmisc_5.1-1                  
  [3] ica_1.0-3                     foreach_1.5.2                
  [5] lmtest_0.9-40                 crayon_1.5.2                 
  [7] MASS_7.3-60                   nlme_3.1-163                 
  [9] backports_1.4.1               rlang_1.1.2                  
 [11] XVector_0.38.0                ROCR_1.0-11                  
 [13] irlba_2.3.5.1                 SparseM_1.81                 
 [15] SoupX_1.6.2                   limma_3.54.2                 
 [17] ca_0.71.1                     filelock_1.0.2               
 [19] pcaExplorer_2.24.0            GOstats_2.64.0               
 [21] BiocParallel_1.32.6           rjson_0.2.21                 
 [23] bit64_4.0.5                   glue_1.6.2                   
 [25] harmony_1.1.0                 pheatmap_1.0.12              
 [27] rngtools_1.5.2                sctransform_0.4.1            
 [29] parallel_4.2.3                spatstat.sparse_3.0-3        
 [31] AnnotationDbi_1.60.2          shinyAce_0.4.2               
 [33] shinydashboard_0.7.2          dotCall64_1.1-0              
 [35] spatstat.geom_3.2-7           tidyselect_1.2.0             
 [37] SummarizedExperiment_1.28.0   fitdistrplus_1.1-11          
 [39] XML_3.99-0.15                 tidyr_1.3.0                  
 [41] zoo_1.8-12                    xtable_1.8-4                 
 [43] magrittr_2.0.3                evaluate_0.23                
 [45] ggplot2_3.4.4                 scuttle_1.8.4                
 [47] cli_3.6.1                     zlibbioc_1.44.0              
 [49] rstudioapi_0.15.0             miniUI_0.1.1.1               
 [51] sp_2.1-1                      bslib_0.5.1                  
 [53] rpart_4.1.21                  shiny_1.7.5.1                
 [55] BiocSingular_1.14.0           xfun_0.41                    
 [57] clue_0.3-65                   cluster_2.1.4                
 [59] TSP_1.2-4                     KEGGREST_1.38.0              
 [61] tibble_3.2.1                  interactiveDisplayBase_1.36.0
 [63] ggrepel_0.9.4                 threejs_0.3.3                
 [65] listenv_0.9.0                 dendextend_1.17.1            
 [67] Biostrings_2.66.0             png_0.1-8                    
 [69] future_1.33.0                 withr_2.5.2                  
 [71] shinyBS_0.61.1                bitops_1.0-7                 
 [73] RBGL_1.74.0                   plyr_1.8.9                   
 [75] GSEABase_1.60.0               dqrng_0.3.1                  
 [77] pillar_1.9.0                  GlobalOptions_0.1.2          
 [79] cachem_1.0.8                  GetoptLong_1.0.5             
 [81] DelayedMatrixStats_1.20.0     vctrs_0.6.4                  
 [83] ellipsis_0.3.2                generics_0.1.3               
 [85] NMF_0.26                      tools_4.2.3                  
 [87] foreign_0.8-85                munsell_0.5.0                
 [89] DelayedArray_0.24.0           fastmap_1.1.1                
 [91] compiler_4.2.3                abind_1.4-5                  
 [93] httpuv_1.6.12                 ExperimentHub_2.6.0          
 [95] plotly_4.10.3                 GenomeInfoDbData_1.2.9       
 [97] gridExtra_2.3                 edgeR_3.40.2                 
 [99] lattice_0.22-5                deldir_1.0-9                 
[101] AnnotationForge_1.40.2        utf8_1.2.4                   
[103] later_1.3.1                   dplyr_1.1.3                  
[105] BiocFileCache_2.6.1           jsonlite_1.8.7               
[107] scales_1.2.1                  graph_1.76.0                 
[109] ScaledMatrix_1.6.0            pbapply_1.7-2                
[111] sparseMatrixStats_1.10.0      genefilter_1.80.3            
[113] lazyeval_0.2.2                promises_1.2.1               
[115] doParallel_1.0.17             goftest_1.2-3                
[117] spatstat.utils_3.0-4          reticulate_1.34.0            
[119] checkmate_2.3.0               rmarkdown_2.25               
[121] cowplot_1.1.1                 statmod_1.5.0                
[123] webshot_0.5.5                 Rtsne_0.16                   
[125] uwot_0.1.16                   igraph_1.5.1                 
[127] survival_3.5-7                yaml_2.3.7                   
[129] htmltools_0.5.7               memoise_2.0.1                
[131] locfit_1.5-9.8                seriation_1.5.1              
[133] IRanges_2.32.0                viridisLite_0.4.2            
[135] RhpcBLASctl_0.23-42           digest_0.6.33                
[137] assertthat_0.2.1              mime_0.12                    
[139] rappdirs_0.3.3                SingleR_2.0.0                
[141] registry_0.5-1                spam_2.10-0                  
[143] RSQLite_2.3.3                 future.apply_1.11.0          
[145] data.table_1.14.8             blob_1.2.4                   
[147] S4Vectors_0.36.2              EBAII.n1.SC.helper_0.0.3     
[149] splines_4.2.3                 Formula_1.2-5                
[151] labeling_0.4.3                Cairo_1.6-1                  
[153] AnnotationHub_3.6.0           RCurl_1.98-1.13              
[155] hms_1.1.3                     colorspace_2.1-0             
[157] base64enc_0.1-3               BiocManager_1.30.22          
[159] GenomicRanges_1.50.2          shape_1.4.6                  
[161] nnet_7.3-19                   sass_0.4.7                   
[163] Rcpp_1.0.11                   RANN_2.6.1                   
[165] circlize_0.4.15               fansi_1.0.5                  
[167] parallelly_1.36.0             R6_2.5.1                     
[169] grid_4.2.3                    ggridges_0.5.4               
[171] lifecycle_1.0.3               bluster_1.8.0                
[173] curl_5.1.0                    leiden_0.4.3                 
[175] jquerylib_0.1.4               Matrix_1.6-1.1               
[177] RcppAnnoy_0.0.21              RColorBrewer_1.1-3           
[179] iterators_1.0.14              spatstat.explore_3.2-5       
[181] stringr_1.5.0                 topGO_2.50.0                 
[183] htmlwidgets_1.6.2             beachmat_2.14.2              
[185] polyclip_1.10-6               biomaRt_2.54.1               
[187] purrr_1.0.2                   crosstalk_1.2.0              
[189] ComplexHeatmap_2.14.0         globals_0.16.2               
[191] htmlTable_2.4.2               patchwork_1.1.3              
[193] spatstat.random_3.2-1         progressr_0.14.0             
[195] codetools_0.2-19              matrixStats_1.0.0            
[197] GO.db_3.16.0                  metapod_1.6.0                
[199] prettyunits_1.2.0             SingleCellExperiment_1.20.1  
[201] dbplyr_2.4.0                  gridBase_0.4-7               
[203] GenomeInfoDb_1.34.9           celldex_1.8.0                
[205] gtable_0.3.4                  DBI_1.1.3                    
[207] stats4_4.2.3                  tensor_1.5                   
[209] httr_1.4.7                    highr_0.10                   
[211] KernSmooth_2.23-22            stringi_1.7.12               
[213] progress_1.2.2                reshape2_1.4.4               
[215] farver_2.1.1                  heatmaply_1.5.0              
[217] annotate_1.76.0               viridis_0.6.4                
[219] Rgraphviz_2.42.0              magick_2.8.1                 
[221] DT_0.30                       xml2_1.3.5                   
[223] BiocNeighbors_1.16.0          geneplotter_1.76.0           
[225] Category_2.64.0               scattermore_1.2              
[227] BiocVersion_3.16.0            DESeq2_1.38.3                
[229] scran_1.26.2                  bit_4.0.5                    
[231] MatrixGenerics_1.10.0         spatstat.data_3.0-3          
[233] pkgconfig_2.0.3               knitr_1.45

¶ EBAII n1 2023 : SINGLE CELL ANALYSIS TRAINING
PROCESSING (I)
Data integration and bias correction ¶

Bastien JOB

Remi MONTAGNE

2023-11-08.10

Load the latest Seurat object

Evaluation of batch effect

Integration with Harmony

Scaling and bias regression

Dimension reduction

Harmony integration

Control : evaluating the residual effect of covariables

Last thoughts: integration vs bias regression

Save results

¶ EBAII n1 2023 : SINGLE CELL ANALYSIS TRAINING PROCESSING (I)Data integration and bias correction ¶

Bastien JOB

Remi MONTAGNE

2023-11-08.10

Load the latest Seurat object

Evaluation of batch effect

Integration with Harmony

Scaling and bias regression

Dimension reduction

Harmony integration

Control : evaluating the residual effect of covariables

Last thoughts: integration vs bias regression

Save results

¶ EBAII n1 2023 : SINGLE CELL ANALYSIS TRAINING
PROCESSING (I)
Data integration and bias correction ¶