Load the latest Seurat objects saved as RDS.
data_dir <- '/shared/projects/2325_ebaii/SingleCell/'
s1 <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/TD3A_LogNorm.RDS'))
s2 <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/TDCT_LogNorm.RDS'))
Have a look at the description of the data
EBAII.n1.SC.helper::seurat4_descriptor(sobj = s1, describe = "assay")
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[12508 x 4038] Range:[0.00-1588.00]
Counts : 14263194
Sparsity : 86.87712%
SLOT 2 : [data] Dims:[12508 x 4038] Range:[0.00-7.71]
Sparsity : 86.87712%
SLOT 3 : [scale.data] Dims:[0 x 0]
EBAII.n1.SC.helper::seurat4_descriptor(sobj = s2, describe = "assay")
OBJECT VERSION : 5.0.0
PROJECT : [TDCT]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[12254 x 3671] Range:[0.00-1010.00]
Counts : 15936670
Sparsity : 87.47972%
SLOT 2 : [data] Dims:[12254 x 3671] Range:[0.00-7.20]
Sparsity : 87.47972%
SLOT 3 : [scale.data] Dims:[0 x 0]
First, we will merge (concatenate) the data.
sobj <- merge(s1, s2, merge.data = TRUE)
dim(s1)
[1] 12508 4038
dim(s2)
[1] 12254 3671
dim(sobj)
[1] 12926 7709
rm(s1, s2)
We will have a look at their low dimension representation. With the help of the quick visualization function, we will plot UMAP and color the cells according to their sample of origin.
VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'data', group_by = 'orig.ident', return_object = TRUE)
VIZ
An object of class Seurat
12926 features across 7709 samples within 1 assay
Active assay: RNA (12926 features, 2000 variable features)
3 layers present: counts, data, scale.data
2 dimensional reductions calculated: pca, umap
Cells are clearly separated depending on their sample of origin. There is a batch effect that needs to be corrected.
Harmony works on normalized and scaled data. We will need to scale the new merged object. Scaling is also the good time to try to correct technical sources of biases.
Indeed, some technical factors can bias the data: they will affect the UMAP, the future clustering…
These technical factors can be the number of
UMIs(nCount_RNA
), the percent of mitochondrial RNA… All the
factor that we see in the metadata table could
be a source of bias.
head(sobj@meta.data)
orig.ident nCount_RNA nFeature_RNA log10_nCount_RNA
AAACCTGAGACGCTTT.1_1 TD3A 2807 1588 3.449324
AAACCTGAGGCATTGG.1_1 TD3A 2072 1365 3.316599
AAACCTGGTCAACATC.1_1 TD3A 2024 1340 3.306639
AAACCTGTCGAGGTAG.1_1 TD3A 1877 1241 3.273696
AAACCTGTCGATCCCT.1_1 TD3A 2213 1438 3.345766
AAACGGGCACTTACGA.1_1 TD3A 2445 1428 3.388456
percent_mt percent_rb percent_st CC_Seurat_S.Score
AAACCTGAGACGCTTT.1_1 0.01670814 0.12193388 0.02950587 -0.3206662
AAACCTGAGGCATTGG.1_1 0.03426641 0.05260618 0.02702703 -0.2089766
AAACCTGGTCAACATC.1_1 0.01975309 0.08592593 0.02419753 -0.2021102
AAACCTGTCGAGGTAG.1_1 0.02930208 0.08044752 0.03036761 -0.2546360
AAACCTGTCGATCCCT.1_1 0.03203971 0.07626354 0.02933213 -0.2495052
AAACGGGCACTTACGA.1_1 0.03312883 0.08957055 0.03353783 -0.3101915
CC_Seurat_G2M.Score CC_Seurat_Phase CC_Seurat_SmG2M.Score
AAACCTGAGACGCTTT.1_1 -0.14298427 G1 -0.17768197
AAACCTGAGGCATTGG.1_1 -0.15818260 G1 -0.05079398
AAACCTGGTCAACATC.1_1 -0.17575067 G1 -0.02635950
AAACCTGTCGAGGTAG.1_1 -0.01628821 G1 -0.23834777
AAACCTGTCGATCCCT.1_1 -0.13375529 G1 -0.11574990
AAACGGGCACTTACGA.1_1 -0.18062013 G1 -0.12957140
doublet_scds.hybrid doublet_scDblFinder doublet_union
AAACCTGAGACGCTTT.1_1 FALSE FALSE FALSE
AAACCTGAGGCATTGG.1_1 FALSE FALSE FALSE
AAACCTGGTCAACATC.1_1 FALSE FALSE FALSE
AAACCTGTCGAGGTAG.1_1 FALSE FALSE FALSE
AAACCTGTCGATCCCT.1_1 FALSE FALSE FALSE
AAACGGGCACTTACGA.1_1 FALSE FALSE FALSE
doublet_viz
AAACCTGAGACGCTTT.1_1 singlet
AAACCTGAGGCATTGG.1_1 singlet
AAACCTGGTCAACATC.1_1 singlet
AAACCTGTCGAGGTAG.1_1 singlet
AAACCTGTCGATCCCT.1_1 singlet
AAACGGGCACTTACGA.1_1 singlet
We will check if this is the case with a helper function
EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20)
The heatmap shows that the contribution of the technical factors to the PCA dimension. The strength of this contribution is indicated by the intensity of the blue color.
Ideally, if the factor is not a source of bias, its contribution to all dimensions is roughly the same. If the factor contributes a lot to a few dimensions, we should consider regressing it out.
Here we will need to remove the influence of these technical factor.
We say that we regress out some variables. This is done within the
ScaleData()
function.
! technical issue: this step take time, we will not run the following command. We will load a scaled object instead.
## technical issue: on the cluster it take too much time, we will not run this command
## we will load a preconstructed file instead.
vtr = c('nCount_RNA', 'percent_rb')
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr)
sobj <- Seurat::FindVariableFeatures(sobj)
saveRDS(sobj, paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/object_to_integrate.RDS'))
sobj <- readRDS(paste0(data_dir, 'TD_DATA/DATA_START/INTEGRATION/object_to_integrate.RDS'))
sobj <- Seurat::RunPCA(object = sobj, npcs = 50, seed.use = my_seed, verbose = FALSE)
EBAII.n1.SC.helper::dimred_covar_cor(sobj = sobj, dimred = 'pca', ncomp = 20)
Harmony works in low dimension. We have just applied PCA to the data.
Remind that PCA replaces the thousand of genes by a small number of dimesions that summarize most of the information.
For integration, we want to use only the informative dimensions. We will choose them with a elbow plot.
### Finding the number of dims to use ====
Seurat::ElbowPlot(object = sobj, reduction = 'pca', ndims = 50)
## Result : we will keep 20 dimensions for Harmony
KDIM <- 20
The dataset is quite simple. Here we see that the 20 first dimensions summarize most of the information (most of the variability). So we will only need to make Harmony work on those 20 dimensions.
For more complex datasets, the default value is generally 50, but it needs to be adjusted to each specific dataset.
We are now ready to apply Harmony to integrated the different datasets.
sobj <- harmony::RunHarmony(object = sobj, reduction.use = 'pca', dims.use = 1:KDIM, group.by.vars = 'orig.ident', max_iter = 100) ## or sigma .2
We removed the principal sources of biases and integrated the datasets. We can once again look at the heatmap of the effect of the technical factors to make sure that we remove the majority of the
VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'harmony', slot = NULL, dimred = 'harmony', ncomp = KDIM, group_by = 'orig.ident', return_object = TRUE)
EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'harmony', ncomp = 20)
We saw earlier that the columns of the metadata can be seen as potential technical biases to correct.
colnames(sobj@meta.data)
[1] "orig.ident" "nCount_RNA" "nFeature_RNA"
[4] "log10_nCount_RNA" "percent_mt" "percent_rb"
[7] "percent_st" "CC_Seurat_S.Score" "CC_Seurat_G2M.Score"
[10] "CC_Seurat_Phase" "CC_Seurat_SmG2M.Score" "doublet_scds.hybrid"
[13] "doublet_scDblFinder" "doublet_union" "doublet_viz"
One of the metadata columns, orig.ident
indicates the
sample of origin, that is to say the information about the batch. This
is precisely the “bias” that we try to remove when we perform
integration.
So why can’t we just regress out orig.ident
as any other
column ?
Earlier, we regressed out nCount_RNA
and
percent_rb
with the command:
vtr = c('nCount_RNA', 'percent_rb')
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr)
Why do we need special integration methods, couldn’t we just run the command
vtr = c('nCount_RNA', 'percent_rb', 'orig.ident')
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', nfeatures = 2000, vars.to.regress = vtr)
Let’s have a try.
vtr <- 'orig.ident'
sobj <- Seurat::FindVariableFeatures(object = sobj, assay = 'RNA', nfeatures = 2000)
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', vars.to.regress = vtr)
VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'scale.data', group_by = 'orig.ident', return_object = TRUE)
EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20)
Correcting only this variable is not enough.
vtr <- c('orig.ident', 'nCount_RNA', 'percent_rb')
sobj <- Seurat::FindVariableFeatures(object = sobj, assay = 'RNA', nfeatures = 2000)
sobj <- Seurat::ScaleData(object = sobj, assay = 'RNA', vars.to.regress = vtr)
VIZ <- EBAII.n1.SC.helper::QnD_viz(sobj = sobj, assay = 'RNA', slot = 'scale.data', group_by = 'orig.ident', return=TRUE)
EBAII.n1.SC.helper::dimred_covar_cor(sobj = VIZ, dimred = 'pca', ncomp = 20)
This is better but not as good as Harmony.
Actually, the batch effect has a very complex structure while the method used for bias correction is quite simple. It isl not able to correct the batch effect in a proper manner. It might not align correctly the cell populations or over-correct the data and align different cell populations !
The integration methods were developed specifically to account for the complexity of the batch effect and give better results.
saveRDS(object = sobj, file = './TD3A_integrated.RDS',
compress = 'bzip2')
Rsession
utils::sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.2.3/lib/libopenblasp-r0.3.21.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biobase_2.58.0 BiocGenerics_0.44.0 SeuratObject_5.0.0
[4] Seurat_4.4.0
loaded via a namespace (and not attached):
[1] rsvd_1.0.5 Hmisc_5.1-1
[3] ica_1.0-3 foreach_1.5.2
[5] lmtest_0.9-40 crayon_1.5.2
[7] MASS_7.3-60 nlme_3.1-163
[9] backports_1.4.1 rlang_1.1.2
[11] XVector_0.38.0 ROCR_1.0-11
[13] irlba_2.3.5.1 SparseM_1.81
[15] SoupX_1.6.2 limma_3.54.2
[17] ca_0.71.1 filelock_1.0.2
[19] pcaExplorer_2.24.0 GOstats_2.64.0
[21] BiocParallel_1.32.6 rjson_0.2.21
[23] bit64_4.0.5 glue_1.6.2
[25] harmony_1.1.0 pheatmap_1.0.12
[27] rngtools_1.5.2 sctransform_0.4.1
[29] parallel_4.2.3 spatstat.sparse_3.0-3
[31] AnnotationDbi_1.60.2 shinyAce_0.4.2
[33] shinydashboard_0.7.2 dotCall64_1.1-0
[35] spatstat.geom_3.2-7 tidyselect_1.2.0
[37] SummarizedExperiment_1.28.0 fitdistrplus_1.1-11
[39] XML_3.99-0.15 tidyr_1.3.0
[41] zoo_1.8-12 xtable_1.8-4
[43] magrittr_2.0.3 evaluate_0.23
[45] ggplot2_3.4.4 scuttle_1.8.4
[47] cli_3.6.1 zlibbioc_1.44.0
[49] rstudioapi_0.15.0 miniUI_0.1.1.1
[51] sp_2.1-1 bslib_0.5.1
[53] rpart_4.1.21 shiny_1.7.5.1
[55] BiocSingular_1.14.0 xfun_0.41
[57] clue_0.3-65 cluster_2.1.4
[59] TSP_1.2-4 KEGGREST_1.38.0
[61] tibble_3.2.1 interactiveDisplayBase_1.36.0
[63] ggrepel_0.9.4 threejs_0.3.3
[65] listenv_0.9.0 dendextend_1.17.1
[67] Biostrings_2.66.0 png_0.1-8
[69] future_1.33.0 withr_2.5.2
[71] shinyBS_0.61.1 bitops_1.0-7
[73] RBGL_1.74.0 plyr_1.8.9
[75] GSEABase_1.60.0 dqrng_0.3.1
[77] pillar_1.9.0 GlobalOptions_0.1.2
[79] cachem_1.0.8 GetoptLong_1.0.5
[81] DelayedMatrixStats_1.20.0 vctrs_0.6.4
[83] ellipsis_0.3.2 generics_0.1.3
[85] NMF_0.26 tools_4.2.3
[87] foreign_0.8-85 munsell_0.5.0
[89] DelayedArray_0.24.0 fastmap_1.1.1
[91] compiler_4.2.3 abind_1.4-5
[93] httpuv_1.6.12 ExperimentHub_2.6.0
[95] plotly_4.10.3 GenomeInfoDbData_1.2.9
[97] gridExtra_2.3 edgeR_3.40.2
[99] lattice_0.22-5 deldir_1.0-9
[101] AnnotationForge_1.40.2 utf8_1.2.4
[103] later_1.3.1 dplyr_1.1.3
[105] BiocFileCache_2.6.1 jsonlite_1.8.7
[107] scales_1.2.1 graph_1.76.0
[109] ScaledMatrix_1.6.0 pbapply_1.7-2
[111] sparseMatrixStats_1.10.0 genefilter_1.80.3
[113] lazyeval_0.2.2 promises_1.2.1
[115] doParallel_1.0.17 goftest_1.2-3
[117] spatstat.utils_3.0-4 reticulate_1.34.0
[119] checkmate_2.3.0 rmarkdown_2.25
[121] cowplot_1.1.1 statmod_1.5.0
[123] webshot_0.5.5 Rtsne_0.16
[125] uwot_0.1.16 igraph_1.5.1
[127] survival_3.5-7 yaml_2.3.7
[129] htmltools_0.5.7 memoise_2.0.1
[131] locfit_1.5-9.8 seriation_1.5.1
[133] IRanges_2.32.0 viridisLite_0.4.2
[135] RhpcBLASctl_0.23-42 digest_0.6.33
[137] assertthat_0.2.1 mime_0.12
[139] rappdirs_0.3.3 SingleR_2.0.0
[141] registry_0.5-1 spam_2.10-0
[143] RSQLite_2.3.3 future.apply_1.11.0
[145] data.table_1.14.8 blob_1.2.4
[147] S4Vectors_0.36.2 EBAII.n1.SC.helper_0.0.3
[149] splines_4.2.3 Formula_1.2-5
[151] labeling_0.4.3 Cairo_1.6-1
[153] AnnotationHub_3.6.0 RCurl_1.98-1.13
[155] hms_1.1.3 colorspace_2.1-0
[157] base64enc_0.1-3 BiocManager_1.30.22
[159] GenomicRanges_1.50.2 shape_1.4.6
[161] nnet_7.3-19 sass_0.4.7
[163] Rcpp_1.0.11 RANN_2.6.1
[165] circlize_0.4.15 fansi_1.0.5
[167] parallelly_1.36.0 R6_2.5.1
[169] grid_4.2.3 ggridges_0.5.4
[171] lifecycle_1.0.3 bluster_1.8.0
[173] curl_5.1.0 leiden_0.4.3
[175] jquerylib_0.1.4 Matrix_1.6-1.1
[177] RcppAnnoy_0.0.21 RColorBrewer_1.1-3
[179] iterators_1.0.14 spatstat.explore_3.2-5
[181] stringr_1.5.0 topGO_2.50.0
[183] htmlwidgets_1.6.2 beachmat_2.14.2
[185] polyclip_1.10-6 biomaRt_2.54.1
[187] purrr_1.0.2 crosstalk_1.2.0
[189] ComplexHeatmap_2.14.0 globals_0.16.2
[191] htmlTable_2.4.2 patchwork_1.1.3
[193] spatstat.random_3.2-1 progressr_0.14.0
[195] codetools_0.2-19 matrixStats_1.0.0
[197] GO.db_3.16.0 metapod_1.6.0
[199] prettyunits_1.2.0 SingleCellExperiment_1.20.1
[201] dbplyr_2.4.0 gridBase_0.4-7
[203] GenomeInfoDb_1.34.9 celldex_1.8.0
[205] gtable_0.3.4 DBI_1.1.3
[207] stats4_4.2.3 tensor_1.5
[209] httr_1.4.7 highr_0.10
[211] KernSmooth_2.23-22 stringi_1.7.12
[213] progress_1.2.2 reshape2_1.4.4
[215] farver_2.1.1 heatmaply_1.5.0
[217] annotate_1.76.0 viridis_0.6.4
[219] Rgraphviz_2.42.0 magick_2.8.1
[221] DT_0.30 xml2_1.3.5
[223] BiocNeighbors_1.16.0 geneplotter_1.76.0
[225] Category_2.64.0 scattermore_1.2
[227] BiocVersion_3.16.0 DESeq2_1.38.3
[229] scran_1.26.2 bit_4.0.5
[231] MatrixGenerics_1.10.0 spatstat.data_3.0-3
[233] pkgconfig_2.0.3 knitr_1.45