This file describes the different steps to perform first part of data processing for the single cell RNAseq data analysis training course for the EBAII n1 2023, covering these steps : * Normalization of the raw counts * Scaling
Set your working directory in your TD output directory (golf is mine!)
Load the latest Seurat object saved as RDS
But, how ?
Until now, we manipulated a raw count matrix.
During QC / preprocessing, we observed that cells expression profiles had a wide variety in multiple metrics, be it their number of expressed genes, their amount of captured RNA molecules, their rate of expression involved in mitochondrion genes, etc…
Consequently, we have to harmonize the data to try to identify biological differences drowned into artefactual differences.
This is what is performed during normalization.
Here, we will use the two main normalization methods available in Seurat :
Let’s describe this modified Seurat object, especially the ‘data’ slot
We can visualize the modification performed through
normalization using boxplots.
To understand why the LogNorm is not just a log-scale transformation of counts, we will display the raw counts in log scale, for comparison
Boxplot :
boxplot(
log10(as.matrix(
Seurat::GetAssayData(
object = sobj,
assay = 'RNA',
slot = 'counts'))[,1:10] +1),
main = 'Counts',
xlab = 'Cells',
ylab = 'Raw counts (log-scale)')
Normalization :
How ?
Description :
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[12508 x 4038] Range:[0.00-1588.00]
Counts : 14263194
Sparsity : 86.87712%
SLOT 2 : [data] Dims:[12508 x 4038] Range:[0.00-7.71]
Sparsity : 86.87712%
SLOT 3 : [scale.data] Dims:[0 x 0]
Boxplot :
boxplot(
as.matrix(
Seurat::GetAssayData(
object = sobj,
assay = 'RNA',
slot = 'data'))[,1:10],
main = 'Data (LogNorm)',
xlab = 'Cells',
ylab = 'LN values')
Visualization :
Normalization :
How ?
Description :
## Warning : here we're dealing with sobjSCT
EBAII.n1.SC.helper::seurat4_descriptor(
sobj = sobjSCT,
describe = 'assay')
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[ASSAYS]
ASSAY 1 : [RNA]
SLOT 1 : [counts] Dims:[12508 x 4038] Range:[0.00-1588.00]
Counts : 14263194
Sparsity : 86.87712%
SLOT 2 : [data] Dims:[12508 x 4038] Range:[0.00-7.71]
Sparsity : 86.87712%
SLOT 3 : [scale.data] Dims:[0 x 0]
ASSAY 2 : [SCT] [ACTIVE]
SLOT 1 : [counts] Dims:[12309 x 4038] Range:[0.00-510.00]
Counts : 9723658
Sparsity : 88.36753%
SLOT 2 : [data] Dims:[12309 x 4038] Range:[0.00-6.24]
Sparsity : 88.36753%
SLOT 3 : [scale.data] Dims:[3000 x 4038] Range:[-3.98-11.68]
Boxplot :
boxplot(as.matrix(
Seurat::GetAssayData(
object = sobjSCT, assay = 'SCT',
slot = 'data'))[,1:10], main = 'Data (SCTv2)',
xlab = 'Cells', ylab = 'SCTv2 values')
Visualization :
Question : Does the normalization method have a high impact on the cell space ?
For further steps, we will keep using the LN version.
that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:
Shifts the expression of each gene, so that the mean expression across cells is 0
Scales the expression of each gene, so that the variance across cells is 1
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
The results of this are stored in pbmc[["RNA"]]$scale.data
By default, only variable features are scaled.
You can specify the features argument to scale additional features
Normalization considered the homogenization of expression values at the cell level.
Despite this, the next important step that is dimensional reduction is heavily sensitive to variations of the range of expression that remain in the normalized expression data, and that would impair their expected behavior.
Thus, we need to transform data in a harder way just for this dimensional reduction step, which is performed through scaling :
This transformation gives equal weight to features in downstream analyses, so that highly-expressed genes do not dominate.
To be efficient in the single cell context, data scaling relies on a selection of features of importance.
By default, the metric used to select these is their contribution to the global variance of the expression matrix (then called “HVG”s, for “highly variable genes”).
Thus, the main parameter of importance at this step is the number of features used in this selection.
To assess this parameter importance, we will split the audience in 4 subgroups :
But first… how ?
Feature selection :
sobj <- Seurat::FindVariableFeatures(
object = sobj,
assay = 'RNA',
nfeatures = 500,
verbose = FALSE)
## Show the last selected features
str(rev(sobj@assays$RNA@var.features))
chr [1:500] "Gm2682" "Fdps" "Hist1h2bh" "Trav4-2" "Rpl10a" "Cxcl9" "Paics" ...
Scaling :
Description :
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[12508 x 4038] Range:[0.00-1588.00]
Counts : 14263194
Sparsity : 86.87712%
SLOT 2 : [data] Dims:[12508 x 4038] Range:[0.00-7.71]
Sparsity : 86.87712%
SLOT 3 : [scale.data] Dims:[500 x 4038] Range:[-2.55-10.00]
Visualization :
Feature selection :
sobj <- Seurat::FindVariableFeatures(
object = sobj,
assay = 'RNA',
nfeatures = 1000,
verbose = FALSE)
## Show the last selected features
str(rev(sobj@assays$RNA@var.features))
chr [1:1000] "Fmc1" "Spin2c" "Traj39" "Mrps16" "Hist1h4a" "Cd72" "Ywhae" ...
Scaling :
Description :
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[12508 x 4038] Range:[0.00-1588.00]
Counts : 14263194
Sparsity : 86.87712%
SLOT 2 : [data] Dims:[12508 x 4038] Range:[0.00-7.71]
Sparsity : 86.87712%
SLOT 3 : [scale.data] Dims:[1000 x 4038] Range:[-5.59-10.00]
Visualization :
Feature selection :
sobj <- Seurat::FindVariableFeatures(
object = sobj,
assay = 'RNA',
nfeatures = 2500,
verbose = FALSE)
## Show the last selected features
str(rev(sobj@assays$RNA@var.features))
chr [1:2500] "Rpgrip1" "St3gal6" "Zranb2" "Eif3f" "Dus1l" "Drg1" "Spats2l" ...
Scaling :
Description :
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[12508 x 4038] Range:[0.00-1588.00]
Counts : 14263194
Sparsity : 86.87712%
SLOT 2 : [data] Dims:[12508 x 4038] Range:[0.00-7.71]
Sparsity : 86.87712%
SLOT 3 : [scale.data] Dims:[2500 x 4038] Range:[-5.76-10.00]
Visualization :
Feature selection :
sobj <- Seurat::FindVariableFeatures(
object = sobj,
assay = 'RNA',
nfeatures = 5000,
verbose = FALSE)
## Show the last selected features
str(rev(sobj@assays$RNA@var.features))
chr [1:5000] "Zfp365" "Zdhhc13" "Ndufs5" "Hibch" "Nelfcd" "Mrpl38" "Crem" ...
Scaling :
Description :
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[ASSAYS]
ASSAY 1 : [RNA] [ACTIVE]
SLOT 1 : [counts] Dims:[12508 x 4038] Range:[0.00-1588.00]
Counts : 14263194
Sparsity : 86.87712%
SLOT 2 : [data] Dims:[12508 x 4038] Range:[0.00-7.71]
Sparsity : 86.87712%
SLOT 3 : [scale.data] Dims:[5000 x 4038] Range:[-5.87-10.00]
Visualization :
Actually, we will use the default value for LogNormalized data : 2000 HVGs
## Feature selection
sobj <- Seurat::FindVariableFeatures(
object = sobj,
assay = 'RNA',
nfeatures = 5000,
verbose = FALSE)
## Scaling
sobj <- Seurat::ScaleData(
object = sobj,
assay = 'RNA',
verbose = FALSE)
We can look at the scaled distribution :
We can now save the results of our hard work !
Rsession
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] S4Vectors_0.40.1 Biobase_2.62.0 BiocGenerics_0.48.0
loaded via a namespace (and not attached):
[1] matrixStats_1.0.0 spatstat.sparse_3.0-3
[3] bitops_1.0-7 webshot_0.5.5
[5] doParallel_1.0.17 httr_1.4.7
[7] RColorBrewer_1.1-3 Rgraphviz_2.46.0
[9] tools_4.3.1 sctransform_0.4.1
[11] backports_1.4.1 DT_0.30
[13] utf8_1.2.4 R6_2.5.1
[15] lazyeval_0.2.2 uwot_0.1.16
[17] GetoptLong_1.0.5 withr_2.5.2
[19] sp_2.1-1 prettyunits_1.2.0
[21] gridExtra_2.3 progressr_0.14.0
[23] cli_3.6.1 spatstat.explore_3.2-5
[25] TSP_1.2-4 labeling_0.4.3
[27] sass_0.4.7 topGO_2.54.0
[29] Seurat_4.4.0 spatstat.data_3.0-3
[31] genefilter_1.84.0 ggridges_0.5.4
[33] pbapply_1.7-2 foreign_0.8-85
[35] AnnotationForge_1.44.0 parallelly_1.36.0
[37] limma_3.58.1 rstudioapi_0.15.0
[39] RSQLite_2.3.2 shape_1.4.6
[41] GOstats_2.68.0 generics_0.1.3
[43] ica_1.0-3 spatstat.random_3.2-1
[45] crosstalk_1.2.0 dendextend_1.17.1
[47] dplyr_1.1.3 GO.db_3.18.0
[49] Matrix_1.6-1.1 fansi_1.0.5
[51] abind_1.4-5 lifecycle_1.0.3
[53] SoupX_1.6.2 yaml_2.3.7
[55] edgeR_4.0.1 SummarizedExperiment_1.30.2
[57] glmGamPoi_1.14.0 SparseArray_1.2.0
[59] BiocFileCache_2.10.1 Rtsne_0.16
[61] grid_4.3.1 blob_1.2.4
[63] promises_1.2.1 dqrng_0.3.1
[65] ExperimentHub_2.10.0 crayon_1.5.2
[67] shinydashboard_0.7.2 miniUI_0.1.1.1
[69] lattice_0.22-5 beachmat_2.18.0
[71] cowplot_1.1.1 annotate_1.80.0
[73] KEGGREST_1.42.0 ComplexHeatmap_2.18.0
[75] pillar_1.9.0 knitr_1.45
[77] metapod_1.10.0 GenomicRanges_1.54.1
[79] rjson_0.2.21 future.apply_1.11.0
[81] codetools_0.2-19 leiden_0.4.3
[83] glue_1.6.2 data.table_1.14.8
[85] vctrs_0.6.4 png_0.1-8
[87] spam_2.10-0 gtable_0.3.4
[89] assertthat_0.2.1 cachem_1.0.8
[91] xfun_0.41 pcaExplorer_2.28.0
[93] S4Arrays_1.2.0 mime_0.12
[95] survival_3.5-7 pheatmap_1.0.12
[97] seriation_1.5.1 SingleCellExperiment_1.22.0
[99] iterators_1.0.14 statmod_1.5.0
[101] bluster_1.12.0 interactiveDisplayBase_1.40.0
[103] ellipsis_0.3.2 fitdistrplus_1.1-11
[105] ROCR_1.0-11 Category_2.68.0
[107] nlme_3.1-163 bit64_4.0.5
[109] threejs_0.3.3 progress_1.2.2
[111] filelock_1.0.2 RcppAnnoy_0.0.21
[113] GenomeInfoDb_1.38.0 bslib_0.5.1
[115] irlba_2.3.5.1 KernSmooth_2.23-22
[117] rpart_4.1.21 colorspace_2.1-0
[119] DBI_1.1.3 Hmisc_5.1-1
[121] celldex_1.12.0 nnet_7.3-19
[123] DESeq2_1.42.0 tidyselect_1.2.0
[125] bit_4.0.5 compiler_4.3.1
[127] curl_5.1.0 graph_1.80.0
[129] htmlTable_2.4.2 BiocNeighbors_1.20.0
[131] SparseM_1.81 xml2_1.3.5
[133] DelayedArray_0.28.0 plotly_4.10.3
[135] checkmate_2.2.0 scales_1.2.1
[137] lmtest_0.9-40 RBGL_1.78.0
[139] NMF_0.26 rappdirs_0.3.3
[141] stringr_1.5.0 digest_0.6.33
[143] goftest_1.2-3 shinyBS_0.61.1
[145] spatstat.utils_3.0-4 rmarkdown_2.25
[147] ca_0.71.1 XVector_0.42.0
[149] htmltools_0.5.6.1 pkgconfig_2.0.3
[151] base64enc_0.1-3 SingleR_2.4.0
[153] sparseMatrixStats_1.14.0 MatrixGenerics_1.14.0
[155] highr_0.10 dbplyr_2.4.0
[157] fastmap_1.1.1 GlobalOptions_0.1.2
[159] rlang_1.1.1 htmlwidgets_1.6.2
[161] shiny_1.7.5.1 DelayedMatrixStats_1.24.0
[163] farver_2.1.1 jquerylib_0.1.4
[165] zoo_1.8-12 jsonlite_1.8.7
[167] BiocParallel_1.36.0 BiocSingular_1.18.0
[169] RCurl_1.98-1.12 magrittr_2.0.3
[171] Formula_1.2-5 scuttle_1.12.0
[173] GenomeInfoDbData_1.2.11 dotCall64_1.1-0
[175] patchwork_1.1.3 munsell_0.5.0
[177] Rcpp_1.0.11 viridis_0.6.4
[179] reticulate_1.34.0 stringi_1.7.12
[181] zlibbioc_1.48.0 MASS_7.3-60
[183] AnnotationHub_3.10.0 plyr_1.8.9
[185] parallel_4.3.1 listenv_0.9.0
[187] ggrepel_0.9.4 deldir_1.0-9
[189] Biostrings_2.70.1 splines_4.3.1
[191] tensor_1.5 circlize_0.4.15
[193] hms_1.1.3 locfit_1.5-9.8
[195] igraph_1.5.1 spatstat.geom_3.2-7
[197] rngtools_1.5.2 reshape2_1.4.4
[199] biomaRt_2.58.0 stats4_4.3.1
[201] ScaledMatrix_1.10.0 BiocVersion_3.18.0
[203] XML_3.99-0.14 evaluate_0.23
[205] SeuratObject_5.0.0 BiocManager_1.30.22
[207] scran_1.30.0 foreach_1.5.2
[209] httpuv_1.6.12 RANN_2.6.1
[211] tidyr_1.3.0 purrr_1.0.2
[213] polyclip_1.10-6 clue_0.3-65
[215] heatmaply_1.5.0 future_1.33.0
[217] scattermore_1.2 ggplot2_3.4.4
[219] gridBase_0.4-7 rsvd_1.0.5
[221] xtable_1.8-4 later_1.3.1
[223] viridisLite_0.4.2 tibble_3.2.1
[225] EBAII.n1.SC.helper_0.0.2 registry_0.5-1
[227] memoise_2.0.1 AnnotationDbi_1.64.0
[229] IRanges_2.36.0 cluster_2.1.4
[231] globals_0.16.2 GSEABase_1.64.0
[233] shinyAce_0.4.2