This file describes the different steps to perform first part of data processing for the single cell RNAseq data analysis training course for the EBAII n1 2023, covering these steps :
First, we will set some parameters
## Fixed seed
my_seed <- 1337
## Your project name on the IFB cluster (CHANGE IT FOR YOURS !)
my_project <- 'golf'
You can load the latest Seurat object you saved as a RDS at the preceding step (Proc.1 : normalization with LN, and scaling using 2000 HVGs) :
This step originates from the observation that we do not want nor need to characterize each of our thousends of cells, but groups of them (clusters ? cell types ? other ?). Thus, we do no need all data, and even may benefit from such a reduction :
Here, we will use the grand-mother of all : the PCA (Principal Component Analysis)
But how ? (did you see it coming ?)
Question 1 : How many principal components (PC) will be generated by default ?
## . The answer is 50 (npcs parameter)
## . We will use this default value.
## . Warning : in some (rare) contexts, this
## may not be enough !
Question 2 : Which data type (ie, which Seurat object slot) will be used to generate the components ?
## . Data from the @scale.data will be used
## . This is unfortunately not explicit
## from the Seurat::RunPCA help page !
Perform PCA on our data
Visualization of the very first two components :
Question 1 : Give me your interpretation / feelings from this plot !
Question 2 : Should we stop at using 2 dimensions to interpret our data ?
Description :
OBJECT VERSION : 5.0.0
PROJECT : [TD3A]
[DIMREDS]
DIMRED 1 : [pca] Dims:[4038 x 50]
This final processing step required to finaly observe our data requires a novel dimension reduction method with a very high challeng to overcome : reduce a space of dozens of dimensions to just 2, or 3 !
We will use the UMAP method.
BUT HOW ? (I’m pretty sure you saw it coming this time)
So, we now have to choose the number of PCA dimensions to use for this UMAP reduction.
Question : Do you have an idea of this number ?
## . Impossible to guess with our current knowledge.
## . But we can get some help from the PCA data itself
## . If you said a value above the 50 components we
## generated for our PCA, you should wear the
## cone of shame !
We will use a very simple graphical method : the observation of the amount of global variance explained by each component.
Question : Any more precise idea ?
To demonstrate the effect of the number of PC dimensions used as input to the UMAP generation, I will perform a DEMO using 4 different PC values : 3, 7, 23 and 49.
Question : Your conclusion ?
## . Very few PCs are not able to retrieve
## a sufficiently defined structure.
## . The differences between 25 and 49 are
## limited in the global structure.
## This may imply that the additional
## components above 25 do not add enough
## noise to alter this structure.
You can now perform a final UMAP with the PC dimensions of your choice.
For the next steps of the training, I used 20 dimensions.
You can now save the results of your hard work :
While by default Seurat::RunUMAP will produce 2-dimension reductions, the method can generate further components.
Despite our limited brain, this is sometimes interesting and useful to attempt a reduction to 3 dimensions instead of 2. This can be very effective when looking for trajectories.
We can generate a UMAP from 25 PCs, requesting 3 UMAP components :
## UMAP from 25 PCs, 3 components requested
sobj <- Seurat::RunUMAP(
object = sobj, assay = 'RNA',
graph.name = 'RNA_snn',
reduction = 'pca', dims = 1:25,
seed.use = my_seed,
n.components = 3)
## DimPlot of the first 2 UMAP components
Seurat::DimPlot(
object = sobj,
reduction = 'umap')
Question : Isn’t there something striking ?
## . The plot is not the same as when
## using 25 PCs and requesting 2 UMAP
## components instead of 3 here !
## . The 2 components of a 2D UMAP are not
## the same as the two first components
## of a dim>2 UMAP.
Let’s perform a 3D representation of our UMAP
## Structure data to plot in a data.frame
df3d <- as.data.frame(
sobj@reductions$umap@cell.embeddings)
## 3D plot
plotly::plot_ly(
data = df3d, x = ~UMAP_1,
y = ~UMAP_2, z = ~UMAP_3,
type = 'scatter3d',
marker = list(size = 2, width=2))
Rsession
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biobase_2.62.0 BiocGenerics_0.48.0
loaded via a namespace (and not attached):
[1] matrixStats_1.0.0 spatstat.sparse_3.0-3
[3] bitops_1.0-7 webshot_0.5.5
[5] doParallel_1.0.17 httr_1.4.7
[7] RColorBrewer_1.1-3 Rgraphviz_2.46.0
[9] tools_4.3.1 sctransform_0.4.1
[11] backports_1.4.1 DT_0.30
[13] utf8_1.2.4 R6_2.5.1
[15] lazyeval_0.2.2 uwot_0.1.16
[17] GetoptLong_1.0.5 withr_2.5.2
[19] sp_2.1-1 prettyunits_1.2.0
[21] gridExtra_2.3 progressr_0.14.0
[23] cli_3.6.1 spatstat.explore_3.2-5
[25] TSP_1.2-4 topGO_2.54.0
[27] labeling_0.4.3 sass_0.4.7
[29] Seurat_4.4.0 spatstat.data_3.0-3
[31] genefilter_1.84.0 ggridges_0.5.4
[33] pbapply_1.7-2 foreign_0.8-85
[35] AnnotationForge_1.44.0 parallelly_1.36.0
[37] limma_3.58.1 rstudioapi_0.15.0
[39] RSQLite_2.3.2 shape_1.4.6
[41] GOstats_2.68.0 generics_0.1.3
[43] crosstalk_1.2.0 ica_1.0-3
[45] spatstat.random_3.2-1 dendextend_1.17.1
[47] dplyr_1.1.3 GO.db_3.18.0
[49] Matrix_1.6-1.1 fansi_1.0.5
[51] S4Vectors_0.40.1 abind_1.4-5
[53] lifecycle_1.0.3 SoupX_1.6.2
[55] yaml_2.3.7 edgeR_4.0.1
[57] SummarizedExperiment_1.30.2 SparseArray_1.2.0
[59] BiocFileCache_2.10.1 Rtsne_0.16
[61] grid_4.3.1 blob_1.2.4
[63] promises_1.2.1 dqrng_0.3.1
[65] ExperimentHub_2.10.0 crayon_1.5.2
[67] shinydashboard_0.7.2 miniUI_0.1.1.1
[69] lattice_0.22-5 beachmat_2.18.0
[71] cowplot_1.1.1 annotate_1.80.0
[73] KEGGREST_1.42.0 ComplexHeatmap_2.18.0
[75] pillar_1.9.0 knitr_1.45
[77] metapod_1.10.0 GenomicRanges_1.54.1
[79] rjson_0.2.21 future.apply_1.11.0
[81] codetools_0.2-19 leiden_0.4.3
[83] glue_1.6.2 data.table_1.14.8
[85] vctrs_0.6.4 png_0.1-8
[87] spam_2.10-0 gtable_0.3.4
[89] assertthat_0.2.1 cachem_1.0.8
[91] xfun_0.41 pcaExplorer_2.28.0
[93] S4Arrays_1.2.0 mime_0.12
[95] survival_3.5-7 pheatmap_1.0.12
[97] seriation_1.5.1 SingleCellExperiment_1.22.0
[99] iterators_1.0.14 statmod_1.5.0
[101] bluster_1.12.0 interactiveDisplayBase_1.40.0
[103] ellipsis_0.3.2 fitdistrplus_1.1-11
[105] ROCR_1.0-11 Category_2.68.0
[107] nlme_3.1-163 bit64_4.0.5
[109] threejs_0.3.3 progress_1.2.2
[111] filelock_1.0.2 RcppAnnoy_0.0.21
[113] GenomeInfoDb_1.38.0 bslib_0.5.1
[115] irlba_2.3.5.1 KernSmooth_2.23-22
[117] rpart_4.1.21 colorspace_2.1-0
[119] DBI_1.1.3 Hmisc_5.1-1
[121] celldex_1.12.0 nnet_7.3-19
[123] DESeq2_1.42.0 tidyselect_1.2.0
[125] bit_4.0.5 compiler_4.3.1
[127] curl_5.1.0 graph_1.80.0
[129] htmlTable_2.4.2 BiocNeighbors_1.20.0
[131] SparseM_1.81 xml2_1.3.5
[133] DelayedArray_0.28.0 plotly_4.10.3
[135] checkmate_2.2.0 scales_1.2.1
[137] lmtest_0.9-40 RBGL_1.78.0
[139] NMF_0.26 rappdirs_0.3.3
[141] stringr_1.5.0 digest_0.6.33
[143] goftest_1.2-3 shinyBS_0.61.1
[145] spatstat.utils_3.0-4 rmarkdown_2.25
[147] ca_0.71.1 XVector_0.42.0
[149] htmltools_0.5.6.1 pkgconfig_2.0.3
[151] base64enc_0.1-3 SingleR_2.4.0
[153] sparseMatrixStats_1.14.0 MatrixGenerics_1.14.0
[155] highr_0.10 dbplyr_2.4.0
[157] fastmap_1.1.1 GlobalOptions_0.1.2
[159] rlang_1.1.1 htmlwidgets_1.6.2
[161] shiny_1.7.5.1 DelayedMatrixStats_1.24.0
[163] farver_2.1.1 jquerylib_0.1.4
[165] zoo_1.8-12 jsonlite_1.8.7
[167] BiocParallel_1.36.0 BiocSingular_1.18.0
[169] RCurl_1.98-1.12 magrittr_2.0.3
[171] Formula_1.2-5 scuttle_1.12.0
[173] GenomeInfoDbData_1.2.11 dotCall64_1.1-0
[175] patchwork_1.1.3 munsell_0.5.0
[177] Rcpp_1.0.11 viridis_0.6.4
[179] reticulate_1.34.0 stringi_1.7.12
[181] zlibbioc_1.48.0 MASS_7.3-60
[183] AnnotationHub_3.10.0 plyr_1.8.9
[185] parallel_4.3.1 listenv_0.9.0
[187] ggrepel_0.9.4 deldir_1.0-9
[189] Biostrings_2.70.1 splines_4.3.1
[191] tensor_1.5 circlize_0.4.15
[193] hms_1.1.3 locfit_1.5-9.8
[195] igraph_1.5.1 spatstat.geom_3.2-7
[197] rngtools_1.5.2 reshape2_1.4.4
[199] biomaRt_2.58.0 stats4_4.3.1
[201] ScaledMatrix_1.10.0 BiocVersion_3.18.0
[203] XML_3.99-0.14 evaluate_0.23
[205] SeuratObject_5.0.0 BiocManager_1.30.22
[207] scran_1.30.0 foreach_1.5.2
[209] httpuv_1.6.12 RANN_2.6.1
[211] tidyr_1.3.0 purrr_1.0.2
[213] polyclip_1.10-6 clue_0.3-65
[215] heatmaply_1.5.0 future_1.33.0
[217] scattermore_1.2 ggplot2_3.4.4
[219] gridBase_0.4-7 rsvd_1.0.5
[221] xtable_1.8-4 later_1.3.1
[223] viridisLite_0.4.2 tibble_3.2.1
[225] EBAII.n1.SC.helper_0.0.2 registry_0.5-1
[227] memoise_2.0.1 AnnotationDbi_1.64.0
[229] IRanges_2.36.0 cluster_2.1.4
[231] globals_0.16.2 GSEABase_1.64.0
[233] shinyAce_0.4.2