logo



1 A bit of theory

1.1 Why ?

Because …

  • … cells in a defined environment (body, organoid, …) are :
    1. possibly of different kinds (cell types)
    2. able to temporaly evolve from a state to another (ex : stem to differentiated)
    3. not (all) synchronous …
  • … and a single cell experiment is an observation of numerous cells of various types in a common context :
    1. Various types : states ?
    2. Common context : lineage ?
    3. Numerous : internal variation in each state ?
logo

1.2 How ?

1.2.1 Main steps

The basic principle can be divided into 3 main steps :

  • Inferring a schematic representation of the transitional behavior of some cells : deciphering a topology from the data
    • Which data ?
      • Cell by feature matrix ? (spoiler : NO)
      • A reduced space ?
        • Which method ?
        • Number of dimensions ?
    • Which topology ?
      • Linear ?
      • Circular ?
      • Poly-cyclic (ie, nested cycles)?
      • Bifurcations ?
      • Multi-furcations ?
      • Simple / complex tree ?
      • A combination of some of all above ?
      • Single one / multiple disconnected ?
      • Oriented or not (rooting) ?
logo


  • Positioning each cell in the defined topology : defining cells pseudotimes
    • Directly / Proxy (clusters) ?
    • Direction : single / multiple roots ? Which one(s) ?

1.2.2 Inferring a topology

  • Most methods rely on identifying a minimal spanning tree (MST), with specific variations (to allow cycles, by example) and tweaks (branch pruning).

  • The key idea in MST is to define the simplest path that connects all cells (vertices) using ponderated edges (their distance in the reduced space)

logo


1.2.3 Defining cells pseudotime

Once a topology has been created, pseudotime of a cell consists in the rank of the cell projection to its closest topology backbone point

logo


1.2.4 Historically

  • First methods (circa 2014)
    • Strong priors :
      • Reduced space (PCA)
      • Clustering results (~ cell states) or inferred cell types
      • Expected topology type
        • Single one
        • Simple : linear (direct or cyclic), bifurcation tree
  • Evolution :
    • Lower priors (just space +/- clusters +/- root)
      • No mandatory input topology type
    • Better-fited types of reduced space (tSNE, uMAP, LLE, Diffusion maps, Poincarre maps, GPLVM, …)

1.2.5 Methods : Welcome to the dynverse !

Inferring any kind of topology is a complex task :

“It is perhaps surprising that of the 59 methods in existence today [at the time of the publication], almost all methods have a unique combination of [these prior + reduced space + topology inferring method] characteristics” (dynverse publication, 2018)

  • Multiple (dozens) of methods are available, either through R or python packages, or independent binaries.

  • Wouter Saelens and Robrecht Cannoodt did a tremendous work setting up the dynverse, a framework that :

    • re-implements/wraps 55 existing methods (dynmethods)
    • harmonizes them in a common way of setting inputs and parameters, getting results (dyno, dynparam)
    • is compatible with any novel method for new developers (dynwrap)
    • contains metrics to compare methods
    • allows a comparison of numerous methods (dyneval)
    • is available as reference map to help select the best method based on user context (dynguidelines)
dynguidelines::guidelines_shiny()
logo


2 Practice

2.1 Prerequisites

2.1.1 This Rmarkdown

Copy this Rmd file (and its dependences) from the common ‘Courses’ dir to your working directory :

## Working directory
user.id <- Sys.getenv('USER')
root.dir <- '/shared/projects/sincellte_2022'
script.dir <- paste(c(root.dir, 'Courses', 'Secondary_analyses',
                      'scripts'), collapse = '/')
work.dir <- paste(c(root.dir, user.id, 'Courses', 'Secondary_analysis',
  'trajectory_analysis'), collapse = '/')
dir.create(path = paste(c(work.dir, 'trajectory_rmd_resources'), collapse = '/'), 
           recursive = TRUE, showWarnings = FALSE)
## Copying the Rmd
file.copy(from = paste0(script.dir, '/Trajectory_TD_deliverable.Rmd'),
          to = work.dir, overwrite = TRUE)
## Copying the dependencies
file.copy(from = list.files(
  path = paste0(script.dir, '/trajectory_rmd_resources/'), full.names = TRUE), 
  to = paste0(work.dir, '/trajectory_rmd_resources/'), overwrite = TRUE)
## Moving to your working directory
setwd(work.dir)
## If you want to visualize your files in Rstudio, you can click the little greyish/white arrow on the right of the path written at the top of your Rstudio Console.

2.1.2 The dataset

  • This practice is inspired by this session from this Galaxy training
  • The dataset you will use :
    • Comes from [Bacon et al, Frontiers in Immunology, 2018].
    • Is available as a scanpy h5ad object (tool-specific format based upon the HDF5 standard)
    • Is hosted at Zenodo #4726927
    • Consists in a drop-seq preparation of thymic cells in differenciation from 7 neonatal mice, selected by mass flow cytometry using 22 immunologic markers corresponding to multiple T-cell types and subtypes (see embedded Excel file)
    • For this practical session, will focus on these subtypes :
      • Double-negative (DN) : Early immature cells
      • Double-positive Module 1 (DP-M1) : Intermediate state 1
      • Double-positive Module 2 (DP-M2) : Intermediate state 2
      • Double-positive Module 3 (DP-M3) : Intermediate state 3
      • Double-positive Module 4 (DP-M4) : Intermediate state 4
      • Double-positive Late (DP-L) : Immature late state
      • Mature T-cells (T-mat)

Download markers_celltypes.xlsx

2.1.3 The environment

The R packages needed for this tutorial are :

Additional packages needed to build this HTML report :

  • CRAN :
    • gifski To generate GIF animations of markers expression
    • knitr To set the Rmarkdown hook to compute CPU time of chunks
    • Hmisc To set the button-hidden TOC

2.2 Preprocessing steps

This is just a description of the preparation I performed on the data, building the the object you will work on. This is not part of the practical session.

Unfold

2.2.1 Setting our variables

## Remote input file
remote.h5ad.file <- 'https://zenodo.org/record/4726927/files/Final_cell_annotated_object.h5ad?download=1'
## Working directory
user.id <- Sys.getenv('USER')
root.dir <- '/shared/projects/sincellte_2022'
data.dir <- paste(c(root.dir, 'Courses', 'Secondary_analyses', 'input', 'trajectory_analysis'), collapse = '/')
work.dir <- paste(c(root.dir, user.id, 'Courses', 'Secondary_analysis', 'trajectory_analysis'), collapse = '/')
## Seurat assay name
assay <- 'RNA'
## Contaminating cell types
contam.types <- c('Macrophages', 'RBC')
## RNG seed (for irlba PCA)
my.seed <- 1337L
## HVGs to keep
nfeatures <- 3000
## 1st reduction method
reduction <- 'pca'
## Number of PCs to produce
pca.dims <- 100
## Number of PCs to retain for 2nd reduction
maxdim <- 30
## Use scaled data instead of unscaled
use.scaled <- TRUE
## Cell type to use as root to orient the trajectory
root.group <- 'DN'

2.2.2 Loading source data

  • First, the dataset was retrieved from Zenodo
dir.create(path = work.dir, recursive = TRUE, showWarnings = FALSE)
local.root.file <- paste0(work.dir, '/Original_scanpy_object.h5')
local.h5ad.file <- paste0(local.root.file, 'ad')
download.file(url = h5ad.file, destfile = local.h5ad.file,
              method = 'wget', quiet = TRUE)
SeuratDisk::Convert(source = local.h5ad.file, dest = 'h5seurat',
                    assay = assay, verbose = FALSE, overwrite = TRUE)
  • The converted object was loaded into R as a SeuratObject using the SeuratDisk::LoadH5Seurat() function. All optional slots (neighbors, graphs, reductions, images), won’t be loaded as we want to reprocess them.
local.h5s.file <- paste0(local.root.file, 'seurat')
sobj <- SeuratDisk::LoadH5Seurat(file = local.h5s.file, assays = assay,
                                 reductions = FALSE, graphs = FALSE,
                                 neighbors = FALSE, images = FALSE)

2.2.3 Prepping the object to our context

  • The downloaded object contains cell-type annotation from the original analysis. From the publication, we learnt that two cell types are not related to T cells, thus considered as contaminants : macrophages and red blood cells (RBC). As their presence in the analysis space can modify the shape of the cells position, thus impact trajectory analysis, we decided to remove them :
sobj <- sobj[, !(sobj$cell_type %in% contam.types)]
  • We then follow the Seurat4 canonical pipeline, applying log-normalization to the count data …
sobj <- Seurat::NormalizeData(object = sobj, assay = assay)
  • … then scaling. As this object is a merge of 7 experiments (labelled as ‘batch’ in the object metadata), we want to regress this effect.
sobj <- Seurat::ScaleData(object = sobj, assay = assay,
                          vars.to.regress = c('batch'))
  • … then identify HVGs. As we focus on subtypes of a single, main cell type (T-cells) in a differentiation context, we anticipate that we may have limited variation of gene expression (at least compared to, by example, multiple cell types of a differentiated organ). Thus, we increase the default number of features to output from 2,000 to 3,000.
sobj <- Seurat::FindVariableFeatures(object = sobj, assay = assay,
                                     verbose = FALSE, nfeatures = nfeatures)
  • … then reducing dimensions (through PCA). We request 100 PCs as output.
sobj <- Seurat::RunPCA(object = sobj, assay = assay, npcs = pca.dims,
                       reduction.name = reduction, seed.use = my.seed,
                       verbose = FALSE)
  • … then building additional reductions for exploration / visualization (t-SNE, uMAP). After manual exploration, we decided to retain 30 PCs as input.
## TSNE
reduction2t <- paste0(reduction, '.', maxdim, '_tsne')
sobj <- Seurat::RunTSNE(object = sobj, dims = 1:maxdim, assay = assay,
                        reduction = reduction, seed.use = my.seed,
                        verbose = FALSE, reduction.name = reduction2t)
## uMAP
reduction2u <- paste0(reduction, '.', maxdim, '_umap')
sobj <- Seurat::RunUMAP(object = sobj, dims = 1:maxdim, assay = assay,
                        reduction = reduction, seed.use = my.seed,
                        verbose = FALSE, reduction.name = reduction2u)
Seurat::DimPlot(object = sobj, reduction = reduction2u, group.by = 'cell_type',
                seed = my.seed)
  • Lastly, we save the processed object (as a RDS file).
saveRDS(object = sobj, file = paste0(work.dir,'/Thymus_neonat_mm_Seurat_ENSEMBL.RDS'),
        compress = 'bzip2')

2.3 Trajectory analysis using TInGa

We will use a tool named TInGa to perform this analysis, due to its convenience to our educational purpose :

  • It’s in full R
  • It’s very fast
  • Its parameters are limited (as compared to other renowned tools like Monocle3)
  • It is compatible with most topologies (see here)
  • It’s integrated in the versatile dynverse (as a plugin)

2.3.1 Setting parameters

## Working directory
user.id <- Sys.getenv('USER')
root.dir <- '/shared/projects/sincellte_2022'
data.dir <- paste(c(root.dir, 'Courses', 'Secondary_analyses', 'input',
                    'trajectory_analysis'), collapse = '/')
work.dir <- paste(c(root.dir, user.id, 'Courses', 'Secondary_analysis',
                    'trajectory_analysis'), collapse = '/')
if(!dir.exists(work.dir)) dir.create(path = work.dir, recursive = TRUE)
## Local input file
## The file created if you performed the preprocessing by yourself
# local.Seurat.file <- paste0(work.dir, '/Thymus_neonat_mm_Seurat_ENSEMBL.RDS')
## The preprocessed file I prepared for you
local.Seurat.file <- paste0(data.dir, '/Thymus_neonat_mm_Seurat_ENSEMBL.RDS')
## Seurat assay name
assay <- 'RNA'
## RNG seed (for irlba PCA)
my.seed <- 1337L
## 1st reduction method
reduction <- 'pca'
## 2nd reduction method
reduction2 <- 'pca.30_umap'
## Use scaled data instead of unscaled
use.scaled <- TRUE
## Cell type to use as root to orient the trajectory
root.group <- 'DN'
## Marker genes to plot
trans.markers <- c('Il2ra', 'Npm1', 'Nusap1', 'Ube2c',
                   'Pclaf','Cd8b1', 'Itm2a', 'Cd52', 'H2-D1')

2.3.2 Loading source data

We will simply load the pre-processed Seurat object, contained in a RDS archive, as input :

sobj <- readRDS(file = local.Seurat.file)

Time for this code chunk to run : 1.305 s

2.3.3 Performing trajectory analysis

2.3.3.1 The reduced space

First, to impregnate us with the cells topology on which TInGa will work, we can display our 2D dimension reduction (uMAP or t-SNE) :

## With batches
db <- Seurat::DimPlot(object = sobj, reduction = reduction2, group.by = 'batch',
                      pt.size = 2) + Seurat::DarkTheme()
## With cell types
dct <- Seurat::DimPlot(object = sobj, reduction = reduction2, group.by = 'cell_type',
                       pt.size = 2) + Seurat::DarkTheme()
## With sex
ds <- Seurat::DimPlot(object = sobj, reduction = reduction2, group.by = 'sex',
                      pt.size = 2) + Seurat::DarkTheme()
print(patchwork::wrap_plots(list(db, dct, ds)))

2.3.3.2 Extracting data

Using TInGa first requires to build a specific object from the count and normalized matrices, that we will extract from our Seurat object :

## Matrices are required to be formatted as [cells] x [features], thus a transposition is required
## Normalized matrix
my.exp <- t(as.matrix(if(use.scaled) sobj@assays[[assay]]@scale.data
                      else sobj@assays[[assay]]@data)) 
### Ordering to keep synch in the [use.scaled] context
my.exp <- my.exp[, order(colnames(my.exp))]
## Count matrix
my.counts <- t(as.matrix(if(use.scaled) sobj@assays[[assay]]@counts[rownames(sobj@assays[[assay]]@counts)%in% colnames(my.exp),]
                         else sobj@assays[[assay]]@counts))
### Ordering to keep synch in the [use.scaled] context
my.counts <- my.counts[, order(colnames(my.counts))]

2.3.3.3 Building a dynwrap object

We can now build our object (a dynwrap object, which actually consists in a list with mandatory entries) :

my.data <- dynwrap::wrap_expression(
  id = sobj@project.name,
  expression = my.exp,
  counts = my.counts
)
is(my.data)
## [1] "dynwrap::with_expression"

Time for this code chunk to run : 2.55 s

2.3.3.4 Import prior

We can import our reduced space (t-SNE or uMAP) as a prior, which will be used by TInGa to build the trajectory :

my.data <- dynwrap::add_prior_information(
  dataset = my.data,
  dimred = sobj@reductions[[reduction2]]@cell.embeddings
)

2.3.3.5 Running TInGa

Now, we can run TInGa !

## Performing TInGa analysis
set.seed(my.seed)
my.traj <- dynwrap::infer_trajectory(dataset = my.data, method = TInGa::gng_param2(),
                                     seed = my.seed, give_priors = 'dimred',
                                     parameters = list(max_nodes = 8, lambda = 200))
## Adding pseudotime
my.traj <- dynwrap::add_pseudotime(trajectory = my.traj)
## Warning in calculate_pseudotime(trajectory): Trajectory is not rooted. Add a
## root to the trajectory using dynwrap::add_root(). This will result in an error
## in future releases.
## root cell or milestone not provided, trying first outgoing milestone_id
## Using '1' as root
## Plotting with pseudotime
dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE,
                     color_cells = 'pseudotime') + Seurat::DarkTheme()
## Registered S3 method overwritten by 'cli':
##   method     from         
##   print.boxx spatstat.geom

Time for this code chunk to run : 18.4 s

2.3.3.6 Adding cell types

We can import our cell types (as a metadata for later plots) in the trajectory object :

## Plotting cell types as grouping
my.traj <- dynwrap::add_grouping(dataset = my.traj,
                                 grouping = as.character(sobj$cell_type))
## Plotting with cell types
dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE,
                     color_cells = "grouping") + Seurat::DarkTheme()

2.3.3.7 Adding a root

  • As we know our root cell type (DN), we can give a direction to the trajectory. However, TInGa requires a unique root cell, not a root cluster… Thus, we will first identify a putative root cell by computing the centroid of the DN cells in the reduced space :
if (!is.null(root.group)) {
  ## Identifying cells of the root group
  cells_in_root.group <- which(my.traj$grouping == root.group)
  ## Computing the root group medoid in the reduced space
  root_group.medoid <- apply(my.traj$dimred[cells_in_root.group,], 2, median)
  ## Computing the root group cells distance to root group medoid
  cells.dist_to_medoid <- vapply(cells_in_root.group, function(x) { norm(as.matrix(my.traj$dimred[x,] - root_group.medoid), 'F')}, .1)
  ## Identifying the root cell
  root.cell.id <- my.traj$cell_info$cell_id[cells_in_root.group[which.min(cells.dist_to_medoid)]]
  message(paste0(root.group, ' root cell : ', root.cell.id))
  ## Plotting cell types
  dct <- Seurat::DimPlot(object = sobj, reduction = reduction2, 
                         group.by = 'cell_type', pt.size = 2) + Seurat::DarkTheme()
  ## Plotting the root cell
  sobj$root_cell <- FALSE
  sobj$root_cell[colnames(sobj) == root.cell.id] <- TRUE
  drc <- Seurat::DimPlot(object = sobj, reduction = reduction2, 
                         group.by = 'root_cell', pt.size = c(rep(2, ncol(sobj)-1), 5),
                         order = TRUE) + Seurat::DarkTheme()
  print(patchwork::wrap_plots(list(dct, drc)))
}
## DN root cell : AGGGCGTAGTTA-0

Time for this code chunk to run : 2.057 s

  • Rooted plots
## Adding root cell
my.traj <- dynwrap::add_root(trajectory = my.traj, root_cell_id = root.cell.id)
## Plotting with cell types
dct <- dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                            color_cells = "grouping") + Seurat::DarkTheme()
## Plotting with pseudotime
dpt <- dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                            color_cells = "pseudotime") + Seurat::DarkTheme()
print(patchwork::wrap_plots(list(dct, dpt)))

2.3.3.8 Plotting marker genes

  • First, we’ll replace EnsemblIDs with MGI Symbols (easier to handle)
## Injecting symbols
symbvec <- as.character(sobj@assays$RNA@meta.features$Symbol)
names(symbvec) <- sobj@assays$RNA@meta.features$ID
colnames(my.data$counts) <- colnames(my.data$expression) <- unname(symbvec[colnames(my.data$expression)])
  • Plotting markers :
    • From milestone 1 to 3 :
## Milestones 1 to 3 ...
tmplots <- sapply(trans.markers[1:4], function(x) {
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme()
  }, simplify = FALSE)
print(patchwork::wrap_plots(tmplots))

Time for this code chunk to run : 39.59 s

  • From milestone 1 to 2 :
## Milestones 1 to 2 ...
tmplots <- sapply(trans.markers[c(1:2,5:9)], function(x) {
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme()
  }, simplify = FALSE)
print(patchwork::wrap_plots(tmplots))

Time for this code chunk to run : 1.126 s

2.3.3.9 Animated !

  • From milestone 1 to 3 :
## Milestones 1 to 3 ...
for (x in trans.markers[1:4]) print(
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme())
anim2a


  • From milestone 1 to 2 :
## Milestones 1 to 2 ...
for (x in trans.markers[c(1:2,5:9)]) print(
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme())
anim2b


3 Trajectory analysis using Slingshot

  • Slingshot is a way more commonly used trajectory inference tool than TInGa (that I chose for educational purpose, among other things).

  • Like TInGa, it’s a full-R, easy to use tool that is compatible with most trajectory topologies.

  • However, it does have a key difference : compared to TInGa, it additionally requires a cell clustering results. This may be either an advantage (if you are confident in the capacity of you clustering results to reflect the biology, espacially the cells lineage) or a con (if you are not).

  • Running Slingshot

## Run slingshot with root cluster
sling.res <- slingshot::slingshot(data = sobj@reductions$pca.30_umap@cell.embeddings,
                                  clusterLabels = sobj$cell_type, start.clus = 'DN')
## Plot results (lineage)
plot(sobj@reductions$pca.30_umap@cell.embeddings, 
     col = RColorBrewer::brewer.pal(9,"Set1")[sobj$cell_type], 
     asp = 1, pch = 16)
lines(slingshot::as.SlingshotDataSet(sling.res), type = 'lineages')

## Plot results (fitted curve)
plot(sobj@reductions$pca.30_umap@cell.embeddings, 
     col = RColorBrewer::brewer.pal(9,"Set1")[sobj$cell_type], 
     asp = 1, pch = 16)
lines(slingshot::as.SlingshotDataSet(sling.res))

Time for this code chunk to run : 14.77 s

Slingshot identified 3 lineages, much in concordance with TInGa results !

  • Getting cells pseudotimes (per lineage)
## Getting pseudotimes
my.pseudo <- slingshot::slingPseudotime(x = sling.res)
pal <- viridis::viridis(100, end = 0.95)

oripar <- par(no.readonly = TRUE)
par(mfrow = c(1,3))
for (lin.x in 1:ncol(my.pseudo)) {
  colors <- pal[cut(my.pseudo[,lin.x], breaks = 100)]
  plot(sobj@reductions$pca.30_umap@cell.embeddings, col = colors, 
       pch = 16, cex = 0.5, main = colnames(my.pseudo)[lin.x])
  lines(slingshot::as.SlingshotDataSet(sling.res))
}

par(oripar)

Time for this code chunk to run : 0.426 s

NOTE : Slingshot is also available through the dynverse !

4 Limits

  • No one method to rule them all (some are or more efficient for some specific types of topology, terrible with others)
  • Dimension reduction : in the name of interpretation/visualization (and CPU time), but is it relevant ?
    • Cooley et al. deposited a paper in 2019 on biorXiv (slight warning : thus, non-peer-reviewed yet), updated very recently, stating that trajectory inference based on reduced spaces fail in every case !
    • The only working case according to them : using PCA with … almost all dimensions ?!
      • Not a reduced space anymore ?
    • Why : all tested reduction methods sacrifice too much information to be able to represent the true topology right without breakage, just so that we can visualize it with our poor human brains
    • So, how to deal with it ?
      • As a Maths teacher told me, “Mathematics are an art of reasoning right on wrong images”. However, in our case we would call it observer bias!
      • A solution could be to generate pseudotime values in a high-dimension space (like the recommended "almost-full’ PCA), without visualization.
        • Drawback : how to evaluate, how to interpret such results ? The question will remain open for today.
logo


6 R session

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dynutils_1.0.9 dynwrap_1.2.2  purrr_0.3.4    tidyr_1.1.4    dplyr_1.0.7   
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                  reticulate_1.22            
##   [3] tidyselect_1.1.1            htmlwidgets_1.5.4          
##   [5] grid_4.1.2                  ranger_0.13.1              
##   [7] Rtsne_0.15                  munsell_0.5.0              
##   [9] codetools_0.2-18            ica_1.0-2                  
##  [11] future_1.23.0               miniUI_0.1.1.1             
##  [13] gng_0.1.0                   colorspace_2.0-2           
##  [15] Biobase_2.54.0              highr_0.9                  
##  [17] knitr_1.37                  rstudioapi_0.13            
##  [19] SingleCellExperiment_1.16.0 stats4_4.1.2               
##  [21] Seurat_4.0.6                ROCR_1.0-11                
##  [23] TInGa_0.0.0.9000            tensor_1.5                 
##  [25] listenv_0.8.0               MatrixGenerics_1.6.0       
##  [27] labeling_0.4.2              dynplot_1.1.2              
##  [29] GenomeInfoDbData_1.2.7      polyclip_1.10-0            
##  [31] farver_2.1.0                rprojroot_2.0.2            
##  [33] parallelly_1.30.0           vctrs_0.3.8                
##  [35] generics_0.1.1              xfun_0.29                  
##  [37] GenomeInfoDb_1.30.0         R6_2.5.1                   
##  [39] GA_3.2.2                    graphlayouts_0.8.0         
##  [41] DelayedArray_0.20.0         bitops_1.0-7               
##  [43] spatstat.utils_2.3-0        assertthat_0.2.1           
##  [45] promises_1.2.0.1            scales_1.1.1               
##  [47] ggraph_2.0.5                nnet_7.3-16                
##  [49] lmds_0.1.0                  gtable_0.3.0               
##  [51] dynfeature_1.0.0            babelwhale_1.0.3           
##  [53] globals_0.14.0              processx_3.5.2             
##  [55] goftest_1.2-3               tidygraph_1.2.0            
##  [57] rlang_0.4.12                randomForestSRC_3.0.0      
##  [59] splines_4.1.2               lazyeval_0.2.2             
##  [61] princurve_2.1.6             spatstat.geom_2.3-1        
##  [63] checkmate_2.0.0             yaml_2.2.1                 
##  [65] reshape2_1.4.4              abind_1.4-5                
##  [67] backports_1.4.1             httpuv_1.6.5               
##  [69] Hmisc_4.6-0                 DiagrammeR_1.0.6.1         
##  [71] tools_4.1.2                 ggplot2_3.3.5              
##  [73] ellipsis_0.3.2              spatstat.core_2.3-2        
##  [75] jquerylib_0.1.4             RColorBrewer_1.1-2         
##  [77] BiocGenerics_0.40.0         ggridges_0.5.3             
##  [79] Rcpp_1.0.7                  plyr_1.8.6                 
##  [81] sparseMatrixStats_1.6.0     zlibbioc_1.40.0            
##  [83] base64enc_0.1-3             visNetwork_2.1.0           
##  [85] RCurl_1.98-1.5              TrajectoryUtils_1.0.0      
##  [87] ps_1.6.0                    rpart_4.1-15               
##  [89] deldir_1.0-6                pbapply_1.5-0              
##  [91] viridis_0.6.2               cowplot_1.1.1              
##  [93] S4Vectors_0.32.3            dynparam_1.0.2             
##  [95] zoo_1.8-9                   SummarizedExperiment_1.24.0
##  [97] SeuratObject_4.0.4          ggrepel_0.9.1              
##  [99] cluster_2.1.2               magrittr_2.0.1             
## [101] data.table_1.14.2           scattermore_0.7            
## [103] carrier_0.1.0               lmtest_0.9-39              
## [105] RANN_2.6.1                  fitdistrplus_1.1-6         
## [107] matrixStats_0.61.0          hms_1.1.1                  
## [109] patchwork_1.1.1             mime_0.12                  
## [111] evaluate_0.14               xtable_1.8-4               
## [113] jpeg_0.1-9                  IRanges_2.28.0             
## [115] gridExtra_2.3               testthat_3.1.1             
## [117] compiler_4.1.2              tibble_3.1.6               
## [119] KernSmooth_2.23-20          crayon_1.4.2               
## [121] htmltools_0.5.2             proxyC_0.2.4               
## [123] mgcv_1.8-38                 later_1.3.0                
## [125] tzdb_0.2.0                  Formula_1.2-4              
## [127] RcppParallel_5.1.5          DBI_1.1.2                  
## [129] tweenr_1.0.2                slingshot_2.0.0            
## [131] MASS_7.3-54                 data.tree_1.0.0            
## [133] Matrix_1.4-0                readr_2.1.1                
## [135] cli_3.1.0                   parallel_4.1.2             
## [137] igraph_1.2.11               GenomicRanges_1.46.1       
## [139] pkgconfig_2.0.3             dyneval_0.9.9              
## [141] foreign_0.8-81              plotly_4.10.0              
## [143] spatstat.sparse_2.1-0       foreach_1.5.1              
## [145] bslib_0.3.1                 XVector_0.34.0             
## [147] stringr_1.4.0               digest_0.6.29              
## [149] dyndimred_1.0.4             sctransform_0.3.2          
## [151] RcppAnnoy_0.0.19            spatstat.data_2.1-2        
## [153] rmarkdown_2.11              leiden_0.3.9               
## [155] htmlTable_2.4.0             uwot_0.1.11                
## [157] DelayedMatrixStats_1.16.0   shiny_1.7.1                
## [159] lifecycle_1.0.1             nlme_3.1-153               
## [161] jsonlite_1.7.2              desc_1.4.0                 
## [163] viridisLite_0.4.0           fansi_1.0.0                
## [165] pillar_1.6.4                lattice_0.20-45            
## [167] fastmap_1.1.0               httr_1.4.2                 
## [169] survival_3.2-13             glue_1.6.0                 
## [171] remotes_2.4.2               png_0.1-7                  
## [173] iterators_1.0.13            ggforce_0.3.3              
## [175] stringi_1.7.6               sass_0.4.0                 
## [177] latticeExtra_0.6-29         irlba_2.3.5                
## [179] future.apply_1.8.1
---
title: "SinCellTE 2022<BR>Secondary Analyses : Trajectory Inference"
author: "Bastien JOB<BR>bastien.job@gustaveroussy.fr"
date: "2022-01-11"
output:
  html_document: 
    background: black
    fig_height: 10
    fig_width: 15
    highlight: espresso  ## Theme for the code chunks
    number_sections: yes  ## Adds number to headers (sections)
    theme: flatly  ## CSS theme for the HTML page
    toc: yes  ## Adds a table of content
    toc_float:  ## TOC options
      collapsed: yes  ## By default, the TOC is folded
      smooth_scroll: yes ## Smooth scroll of the HTML page
    self_contained: yes ## Includes all plots/images within the HTML
    code_download: yes ## Adds a button to download the Rmd
always_allow_html: yes ## Allow plain HTML code in the Rmd
header-includes:
   # - \usepackage{gifski}
---

<!-- Automatically computes and prints in the output the running time for any code chunk -->
```{r timehook, echo = FALSE}
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now)
      # return a character string to show the time
      paste0('Time for this code chunk to run : ', round(x = res, digits = 3), ' s')
    }
  }
}))
# knitr::opts_chunk$set(time_it = FALSE)
```

<!-- Allows to hide the TOC by default, display it with a button, move it to the right or left of the page -->
`r Hmisc::hidingTOC(buttonLabel = 'Show TOC', hidden = TRUE, tocSide = 'left', buttonSide='left', posCollapse = 'margin', levels = 3)`

```{r userid, echo = FALSE}
user.id <- Sys.getenv('USER')
```

<center>![logo](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/baniere_titre.png)</center>
<br><br>

# A bit of theory

## Why ?

***Because ...***

* ... cells in a defined environment (body, organoid, ...) are :
  1. possibly of different kinds (cell types)
  2. able to temporaly evolve from a state to another (ex : stem to differentiated)
  3. not (all) synchronous ...
  
* ... and a single cell experiment is an observation of *numerous* cells of *various types* in a *common context* :
  1. Various types : states ?
  2. Common context : lineage ?
  3. Numerous : internal variation in each state ?

<center>![logo](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/itsamatch.jpeg)</center>

## How ?

### Main steps

The basic principle can be divided into 3 main steps :

* Inferring a schematic representation of the transitional behavior of some cells : deciphering a topology from the data
  * Which data ?
    * Cell by feature matrix ? (spoiler : ***NO***)
    * A reduced space ?
      * Which method ?
      * Number of dimensions ?
  * Which topology ?
    * Linear ?
    * Circular ?
    * Poly-cyclic (ie, nested cycles)?
    * Bifurcations ?
    * Multi-furcations ?
    * Simple / complex tree ?
    * A combination of some of all above ?
    * Single one / multiple disconnected ?
    * Oriented or not (rooting) ?

<center>![logo](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/topologies.png)</center>
<br>

* Positioning each cell in the defined topology : defining cells pseudotimes
  * Directly / Proxy (clusters) ?
  * Direction : single / multiple roots ? Which one(s) ?

### Inferring a topology

* Most methods rely on identifying a minimal spanning tree ([MST](https://en.wikipedia.org/wiki/Minimum_spanning_tree)), with specific variations (to allow cycles, by example) and tweaks (branch pruning).

* The key idea in MST is to define the simplest path that connects all cells (vertices) using ponderated edges (their distance in the reduced space)

<center>![logo](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/mst.jpg)</center>
<br>

### Defining cells pseudotime

Once a topology has been created, pseudotime of a cell consists in the rank of the cell projection to its closest topology backbone point

<center>![logo](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/pseudotime.png)</center>
<br>

### Historically

* First methods (circa 2014)
  * Strong priors :
    * Reduced space (PCA)
    * Clustering results (~ cell states) or inferred cell types
    * Expected topology type
      * Single one
      * Simple : linear (direct or cyclic), bifurcation tree
* Evolution :
  * Lower priors (just space +/- clusters +/- root)
    * No mandatory input topology type
  * Better-fited types of reduced space (tSNE, uMAP, LLE, [Diffusion maps](https://www.sciencedirect.com/science/article/pii/S1063520306000546), [Poincarre maps](https://pubmed.ncbi.nlm.nih.gov/32528075/), [GPLVM](https://pyro.ai/examples/gplvm.html), ...)

### Methods : Welcome to the dynverse !

Inferring any kind of topology is a complex task :

>"It is perhaps surprising that of the 59 methods in existence today [at the time of the publication], almost all methods have a unique combination of [these prior + reduced space + topology inferring method] characteristics" ([dynverse publication](https://pubmed.ncbi.nlm.nih.gov/30936559/), 2018)

* Multiple (dozens) of methods are available, either through R or python packages, or independent binaries.

* Wouter Saelens and Robrecht Cannoodt did a tremendous work setting up the ***dynverse***, a framework that :
  * re-implements/wraps 55 existing methods *(dynmethods)*
  * harmonizes them in a common way of setting inputs and parameters, getting results *(dyno, dynparam)*
  * is compatible with any novel method for new developers *(dynwrap)*
  * contains metrics to compare methods
  * allows a comparison of numerous methods *(dyneval)*
  * is available as reference map to help select the best method based on user context *(dynguidelines)*

```{r dynguid, eval = FALSE, attr.source='.numberLines'}
dynguidelines::guidelines_shiny()
```

<center>![logo](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/dynguidelines.png)</center>
<br>

# Practice

## Prerequisites

### This Rmarkdown

Copy this Rmd file (and its dependences) from the common 'Courses' dir to your working directory :

```{r rmdcp, echo = TRUE, attr.source='.numberLines', results = 'hide'}
## Working directory
user.id <- Sys.getenv('USER')
root.dir <- '/shared/projects/sincellte_2022'
script.dir <- paste(c(root.dir, 'Courses', 'Secondary_analyses',
                      'scripts'), collapse = '/')
work.dir <- paste(c(root.dir, user.id, 'Courses', 'Secondary_analysis',
  'trajectory_analysis'), collapse = '/')
dir.create(path = paste(c(work.dir, 'trajectory_rmd_resources'), collapse = '/'), 
           recursive = TRUE, showWarnings = FALSE)
## Copying the Rmd
file.copy(from = paste0(script.dir, '/Trajectory_TD_deliverable.Rmd'),
          to = work.dir, overwrite = TRUE)
## Copying the dependencies
file.copy(from = list.files(
  path = paste0(script.dir, '/trajectory_rmd_resources/'), full.names = TRUE), 
  to = paste0(work.dir, '/trajectory_rmd_resources/'), overwrite = TRUE)
## Moving to your working directory
setwd(work.dir)
## If you want to visualize your files in Rstudio, you can click the little greyish/white arrow on the right of the path written at the top of your Rstudio Console.
```

### The dataset

* This practice is inspired by this [session](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/scrna-JUPYTER-trajectories/tutorial.html) from this [Galaxy training](https://training.galaxyproject.org/training-material/topics/transcriptomics/)
* The dataset you will use :
  * Comes from [Bacon et al, Frontiers in Immunology, 2018].
  * Is available as a [scanpy](https://scanpy.readthedocs.io/en/stable/) h5ad object (tool-specific format based upon the [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format#HDF5) standard)
  * Is hosted at Zenodo [#4726927](https://zenodo.org/record/4726927)
  * Consists in a [drop-seq](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481139/) preparation of thymic cells in differenciation from 7 neonatal mice, selected by mass flow cytometry using 22 immunologic markers corresponding to multiple T-cell types and subtypes (see embedded Excel file)
  * For this practical session, will focus on these subtypes :
    * Double-negative **(DN)** : Early immature cells
    * Double-positive Module 1 **(DP-M1)** : Intermediate state 1
    * Double-positive Module 2 **(DP-M2)** : Intermediate state 2
    * Double-positive Module 3 **(DP-M3)** : Intermediate state 3
    * Double-positive Module 4 **(DP-M4)** : Intermediate state 4
    * Double-positive Late **(DP-L)** : Immature late state
    * Mature T-cells **(T-mat)**
    
```{r markers, echo=FALSE}
## Cell markers
xfun::embed_file(paste0('/shared/projects/sincellte_2022/', user.id, '/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/markers_celltypes.xlsx'))
```

### The environment

The R packages needed for this tutorial are :

* CRAN :
  * [Seurat](https://cran.r-project.org/web/packages/Seurat/index.html)
  * [dynwrap](https://cran.r-project.org/web/packages/dynwrap/index.html)
    * [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html)
  * [dynplot](https://cran.r-project.org/web/packages/dynplot/index.html)
  * [dynguidelines](https://cran.r-project.org/web/packages/dynguidelines/index.html)
  * [patchwork](https://cran.r-project.org/web/packages/patchwork/index.html)
  * [RColorBrewer](https://cran.r-project.org/web/packages/RColorBrewer/index.html)
  * [viridis](https://cran.r-project.org/web/packages/viridis/index.html)
* BioConductor :
  * [slingshot](https://bioconductor.org/packages/release/bioc/html/slingshot.html)
* Github :
  * [mojaveazure/seurat-disk](https://github.com/mojaveazure/seurat-disk)
  * [Helena-todd/TInGa/package](https://github.com/Helena-todd/TInGa)
    * [rcannood/gng](https://github.com/rcannood/gng)

Additional packages needed to build this HTML report :

* CRAN :
  * [gifski](https://cran.r-project.org/web/packages/gifski/index.html)    *To generate GIF animations of markers expression*
  * [knitr](https://cran.r-project.org/web/packages/knitr/index.html)     *To set the Rmarkdown hook to compute CPU time of chunks*
  * [Hmisc](https://cran.r-project.org/web/packages/Hmisc/index.html)     *To set the button-hidden TOC*


## Preprocessing steps

***This is just a description of the preparation I performed on the data, building the the object you will work on. This is not part of the practical session.***

<details><summary>*Unfold*</summary>

### Setting our variables

```{r settings, attr.source='.numberLines', eval = FALSE}
## Remote input file
remote.h5ad.file <- 'https://zenodo.org/record/4726927/files/Final_cell_annotated_object.h5ad?download=1'
## Working directory
user.id <- Sys.getenv('USER')
root.dir <- '/shared/projects/sincellte_2022'
data.dir <- paste(c(root.dir, 'Courses', 'Secondary_analyses', 'input', 'trajectory_analysis'), collapse = '/')
work.dir <- paste(c(root.dir, user.id, 'Courses', 'Secondary_analysis', 'trajectory_analysis'), collapse = '/')
## Seurat assay name
assay <- 'RNA'
## Contaminating cell types
contam.types <- c('Macrophages', 'RBC')
## RNG seed (for irlba PCA)
my.seed <- 1337L
## HVGs to keep
nfeatures <- 3000
## 1st reduction method
reduction <- 'pca'
## Number of PCs to produce
pca.dims <- 100
## Number of PCs to retain for 2nd reduction
maxdim <- 30
## Use scaled data instead of unscaled
use.scaled <- TRUE
## Cell type to use as root to orient the trajectory
root.group <- 'DN'
```

### Loading source data

* First, the dataset was retrieved from Zenodo

```{r datadl, attr.source='.numberLines', eval = FALSE}
dir.create(path = work.dir, recursive = TRUE, showWarnings = FALSE)
local.root.file <- paste0(work.dir, '/Original_scanpy_object.h5')
local.h5ad.file <- paste0(local.root.file, 'ad')
download.file(url = h5ad.file, destfile = local.h5ad.file,
              method = 'wget', quiet = TRUE)
```

* This object was then converted to a Seurat-compatible HDF5 object using the [SeuratDisk::Convert()](https://rdrr.io/github/mojaveazure/seurat-disk/man/Convert.html) function from the  [SeuratDisk](https://github.com/mojaveazure/seurat-disk) package.

```{r h5conv, attr.source='.numberLines', eval = FALSE}
SeuratDisk::Convert(source = local.h5ad.file, dest = 'h5seurat',
                    assay = assay, verbose = FALSE, overwrite = TRUE)
```

* The converted object was loaded into R as a SeuratObject using the [SeuratDisk::LoadH5Seurat()](https://rdrr.io/github/mojaveazure/seurat-disk/man/LoadH5Seurat.html) function. All optional slots (neighbors, graphs, reductions, images), won't be loaded as we want to reprocess them.

```{r h5S4load, attr.source='.numberLines', eval = FALSE}
local.h5s.file <- paste0(local.root.file, 'seurat')
sobj <- SeuratDisk::LoadH5Seurat(file = local.h5s.file, assays = assay,
                                 reductions = FALSE, graphs = FALSE,
                                 neighbors = FALSE, images = FALSE)
```

### Prepping the object to our context

* The downloaded object contains cell-type annotation from the original analysis. From the [publication](https://www.frontiersin.org/articles/10.3389/fimmu.2018.02523/full), we learnt that two cell types are not related to T cells, thus considered as contaminants : **macrophages** and **red blood cells (RBC)**. As their presence in the analysis space can modify the shape of the cells position, thus impact trajectory analysis, we decided to remove them :

```{r nonTout, attr.source='.numberLines', eval = FALSE}
sobj <- sobj[, !(sobj$cell_type %in% contam.types)]
```

* We then follow the **Seurat4** canonical pipeline, applying **log-normalization** to the count data ...

```{r S4lognorm, attr.source='.numberLines', eval = FALSE}
sobj <- Seurat::NormalizeData(object = sobj, assay = assay)
```

* ... then **scaling**. As this object is a merge of 7 experiments (labelled as 'batch' in the object metadata), we want to regress this effect.

```{r S4scaling, attr.source='.numberLines', eval = FALSE}
sobj <- Seurat::ScaleData(object = sobj, assay = assay,
                          vars.to.regress = c('batch'))
```

* ... then identify **HVG**s. As we focus on subtypes of a single, main cell type (T-cells) in a differentiation context, we anticipate that we may have limited variation of gene expression (at least compared to, by example, multiple cell types of a differentiated organ). Thus, we increase the default number of features to output from 2,000 to *3,000*.

```{r hvgs3k, attr.source='.numberLines', eval = FALSE}
sobj <- Seurat::FindVariableFeatures(object = sobj, assay = assay,
                                     verbose = FALSE, nfeatures = nfeatures)
```

* ... then reducing dimensions (through **PCA**). We request 100 PCs as output.

```{r red_pca, attr.source='.numberLines', eval = FALSE}
sobj <- Seurat::RunPCA(object = sobj, assay = assay, npcs = pca.dims,
                       reduction.name = reduction, seed.use = my.seed,
                       verbose = FALSE)
```

* ... then building additional reductions for exploration / visualization (**t-SNE**, **uMAP**). After manual exploration, we decided to retain 30 PCs as input.

```{r red_tsne_umap, attr.source='.numberLines', eval = FALSE}
## TSNE
reduction2t <- paste0(reduction, '.', maxdim, '_tsne')
sobj <- Seurat::RunTSNE(object = sobj, dims = 1:maxdim, assay = assay,
                        reduction = reduction, seed.use = my.seed,
                        verbose = FALSE, reduction.name = reduction2t)
## uMAP
reduction2u <- paste0(reduction, '.', maxdim, '_umap')
sobj <- Seurat::RunUMAP(object = sobj, dims = 1:maxdim, assay = assay,
                        reduction = reduction, seed.use = my.seed,
                        verbose = FALSE, reduction.name = reduction2u)
Seurat::DimPlot(object = sobj, reduction = reduction2u, group.by = 'cell_type',
                seed = my.seed)
```

* Lastly, we save the processed object (as a RDS file).

```{r savepobj, attr.source='.numberLines', eval = FALSE}
saveRDS(object = sobj, file = paste0(work.dir,'/Thymus_neonat_mm_Seurat_ENSEMBL.RDS'),
        compress = 'bzip2')
```

</details>

## Trajectory analysis using TInGa

We will use a tool named [TInGa](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i66/5870517) to perform this analysis, due to its convenience to our educational purpose :

  * It's in full R
  * It's very fast
  * Its parameters are limited (as compared to other renowned tools like **Monocle3**)
  * It is compatible with most topologies (see [here](https://academic.oup.com/view-large/figure/205480453/btaa463f2.tif))
  * It's integrated in the versatile **dynverse** (as a plugin)

### Setting parameters

```{r Tsettings, attr.source='.numberLines', eval = TRUE}
## Working directory
user.id <- Sys.getenv('USER')
root.dir <- '/shared/projects/sincellte_2022'
data.dir <- paste(c(root.dir, 'Courses', 'Secondary_analyses', 'input',
                    'trajectory_analysis'), collapse = '/')
work.dir <- paste(c(root.dir, user.id, 'Courses', 'Secondary_analysis',
                    'trajectory_analysis'), collapse = '/')
if(!dir.exists(work.dir)) dir.create(path = work.dir, recursive = TRUE)
## Local input file
## The file created if you performed the preprocessing by yourself
# local.Seurat.file <- paste0(work.dir, '/Thymus_neonat_mm_Seurat_ENSEMBL.RDS')
## The preprocessed file I prepared for you
local.Seurat.file <- paste0(data.dir, '/Thymus_neonat_mm_Seurat_ENSEMBL.RDS')
## Seurat assay name
assay <- 'RNA'
## RNG seed (for irlba PCA)
my.seed <- 1337L
## 1st reduction method
reduction <- 'pca'
## 2nd reduction method
reduction2 <- 'pca.30_umap'
## Use scaled data instead of unscaled
use.scaled <- TRUE
## Cell type to use as root to orient the trajectory
root.group <- 'DN'
## Marker genes to plot
trans.markers <- c('Il2ra', 'Npm1', 'Nusap1', 'Ube2c',
                   'Pclaf','Cd8b1', 'Itm2a', 'Cd52', 'H2-D1')
```

### Loading source data

We will simply load the pre-processed Seurat object, contained in a RDS archive, as input :

```{r Tload, attr.source='.numberLines', eval = TRUE, time_it = TRUE}
sobj <- readRDS(file = local.Seurat.file)
```

### Performing trajectory analysis

#### The reduced space

First, to impregnate us with the cells topology on which TInGa will work, we can display our 2D dimension reduction (uMAP or t-SNE) :

```{r Sdim2plot, attr.source='.numberLines', eval = TRUE, fig.width = 15, fig.height = 5}
## With batches
db <- Seurat::DimPlot(object = sobj, reduction = reduction2, group.by = 'batch',
                      pt.size = 2) + Seurat::DarkTheme()
## With cell types
dct <- Seurat::DimPlot(object = sobj, reduction = reduction2, group.by = 'cell_type',
                       pt.size = 2) + Seurat::DarkTheme()
## With sex
ds <- Seurat::DimPlot(object = sobj, reduction = reduction2, group.by = 'sex',
                      pt.size = 2) + Seurat::DarkTheme()
print(patchwork::wrap_plots(list(db, dct, ds)))
```

```{r rm1, echo = FALSE}
rm(db, dct, ds)
```

#### Extracting data

Using TInGa first requires to build a specific object from the count and normalized matrices, that we will extract from our Seurat object :

```{r matextract, attr.source='.numberLines', eval = TRUE}
## Matrices are required to be formatted as [cells] x [features], thus a transposition is required
## Normalized matrix
my.exp <- t(as.matrix(if(use.scaled) sobj@assays[[assay]]@scale.data
                      else sobj@assays[[assay]]@data)) 
### Ordering to keep synch in the [use.scaled] context
my.exp <- my.exp[, order(colnames(my.exp))]
## Count matrix
my.counts <- t(as.matrix(if(use.scaled) sobj@assays[[assay]]@counts[rownames(sobj@assays[[assay]]@counts)%in% colnames(my.exp),]
                         else sobj@assays[[assay]]@counts))
### Ordering to keep synch in the [use.scaled] context
my.counts <- my.counts[, order(colnames(my.counts))]
```

#### Building a dynwrap object

We can now build our object (a **dynwrap** object, which actually consists in a list with mandatory entries) :

```{r Tobj, attr.source='.numberLines', eval = TRUE, time_it = TRUE}
my.data <- dynwrap::wrap_expression(
  id = sobj@project.name,
  expression = my.exp,
  counts = my.counts
)
is(my.data)
```

```{r rm2, echo = FALSE}
rm(my.exp, my.counts)
```

#### Import prior

We can import our **reduced space** (t-SNE or uMAP) as a prior, which will be used by TInGa to build the trajectory :

```{r Tredimport, attr.source='.numberLines', eval = TRUE}
my.data <- dynwrap::add_prior_information(
  dataset = my.data,
  dimred = sobj@reductions[[reduction2]]@cell.embeddings
)
```

#### Running TInGa

Now, we can run TInGa !

```{r Tone, attr.source='.numberLines', eval = TRUE, fig.align = 'center', time_it = TRUE}
## Performing TInGa analysis
set.seed(my.seed)
my.traj <- dynwrap::infer_trajectory(dataset = my.data, method = TInGa::gng_param2(),
                                     seed = my.seed, give_priors = 'dimred',
                                     parameters = list(max_nodes = 8, lambda = 200))
## Adding pseudotime
my.traj <- dynwrap::add_pseudotime(trajectory = my.traj)
## Plotting with pseudotime
dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE,
                     color_cells = 'pseudotime') + Seurat::DarkTheme()
```

#### Adding cell types

We can import our cell types (as a metadata for later plots) in the trajectory object :

```{r Ttypeimport, attr.source='.numberLines', eval = TRUE, fig.align = 'center'}
## Plotting cell types as grouping
my.traj <- dynwrap::add_grouping(dataset = my.traj,
                                 grouping = as.character(sobj$cell_type))
## Plotting with cell types
dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE,
                     color_cells = "grouping") + Seurat::DarkTheme()
```

#### Adding a root

* As we know our root cell type (DN), we can give a direction to the trajectory. However, TInGa requires a **unique root cell**, not a root **cluster**... Thus, we will first identify a putative root cell by computing the centroid of the DN cells in the reduced space :

```{r Tcentroids, attr.source='.numberLines', eval = TRUE, fig.width = 10, fig.height = 5, fig.align = 'center', time_it = TRUE}
if (!is.null(root.group)) {
  ## Identifying cells of the root group
  cells_in_root.group <- which(my.traj$grouping == root.group)
  ## Computing the root group medoid in the reduced space
  root_group.medoid <- apply(my.traj$dimred[cells_in_root.group,], 2, median)
  ## Computing the root group cells distance to root group medoid
  cells.dist_to_medoid <- vapply(cells_in_root.group, function(x) { norm(as.matrix(my.traj$dimred[x,] - root_group.medoid), 'F')}, .1)
  ## Identifying the root cell
  root.cell.id <- my.traj$cell_info$cell_id[cells_in_root.group[which.min(cells.dist_to_medoid)]]
  message(paste0(root.group, ' root cell : ', root.cell.id))
  ## Plotting cell types
  dct <- Seurat::DimPlot(object = sobj, reduction = reduction2, 
                         group.by = 'cell_type', pt.size = 2) + Seurat::DarkTheme()
  ## Plotting the root cell
  sobj$root_cell <- FALSE
  sobj$root_cell[colnames(sobj) == root.cell.id] <- TRUE
  drc <- Seurat::DimPlot(object = sobj, reduction = reduction2, 
                         group.by = 'root_cell', pt.size = c(rep(2, ncol(sobj)-1), 5),
                         order = TRUE) + Seurat::DarkTheme()
  print(patchwork::wrap_plots(list(dct, drc)))
}
```

```{r rm3, echo = FALSE}
rm(dct, drc)
```


* Rooted plots

```{r Trooted, attr.source='.numberLines', eval = TRUE, fig.width = 10, fig.height = 5, fig.align = 'center'}
## Adding root cell
my.traj <- dynwrap::add_root(trajectory = my.traj, root_cell_id = root.cell.id)
## Plotting with cell types
dct <- dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                            color_cells = "grouping") + Seurat::DarkTheme()
## Plotting with pseudotime
dpt <- dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                            color_cells = "pseudotime") + Seurat::DarkTheme()
print(patchwork::wrap_plots(list(dct, dpt)))
```

```{r rm4, echo = FALSE}
rm(dct, dpt)
```

#### Plotting marker genes

* First, we'll replace EnsemblIDs with MGI Symbols (easier to handle)
  
```{r addsymb, attr.source='.numberLines', eval = TRUE}
## Injecting symbols
symbvec <- as.character(sobj@assays$RNA@meta.features$Symbol)
names(symbvec) <- sobj@assays$RNA@meta.features$ID
colnames(my.data$counts) <- colnames(my.data$expression) <- unname(symbvec[colnames(my.data$expression)])
```
  
* Plotting markers :
  * From milestone 1 to 3 :

```{r TMm13, attr.source='.numberLines', eval = TRUE, fig.width = 15, fig.height= 15, fig.align = 'center', time_it = TRUE}
## Milestones 1 to 3 ...
tmplots <- sapply(trans.markers[1:4], function(x) {
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme()
  }, simplify = FALSE)
print(patchwork::wrap_plots(tmplots))
```

```{r rm5, echo = FALSE}
rm(tmplots)
```

  * From milestone 1 to 2 :

```{r TMm12, attr.source='.numberLines', eval = TRUE, fig.width = 15, fig.height= 15, fig.align = 'center', time_it = TRUE}
## Milestones 1 to 2 ...
tmplots <- sapply(trans.markers[c(1:2,5:9)], function(x) {
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme()
  }, simplify = FALSE)
print(patchwork::wrap_plots(tmplots))
```

```{r rm6, echo = FALSE}
rm(tmplots)
```

#### Animated !

* From milestone 1 to 3 :

```{r Tanim2a, attr.source='.numberLines', animation.hook="gifski", interval = .33, eval = FALSE, echo = TRUE}
## Milestones 1 to 3 ...
for (x in trans.markers[1:4]) print(
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme())
```

<center>![anim2a](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/Tanim2a.gif)</center><br>

* From milestone 1 to 2 :

```{r Tanim2b, attr.source='.numberLines', animation.hook="gifski", interval = .33, eval = FALSE, echo = TRUE}
## Milestones 1 to 2 ...
for (x in trans.markers[c(1:2,5:9)]) print(
  dynplot::plot_dimred(trajectory = my.traj, label_milestones = TRUE, 
                       color_cells = "feature", feature_oi = x, 
                       expression_source = my.data$expression) + Seurat::DarkTheme())
```

<center>![anim2b](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/Tanim2b.gif)</center><br>

# Trajectory analysis using Slingshot

* [Slingshot](https://bioconductor.org/packages/release/bioc/html/slingshot.html) is a way more commonly used trajectory inference tool than TInGa *(that I chose for educational purpose, among other things)*.
* Like TInGa, it's a full-R, easy to use tool that is compatible with most trajectory topologies.
* However, it does have a key difference : compared to TInGa, it additionally requires a cell clustering results. This may be either an advantage (if you are confident in the capacity of you clustering results to reflect the biology, espacially the cells lineage) or a con (if you are not).

* Running Slingshot

```{r slingrun, attr.source='.numberLines', time_it = TRUE}
## Run slingshot with root cluster
sling.res <- slingshot::slingshot(data = sobj@reductions$pca.30_umap@cell.embeddings,
                                  clusterLabels = sobj$cell_type, start.clus = 'DN')
## Plot results (lineage)
plot(sobj@reductions$pca.30_umap@cell.embeddings, 
     col = RColorBrewer::brewer.pal(9,"Set1")[sobj$cell_type], 
     asp = 1, pch = 16)
lines(slingshot::as.SlingshotDataSet(sling.res), type = 'lineages')
## Plot results (fitted curve)
plot(sobj@reductions$pca.30_umap@cell.embeddings, 
     col = RColorBrewer::brewer.pal(9,"Set1")[sobj$cell_type], 
     asp = 1, pch = 16)
lines(slingshot::as.SlingshotDataSet(sling.res))
```

Slingshot identified 3 lineages, much in concordance with TInGa results !

* Getting cells pseudotimes (per lineage)

```{r slingpt, attr.source='.numberLines', fig.width = 15, fig.height = 5, time_it = TRUE}
## Getting pseudotimes
my.pseudo <- slingshot::slingPseudotime(x = sling.res)
pal <- viridis::viridis(100, end = 0.95)

oripar <- par(no.readonly = TRUE)
par(mfrow = c(1,3))
for (lin.x in 1:ncol(my.pseudo)) {
  colors <- pal[cut(my.pseudo[,lin.x], breaks = 100)]
  plot(sobj@reductions$pca.30_umap@cell.embeddings, col = colors, 
       pch = 16, cex = 0.5, main = colnames(my.pseudo)[lin.x])
  lines(slingshot::as.SlingshotDataSet(sling.res))
}
par(oripar)
```

**NOTE** : Slingshot is also available through the [dynverse](https://dynverse.org/reference/dynmethods/method/ti_slingshot/) !

# Limits

  * No one method to rule them all (some are or more efficient for some specific types of topology, terrible with others)
  * Dimension reduction : in the name of interpretation/visualization *(and CPU time)*, but *is it relevant* ?
    * [Cooley et al.](https://www.biorxiv.org/content/10.1101/689851v6) deposited a paper in 2019 on biorXiv (slight warning : thus, non-peer-reviewed yet), updated very recently, stating that trajectory inference based on reduced spaces **fail in every case** !
    * The only working case according to them : using PCA with ... **almost all dimensions** ?!
      * *Not a reduced space anymore ?*
    * Why : all tested reduction methods sacrifice too much information to be able to represent the true topology right without breakage, just so that we can visualize it with our [poor human brains](https://youtu.be/MtObrnaovrI) ...
    * So, how to deal with it ?
      * As a Maths teacher told me, ***"Mathematics are an art of reasoning right on wrong images"***. However, in our case we would call it observer bias!
      * A solution could be to generate pseudotime values in a high-dimension space (like the recommended "almost-full' PCA), *without visualization*.
        * Drawback : how to evaluate, how to interpret such results ? The question will remain open for today.


<center>![logo](/shared/projects/sincellte_2022/`r user.id`/Courses/Secondary_analysis/trajectory_analysis/trajectory_rmd_resources/abstraction.png)</center>
<br>

# Links

* Publications :
  * dynverse [\@nature biotech](https://www.nature.com/articles/s41587-019-0071-9) [\@biorxiv](https://www.biorxiv.org/content/biorxiv/early/2018/03/05/276907.full.pdf)
  * [TInGa](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i66/5870517)
  * [Slingshot](https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/s12864-018-4772-0.pdf)
  * An interesting [comment/review](https://prelights.biologists.com/highlights/a-novel-metric-reveals-previously-unrecognized-distortion-in-dimensionality-reduction-of-scrna-seq-data/) of the recent Cooley paper about the failure of dimension reduction in TI.
* A hitchhiker's guide to the [dynverse](https://dynverse.org/users/) *(DON'T PANIC!)*
* A quick [Slingshot tutorial](https://bustools.github.io/BUS_notebooks_R/slingshot.html#slingshot) by the team behind Kallisto/BUStools
* Recent tools to keep an eye on :
  * [VIA](https://www.nature.com/articles/s41467-021-25773-3) (python)
  * [LISA2](https://www.frontiersin.org/articles/10.3389/fgene.2021.681206/full) (R)

# R session

```{r sessioninfo, echo = FALSE}
sessionInfo()
```
