sincellTE 2024 - RNA velocity analysis with velociraptor

Published

Oct 22, 2024

Objectives

RNA velocity is a popular single cell trajectory inference method. The objective of this session is to get an overview of the basic principles of performing RNA velocity analysis in R, using the velociraptor package, which provides a wrapper around the scVelo python package.

1. Data preparation

For this session, we will use an example data set from Hermann et al. (2018) about spermatogenesis https://doi.org/10.1016/j.celrep.2018.10.026.

For RNA velocity, spliced and unspliced RNA count matrices are needed. They can be obtained by making sure that introns are kept when running cellranger (--include-introns = TRUE) then performing velocyto on the output bam file. This will provide a .loom file containing expression matrices for both spliced and unspliced RNA.

Then, the data can be loaded into R and analysed. One problem appears: velociraptor takes SingleCellExperiment objects as input and is not compatible with Seurat object.

Note

Just like Seurat objects, SingleCellExperiment class is a lightweight Bioconductor container for storing and manipulating single-cell genomics data. Its organization and syntax is just a little different.

In order to avoid confusion, we will not go into the details. If you need to use this class in the future, you can find documentation here: https://www.bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html

Besides, the single cell data need to be prepared for RNA velocity analysis, according to the steps we already know: filtering, normalization, identification of highly variable genes, dimensionality reduction… Several strategies are possible:

  • use packages compatible with SingleCellExperiment to prepare the data. The following code illustrates these steps but will not be run.
library(scRNAseq)
library(scuttle)
library(scater)
library(scran)
library(LoomExperiment)

# Import velocyto output
loom <- import("path/to/file.loom")

assayNames(loom)

# Extract the matrices from the appropriate assays (adjust the assay names as needed)
count_matrix <- assay(loom, "matrix")  # This is typically the raw count matrix
spliced_matrix <- assay(loom, "spliced")  # The spliced matrix (if present)
unspliced_matrix <- assay(loom, "unspliced")  # The unspliced matrix (if present)

# Check the dimensions of the expression matrix
dim(count_matrix)
dim(spliced_matrix)
dim(unspliced_matrix)

# Create SingleCellExperiment object
sce <- SingleCellExperiment(
    assays = list(
        counts = count_matrix,
        spliced = spliced_matrix,
        unspliced = unspliced_matrix
    )
)

cell_metadata <- colData(loom)
colData(sce) <- cell_metadata

gene_metadata <- rowData(loom)
rowData(sce) <- gene_metadata

# Inspect the object
sce

## Calculate log-normalized counts
sce <- scuttle::logNormCounts(sce)

## Find highly variable genes
dec <- scran::modelGeneVar(sce)
top.hvgs <- scran::getTopHVGs(dec, n = 2000)

## Perform dimensionality reduction
set.seed(100)
sce <- scater::runPCA(sce, subset_row = top.hvgs)
sce <- scater::runTSNE(sce, dimred = "PCA")
sce <- scater::runUMAP(sce, dimred = "PCA", n_neighbors = 30)
  • let velociraptor prepare the data automatically. The user will have less freedom regarding the option values.

  • analyse the data with Seurat and construct a SingleCellExperiment object with the results.

For this session, we will load and visualize data already prepared.

suppressPackageStartupMessages({
  library(tidyverse)
  library(SingleCellExperiment)
  library(scater)
  library(velociraptor)
  library(pheatmap)
  library(ggplot2)
  library(viridis)
})
user.id <- Sys.getenv("USER")
input_file <-paste("/shared/home2", user.id, "velocity/data/hermann.rds", sep = "/")

## load SingleCellExperiment object
sce <- readRDS(input_file)

## Plot tSNE representation
scater::plotTSNE(sce, colour_by = "celltype")

scater::plotUMAP(sce, colour_by = "celltype")

2. RNA velocity estimation

Next, we will perform the RNA velocity estimation.

In velociraptor, this is done using the scvelo() function, which provides a wrapper around the functionality of scVelo.

Take a moment to study the documentation of scvelo() - you will see for example that all the modes of scVelo (steady-state, stochastic, dynamical) can be used.

For our example, we will use the dynamical mode. It takes a bit longer to run than the steady state option, but typically gives more reliable results.

# too long, do not run
top.hvgs <- scran::getTopHVGs(dec, n = 2000)

velo.out <- velociraptor::scvelo(sce, subset.row = top.hvgs,
                                 assay.X = "counts",
                                 mode = "dynamical")
results_file <-paste("/shared/home2", user.id, "velocity/data/velo.out.rds", sep = "/")
velo.out <- readRDS(results_file)
Note

The argument use.theirs (default = FALSE) can be set to TRUE to perform gene filtering and normalization with scVelo.

The scvelo() function performs the full RNA velocity estimation analysis, and in the process calls several functions from scVelo. Many steps are computed during the function execution:

  • compute neighbors and moments
  • estimate the kinetics parameters
  • deduce the velocity (i.e. infer the future cell state)
  • estimate velocity pseudotime and latent time. Both estimates try to provide a relative ordering of the cells, only the computation method is different

Its output of scvelo() is another SingleCellExperiment object.

velo.out
class: SingleCellExperiment 
dim: 2000 2325 
metadata(5): neighbors recover_dynamics velocity_params velocity_graph
  velocity_graph_neg
assays(10): X spliced ... velocity velocity_u
rownames(2000): ENSMUSG00000038015.6 ENSMUSG00000022501.6 ...
  ENSMUSG00000033813.15 ENSMUSG00000029088.16
rowData names(18): fit_r2 fit_alpha ... velocity_genes varm
colnames(2325): CCCATACTCCGAAGAG AATCCAGTCATCTGCC ... ATCCACCCACCACCAG
  ATTGGTGGTTACCGAT
colData names(8): velocity_self_transition root_cells ...
  velocity_confidence velocity_confidence_transition
reducedDimNames(1): X_pca
mainExpName: NULL
altExpNames(0):

The results of scvelo() are not added to the input SingleCellExperiment object, since the two objects can have different dimensions (in particular note the difference in the number of genes). Thus, in order to use information from the velocity calculations together with information from the original object, we need to copy slots from one object to the other.

Here, we copy the reduced dimension representations from the original SingleCellExperiment object to the output of scvelo(), and use the latter to construct tSNE representations colored by various parameters.

## copy dimension reduction into scvelo output
reducedDims(velo.out) <- reducedDims(sce)

## copy cell annotation into scvelo output
velo.out$celltype <- sce$celltype

3. Velocity plots

3.1 Trajectory overview

The results of the velocity estimation can be plotted on the previous low-dimension representation of the data

## plot scvelo results
cowplot::plot_grid(
    scater::plotTSNE(velo.out, colour_by = "celltype"),
    scater::plotTSNE(velo.out, colour_by = "velocity_pseudotime"),
    scater::plotTSNE(velo.out, colour_by = "latent_time"),
    ncol = 1, align = "v"
)

The velociraptor package also provides plotting functions to visualize the RNA velocity results.

For example, we can embed the estimated velocities into a provided low-dimensional space and plot them as a vector field.

embedded <- velociraptor::embedVelocity(reducedDim(velo.out, "TSNE"), velo.out)
ℹ Using the 'X' assay as the X matrix
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_target', embedded velocity vectors (adata.obsm)
grid.df <- velociraptor::gridVectors(reducedDim(velo.out, "TSNE"), embedded)

plotTSNE(velo.out, dimred = "TSNE", colour_by = "velocity_pseudotime") +
    geom_segment(data = grid.df,
                 mapping = aes(x = start.TSNE1, y = start.TSNE2,
                               xend = end.TSNE1, yend = end.TSNE2),
                 arrow = arrow(length = unit(0.05, "inches")))

3.2 Quality checks

Next, we will look at individual genes via their phase portraits. This is often useful in order to understand how the velocities are affected/supported by specific genes.

We can of course plot genes that we are already familiar with and that we know are related to the process of interest. We can also extract genes with particularly strong influence on the velocity results. ‘Driver genes,’ which display a strong dynamic behavior, can for example be detected via their high likelihoods in the dynamic model.

A score called likelihood value of the fit (fit_likelihood) indicates how well a cell is described by the learned phase trajectory. Averaged across all the cells, it can be used to estimate the quality of a gene in the velocity estimation. Here, we select the genes with the largest and smallest likelihood value for the fit and plot their phase portrait.

toplh <- rowData(velo.out) %>% 
  as.data.frame() %>% 
  arrange(desc(fit_likelihood)) %>% 
  slice_head(n=100) %>% 
  rownames()

plotVelocity(velo.out, use.dimred = "TSNE", genes = toplh[1:6],
             color_by = "celltype")

botlh <- rowData(velo.out) %>% 
  as.data.frame() %>% 
  arrange(fit_likelihood) %>% 
  slice_head(n=3) %>% 
  rownames()

plotVelocity(velo.out, use.dimred = "TSNE", genes = botlh,
             color_by = "celltype")
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_point()`).

We will also illustrate two summary statistics returned by scVelo:

  • the length of the velocity vector encodes the speed or rate of differentiation.

  • the ‘velocity confidence’ provides a measure of the coherence of the velocity vector field, that is, how well the velocity vector for a cell correlates with those of its neighbors.

cowplot::plot_grid(
    scater::plotTSNE(velo.out, colour_by = "velocity_length"),
    scater::plotTSNE(velo.out, colour_by = "velocity_confidence"),
    ncol = 1, align = "v"
)

3.3 Gene expression heatmap

A classical plot provided by the python version of scVelo is the expression of genes as a function of pseudotime. In R, it will need a little more efforts.

## add genes counts and log normalized counts to the scvelo object
assay(velo.out, "counts") <- assay(velo.out, "X")
velo.out <- scuttle::logNormCounts(velo.out)

## order cells according to pseudo-time
cell_order <- velo.out@colData %>% 
  as.data.frame() %>% 
  arrange(velocity_pseudotime) %>% 
  rownames()

df <- assay(velo.out, "Ms")[toplh, cell_order]

## smooth signal with convolution
kernel_size <- 30
kernel <- rep(1/kernel_size, kernel_size)  # Simple moving average kernel

df_convolved <- apply(df, 1, function(x) {
  stats::filter(x, filter = kernel, sides = 2)  # Apply convolution with a centered kernel
})

genes_order <- colnames(df_convolved)[order(apply(df_convolved, 2, which.max))]

## extract ordered matrix
df <- (assay(velo.out, "logcounts"))[genes_order,cell_order]

## prepare heatmap annotation
annot <- velo.out@colData[,c("celltype", "latent_time")] %>% 
  as.data.frame()

annot_color <- list(latent_time = c("Blue", "Yellow"))

quantile_breaks <- function(xs, n = 10) {
  breaks <- quantile(xs, probs = seq(0, 1, length.out = n))
  breaks[!duplicated(breaks)]
}

mat_breaks <- quantile_breaks(scale(df), n = 52)

## plot heatmap
pheatmap::pheatmap(df,
         color = viridis(n=51),
         scale = "row",
         breaks = mat_breaks,
         cluster_col = F,
         annotation_col = annot,
         annotation_colors = annot_color,
         show_colnames = F,
         cluster_rows = F,
         fontsize = 6)

R/python interoperability

In this tutorial, we will use the velociraptor R/Bioconductor package to run the RNA velocity analysis. This package provides a wrapper around the scVelo python package, and lets you run most of the essential analysis steps directly from your R session.

However, in other situations you may want to use scVelo or other python packages directly, despite performing the initial analysis in R.

A large number of tools from the python-based single-cell ecosystem are based around the so callled AnnData format, which is a type of python object that can hold largely similar types of information as SingleCellExperiment or Seurat objects. In order to use our preprocessed data in such python tools, we first need to convert our SingleCellExperiment to AnnData format.

The zellkonverter package provides a convenient way to convert between these two formats. To illustrate how it works, let us convert the SingleCellExperiment object to AnnData format and save it to a file that can be directly read into python.

suppressPackageStartupMessages({
    library(zellkonverter)
})

zellkonverter::writeH5AD(hermann, file = "Hermann_AnnData.h5ad",
                         X_name = "counts")

In a python session, we could read this object back and continue with any downstream analysis. More information about the AnnData format can be found in AnnData documentation.

zellkonverter can also be used to read an .h5ad file (e.g., that was saved from python) into an R session:

sce <- zellkonverter::readH5AD("Hermann_AnnData.h5ad")

Aknowledgement

This document is an adaptation of C. Soneson’s previous work for the school sincellTE 2022. You can consult the complete and very detailed course and practical session at this address: https://csoneson.github.io/sincellTE2022/

References and documentation

scVelo: https://scvelo.readthedocs.io/en/stable/index.html

velociraptor: https://www.bioconductor.org/packages/release/bioc/vignettes/velociraptor/inst/doc/velociraptor.html

Bergen, Volker, Marius Lange, Stefan Peidli, F Alexander Wolf, and Fabian J Theis. 2020. “Generalizing RNA Velocity to Transient Cell States Through Dynamical Modeling.” Nature Biotechnology 38: 1408–14.

Hermann, Brian P, Keren Cheng, Anukriti Singh, Lorena Roa-De La Cruz, Kazadi N Mutoji, I-Chung Chen, Heidi Gildersleeve, et al. 2018. “The Mammalian Spermatogenesis Single-Cell Transcriptome, from Spermatogonial Stem Cells to Spermatids.” Cell Rep. 25: 1650–1667.e8.

Session info

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] viridis_0.6.5               viridisLite_0.4.2          
 [3] pheatmap_1.0.12             velociraptor_1.14.3        
 [5] scater_1.32.1               scuttle_1.14.0             
 [7] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
 [9] Biobase_2.64.0              GenomicRanges_1.56.1       
[11] GenomeInfoDb_1.40.1         IRanges_2.38.1             
[13] S4Vectors_0.42.1            BiocGenerics_0.50.0        
[15] MatrixGenerics_1.16.0       matrixStats_1.4.1          
[17] lubridate_1.9.3             forcats_1.0.0              
[19] stringr_1.5.1               dplyr_1.1.4                
[21] purrr_1.0.2                 readr_2.1.5                
[23] tidyr_1.3.1                 tibble_3.2.1               
[25] ggplot2_3.5.1               tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1          farver_2.1.2             
 [3] vipor_0.4.7               filelock_1.0.3           
 [5] fastmap_1.2.0             digest_0.6.37            
 [7] rsvd_1.0.5                timechange_0.3.0         
 [9] lifecycle_1.0.4           magrittr_2.0.3           
[11] compiler_4.4.1            rlang_1.1.4              
[13] tools_4.4.1               utf8_1.2.4               
[15] yaml_2.3.10               knitr_1.48               
[17] labeling_0.4.3            S4Arrays_1.4.1           
[19] htmlwidgets_1.6.4         reticulate_1.39.0        
[21] DelayedArray_0.30.1       RColorBrewer_1.1-3       
[23] abind_1.4-8               BiocParallel_1.38.0      
[25] zellkonverter_1.14.1      withr_3.0.1              
[27] grid_4.4.1                fansi_1.0.6              
[29] beachmat_2.20.0           colorspace_2.1-1         
[31] scales_1.3.0              cli_3.6.3                
[33] rmarkdown_2.28            crayon_1.5.3             
[35] generics_0.1.3            rstudioapi_0.16.0        
[37] httr_1.4.7                tzdb_0.4.0               
[39] DelayedMatrixStats_1.26.0 ggbeeswarm_0.7.2         
[41] zlibbioc_1.50.0           parallel_4.4.1           
[43] XVector_0.44.0            basilisk_1.16.0          
[45] vctrs_0.6.5               Matrix_1.7-0             
[47] dir.expiry_1.12.0         jsonlite_1.8.9           
[49] BiocSingular_1.20.0       patchwork_1.3.0          
[51] hms_1.1.3                 BiocNeighbors_1.22.0     
[53] ggrepel_0.9.6             irlba_2.3.5.1            
[55] beeswarm_0.4.0            glue_1.8.0               
[57] codetools_0.2-20          cowplot_1.1.3            
[59] stringi_1.8.4             gtable_0.3.5             
[61] UCSC.utils_1.0.0          ScaledMatrix_1.12.0      
[63] munsell_0.5.1             pillar_1.9.0             
[65] basilisk.utils_1.16.0     htmltools_0.5.8.1        
[67] GenomeInfoDbData_1.2.12   R6_2.5.1                 
[69] sparseMatrixStats_1.16.0  evaluate_1.0.0           
[71] lattice_0.22-6            png_0.1-8                
[73] Rcpp_1.0.13               gridExtra_2.3            
[75] SparseArray_1.4.8         xfun_0.48                
[77] pkgconfig_2.0.3