1 PREAMBLE

1.1 Purpose of this session

This file describes the different steps to perform the first part of the single cell RNAseq data analysis training course for the EBAII n1 2024, covering these steps :

  • Load a single cell raw counts matrix

  • Remove barcodes from empty droplets

  • Esimate and remove ambient RNA contamination (“soup”)

1.2 An important notice

  • These early (but mandatory) steps of the analysis are not covered by the renowned, higher-level analysis frameworks like Seurat nor scran/scater.

  • Thus, we will write/use a certain amount of lines of code, some of which of a higher complexity for R beginners.

  • Newcomer, do not fear !

    • While you are welcome to try writing some by yourself, depending on your R knowledge and self-confidence in it …

    • … you do not have to know how these code work …

    • … thus you should not feel ashamed copy-pasting the pre-written code hereafter …

    • … but in both cases, please try your best understanding what these code do, and even more, why !

    • In this purpose, you may use whatever help you can by searching.

1.3 What to run ?

  • This training presentation, written in Rmarkdown, contains many chunks ( = code blocks)

  • You do not have to run them all !

  • Chunks follow a color code :

  • Chunks you are expected to run :

    ## Example of chunk to run
    getwd()

    Show output

    [1] "/shared/projects/2422_ebaii_n1/atelier_scrnaseq/RMD/Preproc.1"


  • Chunks you are NOT expected to run (we will demo them). You may run them but if you are late afterwards, you’ll be the one to blame !

    ## Example of chunk to NOT run
    getwd()

    Show output

    [1] "/shared/projects/2422_ebaii_n1/atelier_scrnaseq/RMD/Preproc.1"


  • Additional exercises / questions beyond the scope of the training course, you may try if you feel bored

    ## Example of chunk to play with but beyond the scope
    getwd()

    Show output

    [1] "/shared/projects/2422_ebaii_n1/atelier_scrnaseq/RMD/Preproc.1"


  • Questions we would like you to answer during the training course

    What's the Answer to Life ? The Universe ? Everything ??


  • Answers to these

    cat("The Answer to the Great Question... is ... 
    Forty-two. Six by nine : forty-two. That's it. 
    That's all there is. (I always thought something was 
    fundamentally wrong with the universe.)")

    Show output

    The Answer to the Great Question... is ... 
    Forty-two. Six by nine : forty-two. That's it. 
    That's all there is. (I always thought something was 
    fundamentally wrong with the universe.)


2 Start Rstudio



3 Warm-up

  • We set common parameters we will use throughout this session :
## Set your project name
project_name <- "ebaii_sc_teachers"  # Do not copy-paste this ! It's MY project ! Put yours !!

## Seed for the RNG
my_seed <- 1337L

## Empty droplets max p-value
max_p <- 1E-03


4 Prepare the data structure

  • During the current and remaining training sessions, we will work on different data objects, some retrieved from external resources, some we will create.

  • In order to reuse these for further steps, or other sessions, we need to store them, thus create a directory structure on the cluster.

  • We will create this structure in your project folder (the one you registered at the IFB Core, in which you stored your own data for the tutoring).

  • To ensure that you will always use the same structure for your work, the code hereafter has been writen so that you just have to define a single variable that corresponds to your project name.

    • In my case, as a teacher for the SC training, it is ebaii_sc_teachers.

4.1 Main directory

  • Create the training course directory
## Prepare the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")

## Create the root directory
dir.create(path = TD_dir, recursive = TRUE)

## Print the root directory on-screen
print(TD_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD"

4.2 Current session

## Create the session (Preproc.1) directory
session_dir <- paste0(TD_dir, "/01_Preproc.1")
dir.create(path = session_dir, recursive = TRUE)

## Print the session directory on-screen
print(session_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/01_Preproc.1"

4.3 Input data directory

## Create the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)

## Print the input directory on-screen
print(input_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/01_Preproc.1/DATA"

4.4 Output results directory

## Create the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)

## Print the output directory on-screen
print(output_dir)
[1] "/shared/projects/ebaii_sc_teachers/SC_TD/01_Preproc.1/RESULTS"



Note : Actually, for this session we won’t export any result, so creating this RESULTS folder was just for practice.


Now, the we have everything needed to :

  • load data (input_dir)

  • work on it (in memory)

  • save the results we will generate (output_dir).



5 Load a 10X Cell Ranger raw count matrix

5.1 The dataset

  • The data we will work on is hosted on Zenodo. To ease its retrieval, we can define its ID as a variable

    ## This is actually to help me updating this training without hassle
    zen_id <- "14034221"
  • Here will we train ourselves to load into R a single cell RNAseq data produced by 10X’s Cell Ranger.

  • We will work with a public dataset provided by the manufacturer, that consists in ~ 10,000 PBMC (peripheral bone marrow cells) from a human donor (reduced to 1/5th).

  • The experiment was performed with the 3’ capture kit v3.

  • The analysis was performed with Cell Ranger v3, mapping on the GRCh38-2020-A manufacturer reference.

  • The data consists into the typical 10x 3-files structure, hosted in a Zenodo respository (Id : 14034221)

    • barcodes.tsv.gz : the list of putative droplet barcodes
    • features.tsv.gz : the list of features (genes)
    • matrix.mtx.gz : the feature x barcode expression count matrix

5.2 Download data

  • The code chunk here after is written to :

    • let you download the data files from

      • Zenodo by default

      • a local backup if backup is set to TRUE (if we loose the internet connection)

    • if the desired data is already available locally :

      • load this rather than download

      • force a new-download if force is set to TRUE

### Named files (will be used later on !)
mtx_file <- "matrix.mtx.gz"
features_file <- "features.tsv.gz"
barcodes_file <- "barcodes.tsv.gz"
summary_file <- "pbmc_10k_v3_summary.html"

## Filename(s) to retrieve
toget_files <- c(mtx_file,
                 features_file,
                 barcodes_file,
                 summary_file)

## Folder to store retrieved files
local_folder <- input_dir

## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")

## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")

### Define remote folder
remote_folder <- if (backup) "/shared/projects/2422_ebaii_n1/atelier_scrnaseq/TD/BACKUP/10X/" else paste0("https://zenodo.org/records/", zen_id, "/files/")

### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)

### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)

## Retrieve files (if they don't exist), in loop
for (tg in seq_along(toget_files)) {
  ## If the file does not locally exist
  if (!file.exists(local_path[tg]) | force) {
    ## Retrieve data
    if(backup) {
      file.copy(from = remote_path[tg],
                to = local_path[tg])
    } else {
      download.file(url = remote_path[tg], 
                    destfile = local_path[tg])
    }
    ## Check if downloaded files exist locally
    if(file.exists(local_path[tg])) message("\tOK")
  } else message(paste0(toget_files[tg], " already downloaded !"))
}
matrix.mtx.gz already downloaded !
features.tsv.gz already downloaded !
barcodes.tsv.gz already downloaded !
pbmc_10k_v3_summary.html already downloaded !

5.3 Load into R

  • Load the 10X data files into R :

    How ?
    • A good starter
    ## https://lmddgtfy.net/?q=load%2010x%20data%20in%20R
    • Back to your local source
    ## Reading the function help page
    ?Seurat::Read10X




    ## Loading unfiltered 10X matrix
    scmat <- Seurat::Read10X(data.dir = input_dir)


  • However, we don’t know what the loaded object actually looks like in R…

  • Questions :

    • Do you know a way to …
    ... know what IS the type of object that was created ?


    ## To know the type of an R object : methods::is()
    ?methods::is
    ## Get the type
    methods::is(scmat)

    Show output

    [1] "dgCMatrix"     "CsparseMatrix" "dsparseMatrix" "generalMatrix"
    [5] "AnyMatrix"     "V3Matrix"      "dMatrix"       "sparseMatrix" 
    [9] "Matrix"       



    ... observe the STRucture of the resulting object ?


    ## To get a basic structure description : utils::str()
    ?utils::str
    ## Get the basic structure
    utils::str(scmat)

    Show output

    Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
      ..@ i       : int [1:1718404] 7922 18295 23126 24791 31764 32866 13690 17058 2511 33306 ...
      ..@ p       : int [1:147457] 0 0 0 0 0 0 6 7 8 8 ...
      ..@ Dim     : int [1:2] 33694 147456
      ..@ Dimnames:List of 2
      .. ..$ : chr [1:33694] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7" ...
      .. ..$ : chr [1:147456] "TCCCGATAGCCTTGAT-1" "AGATCTGAGACTGTAA-1" "GCTTGAAGTGCGAAAC-1" "ATCATGGGTATTCGTG-1" ...
      ..@ x       : num [1:1718404] 1 1 1 1 1 1 1 1 1 1 ...
      ..@ factors : list()


5.4 Features cleanup

Seurat does not like special underscores (_) in feature names …

We will have to :

  • control if our dataset contains some ?

    ## Check feature names with underscore(s)
    ### Here we use base::grep that allows to look for character strings (x) 
    ### containing a defined set of character(s) (pattern), returning their 
    ### position (value = FALSE) or value (value = TRUE)
    base::grep(pattern = '_', 
         x = rownames(scmat), 
         value = TRUE)

    Show output

     [1] "RP11-442N24__B.1" "RP11-99J16__A.2"  "RP11-59D5__B.2"   "RP11-445L13__B.3"
     [5] "RP11-544L8__B.4"  "XXyac-YX65C7_A.2" "XXyac-YX65C7_A.3" "RP11-524D16__A.3"
     [9] "XX-DJ76P10__A.2"  "RP11-1157N2__B.2" "RP1-213J1P__B.1"  "RP1-213J1P__B.2" 
    [13] "RP4-633O19__A.1"  "RP4-754E20__A.5"  "CTA-280A3__B.2"  


  • if so, clean it

    ## Clean features : replacing underscores by dashes
    base::rownames(scmat) <- base::gsub(
      pattern = "_", 
      replacement = "-", 
      x = rownames(scmat))
  • control the efficiency of our cleaning

    ## Post-cleanup control
    grep(pattern = "_", 
         x = rownames(scmat), 
         value = TRUE)

    Show output

    character(0)




6 Empty droplets

  • While a large part of our barcodes contain no count at all, barcodes with at least one total count may not mandatorily correspond to a real cell, due to ambient RNA.

  • Here we want to identify empty droplets, then remove them from the matrix.

6.1 Detection

  • To detect the margins of “true” cells versus empty droplets, we will rely on the kneeplot : a plot of UMI counts per barcode, in function of their increasingly ordered rank.

  • Thus, to perform the kneeplot, we need to rank the barcodes thanks to their total counts.

6.1.1 Ranking barcodes

  • We will use the DropletUtils package

    ## Read the function help page
    ?DropletUtils::barcodeRanks
    ## Generate the rank statistics
    bc_rank <- DropletUtils::barcodeRanks(scmat)
  • Check the ranking

    ## Show a summary of the ranks (ordered for display convenience)
    print(bc_rank[order(bc_rank$rank),])
    DataFrame with 147456 rows and 3 columns
                            rank     total    fitted
                       <numeric> <integer> <numeric>
    ATCATGGCAGCCAATT-1         1     21195        NA
    GGGCATCGTAGCTTGT-1         2     18193        NA
    AGTGGGATCTTAACCT-1         3     16966        NA
    GTCGGGTGTGCAGTAG-1         4     16383        NA
    TTAGGCAAGCCGCCTA-1         5     15406        NA
    ...                      ...       ...       ...
    TTGAACGTCAGAGGTG-1    100998         0        NA
    ACGATGTAGTTGTCGT-1    100998         0        NA
    GTCAAGTTCAGTTGAC-1    100998         0        NA
    TCGCGAGGTTGCGCAC-1    100998         0        NA
    TATCTCAGTTACCAGT-1    100998         0        NA


6.1.2 Kneeplot

Now, we can draw our kneeplot :

## Plot
graphics::plot(bc_rank$rank, 
               bc_rank$total + 1, 
               log = "xy", 
               xlab = "Barcode rank", 
               ylab = "Total UMIs", 
               col = "black", 
               pch = ".", 
               cex = 5, 
               main = "Kneeplot")



  • Question :

    Looking at the kneeplot, can you roughly predict how many barcodes will be kept as cells (ie, as non-empty droplets) ?


    ## . The "cliff" in the kneeplot is at a 
    ##  rank value close to ~ 1,000 counts.

Then, we can use these ranks to identify “true” cells :

## Read the function help page
?DropletUtils::emptyDrops
## Set the RNG seed
set.seed(my_seed)

## Identify empty droplets
bc_rank2 <- DropletUtils::emptyDrops(scmat)

We can show a summary of ordered barcodes (qualified on top) :

## Show a summary of the qualified ranks (ordered for display convenience)
print(bc_rank2[order(bc_rank2$Total, decreasing = TRUE),])
Show output
DataFrame with 147456 rows and 5 columns
                       Total   LogProb    PValue   Limited       FDR
                   <integer> <numeric> <numeric> <logical> <numeric>
ATCATGGCAGCCAATT-1     21195  -13101.3 9.999e-05      TRUE         0
GGGCATCGTAGCTTGT-1     18193  -12055.5 9.999e-05      TRUE         0
AGTGGGATCTTAACCT-1     16966  -11594.8 9.999e-05      TRUE         0
GTCGGGTGTGCAGTAG-1     16383  -11994.6 9.999e-05      TRUE         0
TTAGGCAAGCCGCCTA-1     15406  -10687.9 9.999e-05      TRUE         0
...                      ...       ...       ...       ...       ...
TTGAACGTCAGAGGTG-1         0        NA        NA        NA        NA
ACGATGTAGTTGTCGT-1         0        NA        NA        NA        NA
GTCAAGTTCAGTTGAC-1         0        NA        NA        NA        NA
TCGCGAGGTTGCGCAC-1         0        NA        NA        NA        NA
TATCTCAGTTACCAGT-1         0        NA        NA        NA        NA


6.1.3 Selection

Barcodes considered as empty have no p-value (FDR = NA)

## Creating a validation column (empty droplets have NA as p-value)
bc_rank2$VALID <- !is.na(bc_rank2$FDR) 

## Quantify "real cells"
table(bc_rank2$VALID)
Show output

 FALSE   TRUE 
146397   1059 


We can control the stringency by setting a FDR-adjusted p-value threshold

## We've set the max p-value to 1E-03
bc_rank2$VALID[bc_rank2$FDR >= max_p] <- FALSE

## Control
table(bc_rank2$VALID)
Show output

 FALSE   TRUE 
146581    875 


Now, we can display our true cells on the kneeplot :

## Plot
graphics::plot(bc_rank$rank, 
               bc_rank$total+1, 
               log = "xy", 
               xlab = "Barcode rank", 
               ylab = "Total UMIs", 
               col = ifelse(bc_rank2$VALID, "red", "black"), 
               pch = ".", 
               cex = ifelse(bc_rank2$VALID, 7, 3), 
               main = paste0("Kneeplot (", 
                             length(which(bc_rank2$VALID)), 
                             " barcodes kept)"))
Show plot


  • Question :

    Is there something unexpected for the retained "TRUE" cells in this kneeplot ?


    ## . You may observe that the selection of "true" cells (red)
    ##   does not strictly correspond to a "cut" in the kneeplot
    ##   descending curve.
    ##
    ## . This is because emptyDrops does not only "follow" the curve
    ##   to determine a threshold corresponding to a "minimal count" 
    ##   value to consider a barcode as a cell, but also performs a 
    ##   statistical analysis for each barcode, considering how much 
    ##   its expression profile ressembles the one of other cells, 
    ##   even with a very different (lower) global level of expression.



6.2 Removal

We just need to restrict our count matrix to “true” cells.

  • How many do we have at the moment ?

    dim(scmat)

    Show output

    [1]  33694 147456


  • Filter empty droplets out :

    ## Restrict to valid barcodes
    scmat_cells <- scmat[, bc_rank2$VALID]
    
    ## Control
    dim(scmat_cells)

    Show output

    [1] 33694   875




7 Ambient RNA

  • The counts measured in this matrix of (now considered “true”) single cells does not reflects the measurement of these cells expression only

  • Due to other over-lysed / dying cells in the emulsion, we also measured fraction of ambient RNA that adds up to the “real” expression

  • We may estimate it by observing counts existing at a minimal level across cells that greatly differ in their expression profile

7.1 Estimate the “soup”

SoupX can estimate the rate and composition of ambient RNA, in several ways :

  • Manual mode : providing a list of features expected to reflect the soup (genes expressed at high levels in a majority of the expected cell types)

  • Automatic modes, using the empty droplets-filtered raw count matrix …

    • … along with the unfiltered matrix (ie, containing all quantified droplets). This is the most efficient mode.

    • solely : this is less efficient, but useful when the unfiltered matrix is not available.

## Reading the function help page
?SC.helper::SoupX_auto

Let’s run SoupX :

## In case of error :
# install.packages("irlba", type="source", force=TRUE)

## SoupX
soup_frac <- SC.helper::SoupX_auto(
  scmat_filt = scmat_cells, 
  scmat_raw = scmat)
Show output

Soup-contributing features (Top 20) :


|       |       est| counts|
|:------|---------:|------:|
|MALAT1 | 0.0321703|  17473|
|B2M    | 0.0194002|  10537|
|TMSB4X | 0.0165850|   9008|
|EEF1A1 | 0.0129267|   7021|
|RPL21  | 0.0103251|   5608|
|RPS27  | 0.0100269|   5446|
|RPL13  | 0.0091542|   4972|
|RPL13A | 0.0087252|   4739|
|RPL10  | 0.0082852|   4500|
|RPLP1  | 0.0079225|   4303|
|RPL34  | 0.0078801|   4280|
|RPS12  | 0.0076481|   4154|
|RPS6   | 0.0075211|   4085|
|RPS18  | 0.0074530|   4048|
|RPS27A | 0.0073093|   3970|
|RPS2   | 0.0072965|   3963|
|RPS3A  | 0.0072909|   3960|
|RPL32  | 0.0072744|   3951|
|RPL41  | 0.0067865|   3686|
|RPL11  | 0.0063446|   3446|


NOTE : We can observe MALAT1 as the top first gene.


What’s the estimated fraction of soup in our counts ?

cat("Soup fraction : ", soup_frac)
Show output
Soup fraction :  0.054


  • Question :

    Should we try to remove such an amount of ambient RNA ?


    The best way to know is to try it out !



7.2 Remove the “soup”

SoupX can remove the soup (subtract counts for identified soup genes to all barcodes/cells) using different methods :

  • Automatically (safer)

  • Specifying a fraction of counts to remove.

    • Pros :

      • By removing X % of reads :

        • if soup was effectively present (at a rate <= X), this will mostly affect soup genes first.

        • If not, it will decrease counts globally, thus have negligible effect in differences between cells / cell types.

    • Cons :

      • Counts are already rare in droplet-based single cell technologies… Removing them “by default” makes them even more rare

      • If actual soup rate is higher than the defined rate X to remove, this mode will remain uneffective

## Run SoupX in "removal" mode
scmat_unsoup <- SC.helper::SoupX_auto(
  scmat_filt = scmat_cells, 
  scmat_raw = scmat, 
  return_object = TRUE)
Show output
Counts BEFORE SoupX :  3766216 
Show output
Counts AFTER SoupX :  3563040 
## We can remove the barcode matrix to free some RAM
rm(scmat)


7.3 Visualize the “soup” removal

  • Now, we want to characterize the effect of the removal of the ambient RNA, by comparing the pre- and post- soupX matrices.

  • In this purpose, we will construct a Seurat object from each of these matrices, that we will describe, then represent graphically.

  • But how to create a Seurat object from a count matrix ?

## Reading the function help page
?Seurat::CreateSeuratObject




7.3.1 Create and describe the objects

7.3.1.1 Before SoupX

  • Create the Seurat object
## Create the Seurat object for unsouped cells
sobj <- Seurat::CreateSeuratObject(
  counts = scmat_cells, 
  project = "PBMC10K_BeforeSoupX")
  • Describe

    • We will use a function prepared for you : SC.helper::SeuratObject_descriptor
## Descibe the obtained Seurat object
SC.helper::SeuratObject_descriptor(sobj = sobj)
Show output
OBJECT VERSION :    5.0.2 
PROJECT :   [PBMC10K_BeforeSoupX] 

[ASSAYS]
   ASSAY 1 :    [RNA] [ACTIVE] 
      SLOT/LAYER 1 :    [counts]    Dims:[33694 x 875]  Range:[0.00-724.00] 
         Counts :   3766216 
         Sparsity : 96.14275% 
Show output
      SLOT/LAYER 2 :    [data]  Dims:[0 x 0] 
         Sparsity : NaN% 
Show output
      SLOT/LAYER 3 :    [scale.data]    Dims:[0 x 0] 

[DIMREDS]

[BARCODES METADATA]
orig.ident             Freq
--------------------  -----
PBMC10K_BeforeSoupX     875
NA                        0

 nCount_RNA 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    116    3069    3957    4304    5014   21195 

 nFeature_RNA 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     89    1052    1242    1300    1484    3752 


7.3.1.2 After SoupX

  • Create the Seurat object
## Create the Seurat object for unsouped cells
sobj_unsoup <- Seurat::CreateSeuratObject(
  counts = scmat_unsoup, 
  project = "PBMC10K_AfterSoupX")
  • Describe
## Describe the new object content
SC.helper::SeuratObject_descriptor(sobj = sobj_unsoup)
Show output
OBJECT VERSION :    5.0.2 
PROJECT :   [PBMC10K_AfterSoupX] 

[ASSAYS]
   ASSAY 1 :    [RNA] [ACTIVE] 
      SLOT/LAYER 1 :    [counts]    Dims:[33694 x 875]  Range:[0.00-699.00] 
         Counts :   3563040 
         Sparsity : 96.35223% 
Show output
      SLOT/LAYER 2 :    [data]  Dims:[0 x 0] 
         Sparsity : NaN% 
Show output
      SLOT/LAYER 3 :    [scale.data]    Dims:[0 x 0] 

[DIMREDS]

[BARCODES METADATA]
orig.ident            Freq
-------------------  -----
PBMC10K_AfterSoupX     875
NA                       0

 nCount_RNA 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    113    2912    3735    4072    4754   19797 

 nFeature_RNA 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     88    1002    1175    1229    1400    3435 





  • Questions :

    Can you describe the difference (especially for counts and sparsity) ?
    
    Is it expected ?


    ## . Total counts, max value, counts per cell and number of
    ##   expressed features per cell, all have decreased.
    ##
    ## . The sparsity level has slightly increased (removing the soup
    ##   obliterated all counts for some features in some cells).
    ##
    ## . All of this is absolutely expected.


What are the features most affected by the soup removal ?

## Compute the fraction matrix (PRE / POST)
##  NOTE : we have 0 counts in the divider, 
##  so we increment both matrix by +1 !
cont_frac <- (scmat_cells + 1) / (scmat_unsoup + 1)

## Fraction of counts decreased by soup removal, per feature (ordered)
feat_frac <- sort(sparseMatrixStats::rowMeans2(cont_frac), decreasing = TRUE)

### Display Top 5 features
utils::head(feat_frac, decreasing = TRUE)
Show output
     LYZ   S100A9   S100A8  HLA-DRA     CD74     CST3 
1.478310 1.453317 1.384649 1.343360 1.279586 1.204137 
## Removing objects to free some RAM
rm(cont_frac, feat_frac)


## Remove unused objects to free some RAM
rm(scmat_cells, scmat_unsoup)


7.3.2 Plot the “top 1” feature

  • We coded a small R function to help you visualize in 2D your data, even at such a very early step of the analysis.

  • This function will analyze your data (as a Seurat object) in the shadows with default values, and generate a 2D representation called UMAP.

  • As default values are used, they should fit most cases. However, do not fully trust by default its results as a pure ground truth, as your data may not fit these defaults… That’s why the function is nammed “QnD_viz”, as in “Quick and dirty visualization”.

## Reading the function help page
?SC.helper::QnD_viz

7.3.2.1 Before SoupX

## Top1 feature 
umap_pre <- SC.helper::QnD_viz(
  sobj = sobj, 
  features = "LYZ",
  return_plot = TRUE)
Show plot


7.3.2.2 After SoupX

## Top1 feature 
umap_post <- SC.helper::QnD_viz(
  sobj = sobj_unsoup, 
  features = "LYZ",
  return_plot = TRUE)
Show plot




Merging plots for ease of use :

## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
  list(
    umap_pre & ggplot2::ggtitle(label = "LYZ before SoupX"), 
    umap_post & ggplot2::ggtitle(label = "LYZ adter SoupX")), 
  nrow = 1)
Show plot


  • Question

    Can you compare the two plots :
      - for the LYZ expression globally ?
      - for the LYZ expression across the 2-D space ?
      - about the space topologies ?


    ## . The level expression measured in cells with smaller expression
    ##   before SoupX has gone to almost none after : the ambient expression
    ##   of this gene has efficiently been removed.
    ##
    ## . Hopefully, the cluster(s) with higher expression remain(s) high.
    ##
    ## . The expression scale upper bound after SoupX is higher than before ?! 
    ##   But we REMOVED some counts ?! While being counter-intuitive, this is
    ##   a positive consequence of the ambient removal on the scaling of
    ##   barcodes expression (see later). 
    ##
    ## . The global topology of clusters remains close (but not identical).







Beyond :

  1. Save the sobj R object on disk.
## A single object : use RDS format with saveRDS()
saveRDS(object = sobj, file = paste0(output_dir, "/sobj.RDS"))
  1. Save both the sobj and sobj_unsoup objects in a single archive on disk.
## Multiple objects : use Rdata format with save()
save(list = c(sobj, sobj_unsoup), file = paste0(output_dir, "/sobjects.RDA"))
  1. Same as 1. but use the “bzip2” compression algorithm
## Compress with bzip2
saveRDS(object = sobj, file = paste0(output_dir, "/sobj.RDS"), compress = "bzip2")







8 Rsession

utils::sessionInfo()
Show output
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
  [1] matrixStats_1.4.1           spatstat.sparse_3.1-0      
  [3] SC.helper_0.0.6             httr_1.4.7                 
  [5] RColorBrewer_1.1-3          doParallel_1.0.17          
  [7] alabaster.base_1.4.1        tools_4.4.1                
  [9] sctransform_0.4.1           backports_1.5.0            
 [11] utf8_1.2.4                  R6_2.5.1                   
 [13] HDF5Array_1.32.0            lazyeval_0.2.2             
 [15] uwot_0.2.2                  rhdf5filters_1.16.0        
 [17] GetoptLong_1.0.5            withr_3.0.1                
 [19] sp_2.1-4                    gridExtra_2.3              
 [21] progressr_0.14.0            cli_3.6.3                  
 [23] Biobase_2.64.0              spatstat.explore_3.3-2     
 [25] fastDummies_1.7.4           labeling_0.4.3             
 [27] alabaster.se_1.4.1          sass_0.4.9                 
 [29] Seurat_5.1.0                spatstat.data_3.1-2        
 [31] ggridges_0.5.6              pbapply_1.7-2              
 [33] foreign_0.8-86              R.utils_2.12.3             
 [35] parallelly_1.38.0           limma_3.60.6               
 [37] rstudioapi_0.17.0           RSQLite_2.3.7              
 [39] generics_0.1.3              shape_1.4.6.1              
 [41] ica_1.0-3                   spatstat.random_3.3-2      
 [43] dplyr_1.1.4                 Matrix_1.7-1               
 [45] fansi_1.0.6                 S4Vectors_0.42.1           
 [47] abind_1.4-8                 R.methodsS3_1.8.2          
 [49] lifecycle_1.0.4             SoupX_1.6.2                
 [51] yaml_2.3.10                 edgeR_4.2.2                
 [53] SummarizedExperiment_1.34.0 BiocFileCache_2.12.0       
 [55] rhdf5_2.48.0                SparseArray_1.4.8          
 [57] Rtsne_0.17                  grid_4.4.1                 
 [59] blob_1.2.4                  promises_1.3.0             
 [61] dqrng_0.4.1                 ExperimentHub_2.12.0       
 [63] crayon_1.5.3                miniUI_0.1.1.1             
 [65] lattice_0.22-6              beachmat_2.20.0            
 [67] cowplot_1.1.3               KEGGREST_1.44.0            
 [69] pillar_1.9.0                knitr_1.48                 
 [71] ComplexHeatmap_2.20.0       metapod_1.12.0             
 [73] GenomicRanges_1.56.2        rjson_0.2.21               
 [75] future.apply_1.11.2         codetools_0.2-20           
 [77] leiden_0.4.3.1              glue_1.8.0                 
 [79] spatstat.univar_3.0-1       data.table_1.16.2          
 [81] gypsum_1.0.1                vctrs_0.6.5                
 [83] png_0.1-8                   spam_2.11-0                
 [85] gtable_0.3.5                cachem_1.1.0               
 [87] xfun_0.48                   S4Arrays_1.4.1             
 [89] mime_0.12                   DropletUtils_1.24.0        
 [91] survival_3.7-0              SingleCellExperiment_1.26.0
 [93] iterators_1.0.14            statmod_1.5.0              
 [95] bluster_1.14.0              fitdistrplus_1.2-1         
 [97] ROCR_1.0-11                 nlme_3.1-165               
 [99] bit64_4.5.2                 alabaster.ranges_1.4.1     
[101] filelock_1.0.3              RcppAnnoy_0.0.22           
[103] GenomeInfoDb_1.40.1         bslib_0.8.0                
[105] irlba_2.3.5.1               KernSmooth_2.23-24         
[107] rpart_4.1.23                colorspace_2.1-1           
[109] BiocGenerics_0.50.0         DBI_1.2.3                  
[111] Hmisc_5.1-3                 celldex_1.14.0             
[113] nnet_7.3-19                 tidyselect_1.2.1           
[115] curl_5.2.3                  bit_4.5.0                  
[117] compiler_4.4.1              httr2_1.0.1                
[119] htmlTable_2.4.2             BiocNeighbors_1.22.0       
[121] DelayedArray_0.30.1         plotly_4.10.4              
[123] checkmate_2.3.1             scales_1.3.0               
[125] lmtest_0.9-40               rappdirs_0.3.3             
[127] stringr_1.5.1               digest_0.6.37              
[129] goftest_1.2-3               spatstat.utils_3.1-0       
[131] alabaster.matrix_1.4.1      rmarkdown_2.28             
[133] XVector_0.44.0              htmltools_0.5.8.1          
[135] pkgconfig_2.0.3             base64enc_0.1-3            
[137] SingleR_2.6.0               sparseMatrixStats_1.16.0   
[139] MatrixGenerics_1.16.0       dbplyr_2.5.0               
[141] highr_0.11                  fastmap_1.2.0              
[143] rlang_1.1.4                 GlobalOptions_0.1.2        
[145] htmlwidgets_1.6.4           UCSC.utils_1.0.0           
[147] shiny_1.9.1                 DelayedMatrixStats_1.26.0  
[149] farver_2.1.2                jquerylib_0.1.4            
[151] zoo_1.8-12                  jsonlite_1.8.9             
[153] BiocParallel_1.38.0         R.oo_1.26.0                
[155] BiocSingular_1.20.0         magrittr_2.0.3             
[157] Formula_1.2-5               scuttle_1.14.0             
[159] GenomeInfoDbData_1.2.12     dotCall64_1.2              
[161] patchwork_1.3.0             Rhdf5lib_1.26.0            
[163] munsell_0.5.1               Rcpp_1.0.13                
[165] reticulate_1.39.0           alabaster.schemas_1.4.0    
[167] stringi_1.8.4               zlibbioc_1.50.0            
[169] MASS_7.3-61                 AnnotationHub_3.12.0       
[171] plyr_1.8.9                  parallel_4.4.1             
[173] listenv_0.9.1               ggrepel_0.9.6              
[175] deldir_2.0-4                Biostrings_2.72.1          
[177] splines_4.4.1               tensor_1.5                 
[179] circlize_0.4.16             locfit_1.5-9.9             
[181] igraph_2.1.1                spatstat.geom_3.3-3        
[183] RcppHNSW_0.6.0              reshape2_1.4.4             
[185] stats4_4.4.1                ScaledMatrix_1.12.0        
[187] BiocVersion_3.19.1          evaluate_1.0.1             
[189] SeuratObject_5.0.2          BiocManager_1.30.25        
[191] scran_1.32.0                foreach_1.5.2              
[193] httpuv_1.6.15               RANN_2.6.2                 
[195] tidyr_1.3.1                 purrr_1.0.2                
[197] polyclip_1.10-7             future_1.34.0              
[199] clue_0.3-65                 scattermore_1.2            
[201] ggplot2_3.5.1               rsvd_1.0.5                 
[203] xtable_1.8-4                RSpectra_0.16-2            
[205] later_1.3.2                 viridisLite_0.4.2          
[207] tibble_3.2.1                memoise_2.0.1              
[209] AnnotationDbi_1.66.0        IRanges_2.38.1             
[211] cluster_2.1.6               globals_0.16.3             
---
title: "<CENTER>EBAII n1 2024 : SINGLE CELL ANALYSIS TRAINING<BR> <B>PREPROCESSING (I)</B><BR>Load a matrix, empty droplets and ambient RNA filtering</CENTER>"
date: "2024-11-17.22"
author: "EBAII n1 scRNAseq Team"
output:
  html_document: 
    background: black
    fig_height: 6
    fig_width: 8
    highlight: tango  ## Theme for the code chunks
    number_sections: true  ## Adds number to headers (sections)
    theme: flatly  ## CSS theme for the HTML page
    toc: true  ## Adds a table of content
    toc_float:  ## TOC options
      collapsed: true  ## By default, the TOC is folded
      smooth_scroll: true ## Smooth scroll of the HTML page
    self_contained: true ## Includes all plots/images within the HTML
    code_download: true ## Adds a button to download the Rmd
    code_folding: show
    thumbnails: false
    lightbox: true
    fig_caption: false
    gallery: true
    use_bookdown: true
always_allow_html: true ## Allow plain HTML code in the Rmd
editor_options: 
  markdown: 
    wrap: 72
---

<!-- Add the Roscoff banner -->

```{css, echo = FALSE}
body {
  background-image: url('ebaii_banner.png');
  background-repeat: no-repeat;
  background-size: 100%;
}

div {
  background-color: rgba(255, 255, 255, 0.35)   /* 35% opaque white */;
  padding: 0.25em;
}
```

<!-- Allows to hide the TOC by default, display it with a button, move it to the right or left of the page -->

`r Hmisc::hidingTOC(buttonLabel = 'Show TOC', hidden = TRUE, tocSide = 'left', buttonSide='left', posCollapse = 'margin', levels = 3)`

```{r knit_setup, echo = FALSE}
# options(width = 60);
knitr::opts_chunk$set(
  echo = TRUE,        # Print the code
  eval = TRUE,        # Run command lines
  message = FALSE,    # Print messages
  prompt = FALSE,     # Do not display prompt
  comment = NA,       # No comments on this section
  warning = FALSE,    # Display warnings
  tidy = FALSE,
  fig.align="center", 
  # results = 'hide',
  width = 100       # Number of characters per line
)
```

<!-- Hook to handle output folding -->

```{r knit_hook, echo = FALSE}
hooks = knitr::knit_hooks$get()
hook_foldable = function(type) {
  force(type)
  function(x, options) {
    res = hooks[[type]](x, options)
    
    if (isFALSE(options[[paste0("fold.", type)]])) return(res)
    
    paste0(
      "<details><summary>Show ", type, "</summary>\n\n",
      res,
      "\n\n</details>"
    )
  }
}
knitr::knit_hooks$set(
  output = hook_foldable("output"),
  plot = hook_foldable("plot")
)
```

<!-- CSS to color chunks and outputs -->

```{css, echo=FALSE}
.notrun {
  background-color: lightgrey !important;
  border: 3px solid black !important;
}
.notruno {
  background-color: lightgrey !important;
  color : black !important;
}
.question {
  background-color: aquamarine !important;
  color : black !important;
  border: 3px solid limegreen !important;
}
.questiono {
  background-color: aquamarine !important;
  color : black !important;
}
.answer {
  background-color: navajowhite !important;
  border: 3px solid brown !important;
}
.answero {
  background-color: navajowhite !important;
  color : black !important;
}
.beyond {
  background-color: violet !important;
  border: 3px solid purple !important;
}
.beyondo {
  background-color: violet !important;
  color : black !important;
}
```

------------------------------------------------------------------------

------------------------------------------------------------------------

# PREAMBLE

## Purpose of this session

This file describes the different steps to perform the **first part** of
the **single cell RNAseq** data analysis training course for the **EBAII
n1 2024**, covering these steps :

-   **Load** a single cell raw counts **matrix**

-   **Remove** barcodes from **empty droplets**

-   **Esimate** and remove **ambient RNA** contamination *("soup")*

## An important notice

<center>![](gyro.png)</center>

-   These early (but mandatory) steps of the analysis are not covered by
    the renowned, higher-level analysis frameworks like **Seurat** nor
    **scran/scater**.
    
-   Thus, we will write/use a certain amount of **lines of code**, some
    of which of a higher complexity for R beginners.
    
-   **Newcomer, do not fear** !

    -   While you are **welcome to try** writing some by yourself,
        depending on your R knowledge and self-confidence in it ...
        
    -   ... you do not have to know [**how**]{.underline} these code
        work ...
        
    -   ... thus you **should not feel ashamed** copy-pasting the
        pre-written code hereafter ...
        
    -   ... but in both cases, please try **your best understanding**
        [**what**]{.underline} these code do, and even more,
        [**why**]{.underline} !
        
    -   In this purpose, you may use whatever **help** you can **by**
        **searching**.

## What to run ?

-   This training presentation, written in Rmarkdown, contains **many** chunks ( = code blocks)

-   You **do not have to** run them all !

-   Chunks follow a **color code** :

  -   Chunks you are expected to run :

      ```{r chunk_run}
      ## Example of chunk to run
      getwd()
      ```
      
      <br>

  -   Chunks you are **NOT** expected to run (we will demo them). You may run them but if you are late afterwards, _you_'ll be the one to blame !

      ```{r chunk_demo, class.source="notrun", class.output="notruno"}
      ## Example of chunk to NOT run
      getwd()
      ```

      <br>

  -   **Additional** exercises / questions beyond the scope of the training 
  course, you may try if you feel bored

      ```{r chunk_beyond, class.source="beyond", class.output="beyondo"}
      ## Example of chunk to play with but beyond the scope
      getwd()
      ```

      <br>

  -   **Questions** we would like you to answer during the training course

      ```{r chunk_question, class.source="question", eval = FALSE}
      What's the Answer to Life ? The Universe ? Everything ??
      ```

      <br>

  -   **Answers** to these

      ```{r chunk_answer, class.source = c("fold-hide", "answer"), class.output="answero"}
      cat("The Answer to the Great Question... is ... 
      Forty-two. Six by nine : forty-two. That's it. 
      That's all there is. (I always thought something was 
      fundamentally wrong with the universe.)")
      ```

------------------------------------------------------------------------

------------------------------------------------------------------------

# Start Rstudio

-   Using the [OpenOnDemand cheat
    sheet](https://ifb-elixirfr.github.io/EBAII/2023/ebaiin1/SingleCell/2024_TD_OpenOnDemand.html){target="_blank"},
    connect to the [OpenOnDemand
    portal](https://ondemand.cluster.france-bioinformatique.fr){target="_blank"} and
    **create a Rstudio session** with the right resource requirements.

------------------------------------------------------------------------

------------------------------------------------------------------------

# Warm-up

-   We set **common parameters** we will use throughout this session :

```{r setparam}
## Set your project name
project_name <- "ebaii_sc_teachers"  # Do not copy-paste this ! It's MY project ! Put yours !!

## Seed for the RNG
my_seed <- 1337L

## Empty droplets max p-value
max_p <- 1E-03
```

------------------------------------------------------------------------

------------------------------------------------------------------------

# Prepare the data structure

-   During the current and remaining training sessions, we will work on
    different **data** **objects**, some retrieved from **external**
    resources, some we will **create**.
    
-   In order to reuse these for further steps, or other sessions, we
    need to **store** them, thus create a **directory structure** on the
    cluster.
    
-   We will create this structure in **your project** folder (the one
    you registered at the IFB Core, in which you stored your own data
    for the tutoring).
    
-   To ensure that *you* will always use the same structure for *your*
    work, the code hereafter has been writen so that you just have to
    define a single **variable** that corresponds to your project name.
    
    -   In *my* case, as a teacher for the SC training, it is
        **`ebaii_sc_teachers`**.

## Main directory

-   Create the training course directory

```{r maindir, fold.output = FALSE}
## Prepare the path
TD_dir <- paste0("/shared/projects/", project_name, "/SC_TD")

## Create the root directory
dir.create(path = TD_dir, recursive = TRUE)

## Print the root directory on-screen
print(TD_dir)
```

## Current session

```{r sessiondir, fold.output = FALSE}
## Create the session (Preproc.1) directory
session_dir <- paste0(TD_dir, "/01_Preproc.1")
dir.create(path = session_dir, recursive = TRUE)

## Print the session directory on-screen
print(session_dir)
```

## Input data directory

```{r indir, fold.output=FALSE}
## Create the INPUT data directory
input_dir <- paste0(session_dir, "/DATA")
dir.create(path = input_dir, recursive = TRUE)

## Print the input directory on-screen
print(input_dir)
```

## Output results directory

```{r outdir, fold.output=FALSE}
## Create the OUTPUT data directory
output_dir <- paste0(session_dir, "/RESULTS")
dir.create(path = output_dir, recursive = TRUE)

## Print the output directory on-screen
print(output_dir)
```

<br><br>

[**Note :**]{.underline} *Actually, for this session we won't export
any result, so creating this **`RESULTS`** folder was just for
practice.*

<br>

Now, the we have everything needed to :

-   **load** data (`input_dir`)

-   **work** on it _(in memory)_

-   **save** the results we will generate (`output_dir`).

------------------------------------------------------------------------

------------------------------------------------------------------------

# Load a 10X Cell Ranger raw count matrix

## The dataset

-   The data we will work on is hosted on [Zenodo](https://zenodo.org){target="_blank"}. To ease its retrieval, we can define its ID as a variable

    ```{r zid}
    ## This is actually to help me updating this training without hassle
    zen_id <- "14034221"
    ```

-   Here will we train ourselves to load into R a single cell RNAseq
    data produced by 10X's **Cell Ranger**.
    
-   We will work with a public dataset provided by the manufacturer,
    that consists in \~ **10,000 PBMC** (peripheral bone marrow cells)
    from a human donor (reduced to 1/5th).
    
-   The experiment was performed with the **3' capture kit v3**.

-   The analysis was performed with **Cell Ranger v3**, mapping on the
    **GRCh38-2020-A manufacturer** reference.
    
-   The data consists into the typical 10x 3-files structure, hosted in
    a Zenodo respository (Id :
    [`r zen_id`](https://zenodo.org/records/`r zen_id` "Zenodo Preproc.1"){target="_blank"})
    
    -   `barcodes.tsv.gz` : the list of putative droplet **barcodes**
    -   `features.tsv.gz` : the list of **features** (genes)
    -   `matrix.mtx.gz` : the **feature** x **barcode** expression count
        **matrix**

## Download data

-   The code chunk here after is written to :
    
    -   let you download the data files from 
    
        -   Zenodo by default
        
        -   a local backup if `backup` is set to `TRUE` (if we loose the internet connection)
        
    -   if the desired data is already available locally :
    
        -   load this rather than download
        
        -   force a new-download if `force` is set to `TRUE`
    
```{r dlzen10x, message = TRUE, echo = TRUE}
### Named files (will be used later on !)
mtx_file <- "matrix.mtx.gz"
features_file <- "features.tsv.gz"
barcodes_file <- "barcodes.tsv.gz"
summary_file <- "pbmc_10k_v3_summary.html"

## Filename(s) to retrieve
toget_files <- c(mtx_file,
                 features_file,
                 barcodes_file,
                 summary_file)

## Folder to store retrieved files
local_folder <- input_dir

## Use local backup ?
backup <- FALSE
if(backup) message("Using local backup !")

## Force download ?
force <- FALSE
if(force) message("Forcing (re)download !")

### Define remote folder
remote_folder <- if (backup) "/shared/projects/2422_ebaii_n1/atelier_scrnaseq/TD/BACKUP/10X/" else paste0("https://zenodo.org/records/", zen_id, "/files/")

### Reconstruct the input paths
remote_path <- paste0(remote_folder, "/", toget_files)

### Reconstruct the output paths
local_path <- paste0(local_folder, "/", toget_files)

## Retrieve files (if they don't exist), in loop
for (tg in seq_along(toget_files)) {
  ## If the file does not locally exist
  if (!file.exists(local_path[tg]) | force) {
    ## Retrieve data
    if(backup) {
      file.copy(from = remote_path[tg],
                to = local_path[tg])
    } else {
      download.file(url = remote_path[tg], 
                    destfile = local_path[tg])
    }
    ## Check if downloaded files exist locally
    if(file.exists(local_path[tg])) message("\tOK")
  } else message(paste0(toget_files[tg], " already downloaded !"))
}
```

## Load into R

-   **Load** the **10X** **data** files into R :

    ```{r q_Read10X, eval = FALSE, class.source="question", eval = FALSE}
    How ?
    ```

    -   A good starter
    
    ```{r a_Read10X1, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## https://lmddgtfy.net/?q=load%2010x%20data%20in%20R
    ```
    
    -   Back to your local source
    
    ```{r a_Read10X2, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## Reading the function help page
    ?Seurat::Read10X
    ```
    
    <br><br><br>
    
    ```{r load10x}
    ## Loading unfiltered 10X matrix
    scmat <- Seurat::Read10X(data.dir = input_dir)
    ```
    
    <br>

-   However, we don't know what the loaded object actually looks like in R...

-   **Questions** : 

    -   Do you know a way to ...
    
    ```{r q_is, class.source="question", eval = FALSE}
    ... know what IS the type of object that was created ?
    ```
    
    <br>

    ```{r a_is, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## To know the type of an R object : methods::is()
    ?methods::is
    ```
    
    ```{r gois, class.source = "fold-hide"}
    ## Get the type
    methods::is(scmat)
    ```
    
    <br><br>
    
    ```{r q_str, class.source="question", eval = FALSE}
    ... observe the STRucture of the resulting object ?
    ```
    
    <br>
    
    ```{r a_str, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## To get a basic structure description : utils::str()
    ?utils::str
    ```
    
    ```{r go str, class.source = "fold-hide"}
    ## Get the basic structure
    utils::str(scmat)
    ```

<br>

## Features cleanup

Seurat does not like special **underscores** (`_`) in feature names ...

We will have to :

-   **control** if our dataset contains some ?
    
    ```{r us_check1}
    ## Check feature names with underscore(s)
    ### Here we use base::grep that allows to look for character strings (x) 
    ### containing a defined set of character(s) (pattern), returning their 
    ### position (value = FALSE) or value (value = TRUE)
    base::grep(pattern = '_', 
         x = rownames(scmat), 
         value = TRUE)
    ```
    
    <br>
    
-   if so, **clean** it

    ```{r us_clean}
    ## Clean features : replacing underscores by dashes
    base::rownames(scmat) <- base::gsub(
      pattern = "_", 
      replacement = "-", 
      x = rownames(scmat))
    ```

-   control the efficiency of our cleaning
    
    ```{r us_chek2, class.source="notrun", class.output="notruno"}
    ## Post-cleanup control
    grep(pattern = "_", 
         x = rownames(scmat), 
         value = TRUE)
    ```

<br>

------------------------------------------------------------------------

------------------------------------------------------------------------


# Empty droplets

-   While a large part of our barcodes contain **no count** at all, barcodes with at least one total count **may not** mandatorily correspond to a **real cell**, due to ambient RNA.

-   Here we want to **identify** empty droplets, then **remove** them from the matrix.

## Detection

-   To detect the margins of "true" cells versus empty droplets, we will
rely on the **kneeplot** : a plot of **UMI counts per barcode**, in function of their
increasingly **ordered rank**.

-   Thus, to perform the kneeplot, we need to **rank the barcodes** 
thanks to their total counts.

### Ranking barcodes

-   We will use the [**DropletUtils**](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html){target="_blank"} package

    ```{r h_barcodeRanks, eval = FALSE}
    ## Read the function help page
    ?DropletUtils::barcodeRanks
    ```
    
    ```{r bcranks}
    ## Generate the rank statistics
    bc_rank <- DropletUtils::barcodeRanks(scmat)
    ```

-   Check the ranking

    ```{r bcrank_top, class.source="notrun", class.output="notruno", fold.output = FALSE}
    ## Show a summary of the ranks (ordered for display convenience)
    print(bc_rank[order(bc_rank$rank),])
    ```
    
<br>

### Kneeplot

Now, we can **draw** our kneeplot :

```{r kneeplot1, class.source="notrun", class.output="notruno", fold.plot = FALSE}
## Plot
graphics::plot(bc_rank$rank, 
               bc_rank$total + 1, 
               log = "xy", 
               xlab = "Barcode rank", 
               ylab = "Total UMIs", 
               col = "black", 
               pch = ".", 
               cex = 5, 
               main = "Kneeplot")
```

<br><br>

-   **Question** : 
    
    ```{r q_knee, class.source="question", eval = FALSE}
    Looking at the kneeplot, can you roughly predict how many barcodes will be kept as cells (ie, as non-empty droplets) ?
    ```
    
    <br>
    
    ```{r a_knee, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## . The "cliff" in the kneeplot is at a 
    ##  rank value close to ~ 1,000 counts.
    ```
    
Then, we can use these ranks to **identify "true" cells** :

```{r h_emptyDrops, class.source="notrun", class.output="notruno", eval = FALSE}
## Read the function help page
?DropletUtils::emptyDrops
```

```{r emptydrops1}
## Set the RNG seed
set.seed(my_seed)

## Identify empty droplets
bc_rank2 <- DropletUtils::emptyDrops(scmat)
```

We can show a summary of ordered barcodes (qualified on top) :

```{r emptysmry, class.source="notrun", class.output="notruno"}
## Show a summary of the qualified ranks (ordered for display convenience)
print(bc_rank2[order(bc_rank2$Total, decreasing = TRUE),])
```

<br>

### Selection

Barcodes considered as empty have **no p-value** (`FDR` = NA)

```{r bcr2_na}
## Creating a validation column (empty droplets have NA as p-value)
bc_rank2$VALID <- !is.na(bc_rank2$FDR) 

## Quantify "real cells"
table(bc_rank2$VALID)
```

<br>

We can **control the stringency** by setting a FDR-adjusted p-value **threshold**

```{r valid_fdr}
## We've set the max p-value to 1E-03
bc_rank2$VALID[bc_rank2$FDR >= max_p] <- FALSE

## Control
table(bc_rank2$VALID)
```

<br>

Now, we can **display** our **true cells** on the kneeplot :

```{r kneeplot2, fig.align="center"}
## Plot
graphics::plot(bc_rank$rank, 
               bc_rank$total+1, 
               log = "xy", 
               xlab = "Barcode rank", 
               ylab = "Total UMIs", 
               col = ifelse(bc_rank2$VALID, "red", "black"), 
               pch = ".", 
               cex = ifelse(bc_rank2$VALID, 7, 3), 
               main = paste0("Kneeplot (", 
                             length(which(bc_rank2$VALID)), 
                             " barcodes kept)"))
```

<br>

-   **Question** :

    ```{r q_kneered, class.source="question", eval = FALSE}
    Is there something unexpected for the retained "TRUE" cells in this kneeplot ?
    ```
    
    <br>
    
    ```{r a_kneered, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## . You may observe that the selection of "true" cells (red)
    ##   does not strictly correspond to a "cut" in the kneeplot
    ##   descending curve.
    ##
    ## . This is because emptyDrops does not only "follow" the curve
    ##   to determine a threshold corresponding to a "minimal count" 
    ##   value to consider a barcode as a cell, but also performs a 
    ##   statistical analysis for each barcode, considering how much 
    ##   its expression profile ressembles the one of other cells, 
    ##   even with a very different (lower) global level of expression.
    ```
    
<br><br>

## Removal

We just need to restrict our count matrix to "true" cells.

-   How many do we have at the moment ?

    ```{r scdim3, class.source="notrun", class.output="notruno"}
    dim(scmat)
    ```
    
    <br>
    
-   Filter empty droplets out :

    ```{r edfilter}
    ## Restrict to valid barcodes
    scmat_cells <- scmat[, bc_rank2$VALID]
    
    ## Control
    dim(scmat_cells)
    ```

<br>

------------------------------------------------------------------------

------------------------------------------------------------------------

# Ambient RNA

-   The counts measured in this matrix of (now considered "true") single cells
does **not** reflects the measurement of these cells expression **only**

-   Due to other **over-lysed / dying cells** in the emulsion, we also measured
   fraction of **ambient RNA** that adds up to the "real" expression
   
-   We may **estimate** it by observing counts existing at a **minimal level** across 
cells that greatly differ in their expression profile

## Estimate the "soup"

[**SoupX**](https://cran.r-project.org/web/packages/SoupX/readme/README.html){target="_blank"}
can estimate the rate and composition of ambient RNA, in several ways :

-   **Manual mode** : providing a list of features expected to reflect
    the soup (genes expressed at high levels in a majority of the
    expected cell types)
    
-   **Automatic modes**, using the empty droplets-**filtered** raw count
    matrix ...
    
    -   ... along **with the unfiltered** matrix (ie, containing all quantified
        droplets). This is the most efficient mode.
        
    -   ... **solely** : this is less efficient, but useful when the unfiltered
        matrix is not available.

```{r h_SoupX_auto, eval = FALSE}
## Reading the function help page
?SC.helper::SoupX_auto
```

Let's run SoupX :

```{r soupxrhoE}
## In case of error :
# install.packages("irlba", type="source", force=TRUE)

## SoupX
soup_frac <- SC.helper::SoupX_auto(
  scmat_filt = scmat_cells, 
  scmat_raw = scmat)
```

<br>

***NOTE*** : We can observe [**MALAT1**](https://www.biorxiv.org/content/10.1101/2024.07.14.603469v2){target="_blank"} as the top first gene.

<br>

What's the estimated fraction of soup in our counts ?

```{r soupxfrac}
cat("Soup fraction : ", soup_frac)
```

<br>

-   **Question** :

    ```{r q_souprate, class.source="question", eval = FALSE}
    Should we try to remove such an amount of ambient RNA ?
    ```
    
    <br>
    
    ```{r a_souprate, class.source = c("fold-hide", "answer"), eval = FALSE}
    The best way to know is to try it out !
    ```

<br><br>


## Remove the "soup"

SoupX can **remove** the soup (subtract counts for identified soup genes
to all barcodes/cells) using different methods :

-   **Automatically** (safer)

-   Specifying a **fraction of counts** to remove.

    -   *Pros* :
    
        -   By removing X % of reads :
        
            -   if soup was effectively present (at a rate \<= X), this
                will **mostly affect soup** genes first.
                
            -   If not, it will **decrease counts** globally, thus have
                negligible effect in differences between cells / cell
                types.
                
    -   *Cons* :
    
        -   **Counts are** already **rare** in droplet-based single cell
            technologies... Removing them "by default" makes them **even
            more rare**...
            
        -   If actual soup **rate is higher** than the defined rate X to
            remove, this mode will remain **uneffective**...

```{r soupxrm}
## Run SoupX in "removal" mode
scmat_unsoup <- SC.helper::SoupX_auto(
  scmat_filt = scmat_cells, 
  scmat_raw = scmat, 
  return_object = TRUE)

## We can remove the barcode matrix to free some RAM
rm(scmat)
```

<br>

## Visualize the "soup" removal

-   Now, we want to **characterize the effect** of the removal of the ambient
RNA, by comparing the pre- and post- soupX matrices. 

-   In this purpose, we will construct a **Seurat object** from each of these
matrices, that we will **describe**, then **represent** graphically.

-   But how to **create a Seurat object** from a count matrix ?

```{r h_CreateSeuratObject, class.source = "fold-hide", eval = FALSE}
## Reading the function help page
?Seurat::CreateSeuratObject
```

<br><br><br>

### Create and describe the objects {.tabset .tabset-fade .tabset-pills}

#### Before SoupX

-   Create the Seurat object

```{r createseuratNS}
## Create the Seurat object for unsouped cells
sobj <- Seurat::CreateSeuratObject(
  counts = scmat_cells, 
  project = "PBMC10K_BeforeSoupX")
```

-   Describe

    -   We will use a function prepared for you : `SC.helper::SeuratObject_descriptor`

```{r descseuratNS}
## Descibe the obtained Seurat object
SC.helper::SeuratObject_descriptor(sobj = sobj)
```

<br>

#### After SoupX

-   Create the Seurat object

```{r createseuratS}
## Create the Seurat object for unsouped cells
sobj_unsoup <- Seurat::CreateSeuratObject(
  counts = scmat_unsoup, 
  project = "PBMC10K_AfterSoupX")
```

-   Describe

```{r descseuratS, class.source="notrun", class.output="notruno"}
## Describe the new object content
SC.helper::SeuratObject_descriptor(sobj = sobj_unsoup)
```

##  {.unnumbered}

<br>

------------------------------------------------------------------------

------------------------------------------------------------------------

<br>

-   **Questions** :

    ```{r q_descdiff, class.source="question", eval = FALSE}
    Can you describe the difference (especially for counts and sparsity) ?
      
    Is it expected ?
    ```
    
    <br>
    
    ```{r a_descdiff, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## . Total counts, max value, counts per cell and number of
    ##   expressed features per cell, all have decreased.
    ##
    ## . The sparsity level has slightly increased (removing the soup
    ##   obliterated all counts for some features in some cells).
    ##
    ## . All of this is absolutely expected.
    ```
    
<br>

What are the features most affected by the soup removal ?

```{r soupfeatures, class.source="notrun", class.output="notruno"}
## Compute the fraction matrix (PRE / POST)
##  NOTE : we have 0 counts in the divider, 
##  so we increment both matrix by +1 !
cont_frac <- (scmat_cells + 1) / (scmat_unsoup + 1)

## Fraction of counts decreased by soup removal, per feature (ordered)
feat_frac <- sort(sparseMatrixStats::rowMeans2(cont_frac), decreasing = TRUE)

### Display Top 5 features
utils::head(feat_frac, decreasing = TRUE)

## Removing objects to free some RAM
rm(cont_frac, feat_frac)
```

<br>

```{r}
## Remove unused objects to free some RAM
rm(scmat_cells, scmat_unsoup)
```

<br>

### Plot the "top 1" feature {.tabset .tabset-fade .tabset-pills}

-   We coded a small R function to help you visualize in 2D your data, even at 
such a very early step of the analysis.

-   This function will analyze your data (as a Seurat object) in the shadows with default values, and generate a 2D representation called [**UMAP**](https://umap-learn.readthedocs.io/en/latest/){target="_blank"}.

-   As default values are used, they should fit most cases. **However, do not
fully trust** by default its results as a pure ground truth, as your data may not fit these defaults... That's why the function is nammed "QnD_viz", as in "Quick and dirty visualization".

```{r h_QnD_viz, eval = FALSE}
## Reading the function help page
?SC.helper::QnD_viz
```

#### Before SoupX

```{r bsoupx, fig.height = 6, fig.width = 6}
## Top1 feature 
umap_pre <- SC.helper::QnD_viz(
  sobj = sobj, 
  features = "LYZ",
  return_plot = TRUE)
```

<br>

#### After SoupX

```{r asoupx, fig.height = 6, fig.width = 6, class.source="beyond", class.output="beyondo"}
## Top1 feature 
umap_post <- SC.helper::QnD_viz(
  sobj = sobj_unsoup, 
  features = "LYZ",
  return_plot = TRUE)
```

<br>

##  {.unnumbered}

------------------------------------------------------------------------

------------------------------------------------------------------------

Merging plots for ease of use :

```{r soupx_umaps, fig.width = 12, fig.height = 6, class.source="beyond", class.output="beyondo"}
## Using the patchwork package to merge plots (and ggplot2 to add titles)
patchwork::wrap_plots(
  list(
    umap_pre & ggplot2::ggtitle(label = "LYZ before SoupX"), 
    umap_post & ggplot2::ggtitle(label = "LYZ adter SoupX")), 
  nrow = 1)
```

<br>

-   **Question**

    ```{r q_umapsoupx, class.source="question", eval = FALSE}
    Can you compare the two plots :
      - for the LYZ expression globally ?
      - for the LYZ expression across the 2-D space ?
      - about the space topologies ?
    ```
    
    <br>
    
    ```{r compar, class.source = c("fold-hide", "answer"), eval = FALSE}
    ## . The level expression measured in cells with smaller expression
    ##   before SoupX has gone to almost none after : the ambient expression
    ##   of this gene has efficiently been removed.
    ##
    ## . Hopefully, the cluster(s) with higher expression remain(s) high.
    ##
    ## . The expression scale upper bound after SoupX is higher than before ?! 
    ##   But we REMOVED some counts ?! While being counter-intuitive, this is
    ##   a positive consequence of the ambient removal on the scaling of
    ##   barcodes expression (see later). 
    ##
    ## . The global topology of clusters remains close (but not identical).
    ```
    
<br>

------------------------------------------------------------------------

------------------------------------------------------------------------

<br><br><br>

**Beyond** :

1.  Save the `sobj` R object on disk.

```{r b_save_answer1, class.source = c("fold-hide", "beyond"), eval = FALSE}
## A single object : use RDS format with saveRDS()
saveRDS(object = sobj, file = paste0(output_dir, "/sobj.RDS"))
```

2.  Save both the `sobj` and `sobj_unsoup` objects in a single archive
    on disk.

```{r b_save_answer2, class.source = c("fold-hide", "beyond"), eval = FALSE}
## Multiple objects : use Rdata format with save()
save(list = c(sobj, sobj_unsoup), file = paste0(output_dir, "/sobjects.RDA"))
```

3.  Same as 1. but use the "bzip2" compression algorithm

```{r b_save_answer3, class.source = c("fold-hide", "beyond"), eval = FALSE}
## Compress with bzip2
saveRDS(object = sobj, file = paste0(output_dir, "/sobj.RDS"), compress = "bzip2")
```

<br>

------------------------------------------------------------------------

------------------------------------------------------------------------

<br><br><br>

# Rsession

```{r rsession, class.source="notrun", class.output="notruno"}
utils::sessionInfo()
```
