1 Introduction

During this training session, we are going to use ChIP-seq and RNA-seq data from Laurette et al. (Laurette et al. 2015) which are available in GEO as GSE61967. Data have already been processed.

1.1 Processing of RNA-seq data

1.1.1 Preprocessing

Reads were preprocessed in order to remove adapter, polyA and low-quality sequences (Phred quality score below 20). After this preprocessing, reads shorter than 40 bases were discarded for further analysis. These preprocessing steps were performed using cutadapt (Martin 2011) version 1.10.

1.1.2 Mapping

Reads were mapped onto the hg38 assembly of Homo sapiens genome using STAR (Dobin et al. 2013) version 2.5.3a.

1.1.3 Quantification

Gene expression quantification was performed from uniquely aligned reads using htseq-count version 0.6.1.p1 (S. Anders, Pyl, and Huber 2015), with annotations from Ensembl version 103 and “union” mode. Only non-ambiguously assigned reads have been retained for further analyses.

1.1.4 Normalization

Read counts have been normalized across samples with the median-of-ratios method proposed by Anders and Huber (Anders and Huber 2010), to make these counts comparable between samples.

1.1.5 Differential expression analysis

Differential expressed genes between wild type ans shBRG1 cells was performed using the edgeR R package (Robinson, McCarthy, and Smyth 2010). We called significant changes when FDR < 0.01, absolute log fold change over 1 and minimum average log normalized count over 5.

1.2 Processing of ChIP-seq data

1.2.1 Mapping

Reads were mapped to Homo sapiens genome (assembly hg38) using Bowtie (Langmead et al. 2009) v1.0.0 with default parameters except for “-p 3 -m 1 –strata –best –chunkmbs 128”. The following table shows the number of reads aligned to the Homo sapiens genome.

(#tab:stats.chipseq)Mapping statistics of ChIP-seq data analyzed during this training session. Column “Raw reads” corresponds to the number of input reads. Column “Aligned reads” corresponds to the number of reads aligned exactly 1 time. Column “Multimapped” corresponds to the number of reads aligned > 1 times. Column “Unmapped” corresponds to the number of reads aligned 0 time.
Sample ID Sample name raw reads aligned reads multimapped reads unmapped reads
SRR1594290 MITF_CHIP-seq 84856252 54369181 (64.07%) 12823841 (15.11%) 17663230 (20.82%)
SRR1594291 SOX10_CHIP-seq 38202152 29594694 (77.47%) 6128890 (16.04%) 2478568 (6.49%)
SRR1594292 BRG1siCTRL_CHIP-seq 50190992 40134919 (79.96%) 6403809 (12.76%) 3652264 (7.28%)
SRR1594293 BRG1siMITF_CHIP-seq 64765675 44480241 (68.68%) 9357361 (14.45%) 10928073 (16.87%)
SRR1594294 GFPsiCTRL_CHIP-seq 38148419 26273410 (68.87%) 5722823 (15.00%) 6152186 (16.13%)
SRR1594295 MITF_Input 29433042 19970925 (67.85%) 4049711 (13.76%) 5412406 (18.39%)
SRR1594296 SOX10_Input 35449561 27381422 (77.24%) 7045192 (19.87%) 1022947 (2.89%)
SRR1596499 BRG1siSOX10_CHIP-seq 42745544 34418403 (80.52%) 6855741 (16.04%) 1471400 (3.44%)

1.2.2 Peak Calling

Prior to peak calling, reads falling into Encode blacklisted regions (“(2014) Mod/Mouse/humanENCODE: Blacklisted Genomic Regions for Functional Genomics Analysis - Anshul Kundaje n.d.) were removed using bedtools intersect v2.26.0 (Quinlan and Hall 2010). Then peak calling was done with Macs2 v2.1.1 with default parameters.

Table 1.1: Number of peaks detected
IP sample Input sample No. of peaks
BRG1siCTRL_CHIP-seq GFPsiCTRL_CHIP-seq 72024
BRG1siMITF_CHIP-seq GFPsiCTRL_CHIP-seq 33984
BRG1siSOX10_CHIP-seq GFPsiCTRL_CHIP-seq 73267
MITF_CHIP-seq MITF_Input 9702
SOX10_CHIP-seq SOX10_Input 4538

1.2.3 Generation of BigWig files

Normalized BigWig files were generated using Homer (Heinz et al. 2010) makeUCSCfile v4.11.0 with the following parameter ‘-norm 1e7’ meaning that data were normalized to 10M reads.

1.3 Start a project and define the paths to the data

Before starting any project, it is a good idea to setup the path to the project as a variable. For example, we could use:

projPath <- "/shared/projects/2514_ebiii_n2/chipseq"
## If you are not using RStudio, you can change the working directory with:
setwd(projPath) ##NOT RUN

If you use an Rmd file and RStudio, when you open the file, your working directory should already be set as the folder where this Rmd file is located. You can check it with:

getwd()

Here, we will use different files for our analyses. We can create a list containing the absolute paths or names to the different files and folders that we will use during the practice session.

params <- list(
  workdir = "/shared/projects/2514_ebiii_n2/chipseq",
  mapping.chipseq = "stats_mappingChIPseq.tsv",
  peakcalling.chipseq = "stats_peakCalling.tsv",
  brg1 = "BRG1siCTRL_CHIP-seq_peaks.narrowPeak",
  mitf = "MITF_CHIP-seq_peaks.narrowPeak",
  sox10 = "SOX10_CHIP-seq_peaks.narrowPeak",
  brg1.bw = "BRG1siCTRL_CHIP-seq.bw",
  mitf.bw = "MITF_CHIP-seq.bw",
  sox10.bw = "SOX10_CHIP-seq.bw",
  rnaseq = "RNAseq_diff_norm.rds",
  annot = "annot.rds"
  )

2 Importing the peaks and obtaining some basic statistics

2.1 Importing peak data

Peak files are in narrowPeak format which is of the form:

  1. chrom - Name of the chromosome (or contig, scaffold, etc.).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0 (i.e. 0-based).
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Name given to a region (preferably unique). Use “.” if noname is assigned.
  5. score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were “‘0”’ when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
  6. strand - +/- to denote strand or orientation (whenever applicable). Use “.” if no orientation is assigned.
  7. signalValue - Measurement of overall (usually, average) enrichment for the region.
  8. pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
  9. qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
  10. peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.

There are different ways to import such data. We could use a simple read.table function but we know that it would not convert the 0-based coordinates to the 1-based coordinates used in R/Bioconductor.

The ChIPseeker package provides a readPeakFile function that we could use like this:

##NOT RUN
library(ChIPseeker)
peaks <- list()
peaks[["BRG1"]] <- readPeakFile(file.path(params$workdir,"data", "ChIPseq", params$brg1), 
                                as="GRanges")

However, the function is essentially a wrapper to the read.delim function and does not assign useful column names to the imported data. So instead we will use the readNarrowPeak function from the genomation package to import the narrowPeak files obtained for the 3 factors (BRG1, MITF and SOX10):

library(genomation)
peaks <- list()
peaks[["BRG1"]] <- genomation::readNarrowPeak(file.path(params$workdir,
                                                        "data", "ChIPseq", 
                                                        params$brg1))
peaks[["MITF"]] <- readNarrowPeak(file.path(params$workdir,
                                            "data", "ChIPseq", 
                                            params$mitf))
peaks[["SOX10"]] <- readNarrowPeak(file.path(params$workdir,
                                             "data", "ChIPseq", 
                                             params$sox10))

The package also provides a readGeneric function that can adapt to different peak file formats. For example, the following command is equivalent to what we did above for BRG1:

## NOT RUN
peaks[["BRG1"]] <- readGeneric(file.path(params$workdir,"data", "ChIPseq", params$mitf),
                               meta.cols = list(name = 4, score = 5, signalValue = 7,
                                                pValue = 8, qValue = 9, peak = 10),
                               zero.based = TRUE)

Let’s take a look at how the data was imported:

peaks[["BRG1"]]
## GRanges object with 72024 ranges and 6 metadata columns:
##           seqnames            ranges strand |                   name     score
##              <Rle>         <IRanges>  <Rle> |            <character> <integer>
##       [1]     chr1     980402-981178      * | BRG1siCTRL_CHIP-seq_..        77
##       [2]     chr1     983476-984526      * | BRG1siCTRL_CHIP-seq_..       108
##       [3]     chr1   1000729-1001179      * | BRG1siCTRL_CHIP-seq_..        90
##       [4]     chr1   1001762-1002054      * | BRG1siCTRL_CHIP-seq_..        40
##       [5]     chr1   1020914-1021207      * | BRG1siCTRL_CHIP-seq_..       114
##       ...      ...               ...    ... .                    ...       ...
##   [72020]     chrY 18936757-18937116      * | BRG1siCTRL_CHIP-seq_..        37
##   [72021]     chrY 19577774-19578316      * | BRG1siCTRL_CHIP-seq_..        73
##   [72022]     chrY 19891214-19892094      * | BRG1siCTRL_CHIP-seq_..       114
##   [72023]     chrY 19892553-19893192      * | BRG1siCTRL_CHIP-seq_..        55
##   [72024]     chrY 21837647-21838039      * | BRG1siCTRL_CHIP-seq_..        55
##           signalValue    pvalue    qvalue      peak
##             <numeric> <numeric> <numeric> <integer>
##       [1]     5.81321  10.22971   7.71662       166
##       [2]     5.97792  13.59661  10.83201       384
##       [3]     6.65677  11.62894   9.01201       199
##       [4]     3.78100   6.17367   4.02109       209
##       [5]     8.11634  14.29797  11.46427       158
##       ...         ...       ...       ...       ...
##   [72020]     4.53175   5.93549   3.79936       167
##   [72021]     6.32404   9.87728   7.38650       165
##   [72022]     7.86030  14.29797  11.46427       711
##   [72023]     5.29987   7.83781   5.50942       390
##   [72024]     5.55592   7.83781   5.50942       152
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Peaks are stored as GenomicRanges (or GRanges) objects; this is an R format which looks like the bed format, but is optimized in terms of memory requirements and speed of execution.

NOTE:: when using GRanges it is always a good idea to set the seqinfo metadata (info about chromosome names and length essentially). We will do that later below.

We can start by computing some basic statistics on the peak sets.

2.2 How many peaks were called?

Compute the number of peaks per dataset. We use here a function from the apply family which help to apply recursively a given function on elements of the object. We will often use these functions along the course.

The sapply() function typically takes a list or vector as input and applies the same function (here length) to each element:

sapply(peaks,length)
##  BRG1  MITF SOX10 
## 72024  9702  4538

Note that in this specific case, base R provides a lengths function (lengths with an s at the end) that does the same, i.e. gives the length of each element of a list object:

lengths(peaks)

Make a simple barplot showing the number of BRG1 peaks.

Show code: barplot

barplot(lengths(peaks), ylab = "Number of peaks")

# or:
# barplot(lengths(peaks), ylab = "Number of peaks", log="y")

Let’s create a barplot with ggplot2 out of this data.
Step by step:
Do not forget to load the ggplot2 library

Show code: load ggplot2

# Load ggplot2 library
library(ggplot2)
  • Create a data.frame with two columns: IP contains names of the chipped TFs and NbPeaks contains the number of peaks.

Show code: create the data.frame

# create a table with the data to display
peak.lengths <- data.frame(IP = names(peaks),
                           NbPeaks = lengths(peaks))
  • use ggplot2 with the appropriate geometric object geom_*

Show code: barplot with ggplot2

# make the barplot
ggplot(peak.lengths, aes(x = IP, y = NbPeaks)) +
         geom_col()

  • Improve the barplot by changing axis labels and the theme

Show code: modify axes labels and theme

#make the barplot a bit nicer:
ggplot(peak.lengths, aes(x = IP, y = NbPeaks)) +
  geom_col() +
  labs(x = "ChIP", y = "Number of peaks") +
  theme_bw()

We can customize the plots by adding colors.

There are a number of color palettes that you can use in R and different ways to specify colors. For example, the Rcolorbrewer package is widely used and offers several color palettes:

# See:
library(RColorBrewer)
par(mar = c(3,4,2,2))
display.brewer.all()

The ggsci package offers color palettes based on scientific journals.

If you are searching for a specific color, take a look at this site.

Now lets add colors to the barplot.
Step by step:
- Add the new information fill=IP to let ggplot know that colors change based on chipped protein

Show code: color according to the chipped TF (column IP)

ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) +
         geom_col() +
         labs(x = "ChIP", y = "Number of peaks") +
         theme_bw()

We want to use colors from RColorBrewer library with the “Set1” color palette. To set your own colors to fill the plot, several functions are available (scale_fill_*), here we use scale_fill_brewer.

Show code: color according to the chipped TF (column IP) and set RColorBrewer palette to Set1

ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) +
         geom_col()+
         scale_fill_brewer(palette="Set1") +
         labs(x = "ChIP", y = "Number of peaks") +
         theme_bw()

And now use one of the ggsci palette.

Show code: use a ggsci palette

## Load ggsci library
library(ggsci)

## Make the plot with the NPG palette
ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) +
         geom_col()+
         scale_fill_npg() +
         labs(x = "ChIP", y = "Number of peaks") +
         theme_bw()

2.3 How large are these peaks?

The peaks were loaded as a GRanges objects.
- To manipulate them we load the GenomicRanges library. Then we use the appropriate function to retrieve the width of the peaks from the GRanges and use a function to summarize the peak length (display the statistics like minimum, maximum, quartiles, etc.)

Show code: Display peak length statistics for BRG1 only

## we use the function width() from GenomicRanges
library(GenomicRanges)
summary(width(peaks$BRG1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   254.0   308.0   427.0   578.7   671.0  9186.0
# you can also try with the skim function from the skimr package:
# skimr::skim(width(peaks$BRG1))

Create a simple boxplot of the peak sizes.

Show code: Display peak length statistics as a boxplot

# use a log scale on the y axis to facilitate the comparisons
peak.width <- lapply(peaks, width)
boxplot(peak.width, ylab = "peak width", log = "y")

Now, create a nice looking boxplot with ggplot2. ggplot takes a data frame as input. We can either create a date frame, this is what we have done when we created the barplot. Here, we are going to use the package reshape2 which, among all its features, can create a data frame from other types of data, in particular list objects, via the melt function:

# Load the package
library(reshape2)
peak.width.table <- melt(peak.width)
head(peak.width.table)
##   value   L1
## 1   777 BRG1
## 2  1051 BRG1
## 3   451 BRG1
## 4   293 BRG1
## 5   294 BRG1
## 6   351 BRG1

Create a ggplot object with correct aesthetics to display a boxplot according the chipped TF (x-axis) and their width (y-axis). Then use appropriate geometric object geom_*.

Show code: Display peak length statistics with ggplot2

# create boxplot
ggplot(peak.width.table, aes(x = L1, y = value)) +
         geom_boxplot()

By default, the background color is grey and often does not allow to highlight properly the graphs. Many theme are available defining different graphics parameters, they are often added with a function theme_*(), the function theme() allows to tune your plot parameter by parameter. We propose here to use theme_classic() that changes a grey background to white background. Some peaks are very long and squeeze the distribution, one way to make the distribution easier to visualize is to transform the y axis to log scale. Finally, we want to remove the x label, change the y label to ‘Peak widths’ and the legend relative to the fill color from “L1” to “TF”.
Using the ggplot2 documentation (e.g. tidyverse or google it, try to enhance the boxplot.

Show code: Enhance it !

# - theme_classic() change grey background to white background
# - fill=L1 and scale_fill_brewer(palette="Set1") colors boxplots
# based on chipped protein and with colors from RColorBrewer Set1 palette
# - labs changes x and y axis labels and legend title
# - scale_y_log10() set y axis to a log scale so that we can have a nice
# view of the data in small values
ggplot(peak.width.table, aes(x = L1, y = value, fill = L1)) +
         geom_boxplot()+
         theme_classic()+
         scale_fill_brewer(palette = "Set1")+
         labs(x = "", y = "Peak sizes", fill = "TF")+
         scale_y_log10()

2.4 Peak filtering

To make sure we keep only high quality data. We are going to select those peaks having a \(qvalue >= 4\). In the GRanges metadata columns (mcols) one of them is the qvalue. So, we are going to set a threshold on this column. How many peaks are selected ?

Show the code: select high quality peaks for BRG1 and count them

## Select BRG1 peaks with qvalue>=4 and count them
length( peaks$BRG1[peaks$BRG1$qvalue >= 4] )
## [1] 60184

Now we want to apply this function to all IPs.

Show the code: filter the peaks for all IPs

## Select BRG1 peaks with qvalue>=4 and count them
peakfilt <- lapply(peaks,
                   function(x) {
                     x[x$qvalue >= 4]
                   })
lengths(peakfilt)
##  BRG1  MITF SOX10 
## 60184  9279  4214

3 Peak annotation relative to genomic features and peak overlaps

3.1 Where are the peaks located across the whole genome?

Sometime, peaks may occur more on some chromosomes than others or there can be regions where peaks are absent (example: repetitive regions that have been filtered out). We can display the genomic distribution of peaks along the chromosomes, using the covplot function from ChIPSeeker. Height of peaks is drawn based on the peak scores.
Let’s use this function on SOX10 for example:

3.2 Where are the peaks located relative to genomic features?

Here, the aim is to identify where the peaks tend to be located relative to typical genome annotations such as promoters, exons, introns, intergenic regions, etc.
We will assign peaks to genomic features (genes, introns, exons, promoters, distal regions, etc.) by checking if they overlap with these features.
We will illustrate this task using functions from the ChIPseeker package.
But first we need to retrieve the annotations

3.2.1 Retrieve genome annotation (genes, transcripts, exons, etc.)

We load ready-to-use annotation objects from Bioconductor:

  • a OrgDB object: org.Hs.eg.db and
  • a TxDB object: TxDb.Hsapiens.UCSC.hg38.knownGene
## org.Hs.eg.db is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
library(org.Hs.eg.db)
## Load transcript-oriented annotations
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

Since we now have a TxDb object that contains the seqinfo metadata, we take the opportunity to transfer this metadata to our peaks.

peaks <- lapply(peaks, function(x) {seqinfo(x) <- seqinfo(txdb); return(x)})
peakfilt <- lapply(peakfilt, function(x) {seqinfo(x) <- seqinfo(txdb); return(x)})

3.2.2 Annotate the peaks relative to genomic features

This is done using the function annotatePeak which compares peak positions with the coordinates of genomic features on the same reference genome.
This function returns a complex object which contains all this information.
As we did for the peaks coordinates, we store the peak annotations in a list that we first initialize as an empty list peakAnno <- list().
Here, we define the TSS regions -1000 bp to +100 bp from the TSS itself.
Using the help of the function and annotation objects we have loaded, try to write the correct command line to annotate BRG1 peaks.

Show code: annotate BRG1 peaks

## Annotate peaks for all datasets and store it in a list
## Here TSS regions are regions -1000Kb/+100b arount TSS positions
## Peak annotations are stored in a list
peakAnno <- list()
peakAnno[["BRG1"]] = annotatePeak(peakfilt$BRG1, 
                                  tssRegion = c(-1000, 100), 
                                  TxDb = txdb, 
                                  annoDb = "org.Hs.eg.db")
## >> preparing features information...      2025-05-29 11:10:43 AM 
## >> identifying nearest features...        2025-05-29 11:10:44 AM 
## >> calculating distance from peak to TSS...   2025-05-29 11:10:46 AM 
## >> assigning genomic annotation...        2025-05-29 11:10:46 AM 
## >> adding gene annotation...          2025-05-29 11:11:12 AM
## >> assigning chromosome lengths           2025-05-29 11:11:12 AM 
## >> done...                    2025-05-29 11:11:12 AM
class(peakAnno$BRG1)
## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"
## Visualize and export annotation as a data table
# as.data.frame(peakAnno$BRG1)
head(as.data.frame(peakAnno$BRG1))
##   seqnames   start     end width strand                       name score
## 1     chr1  980402  981178   777      * BRG1siCTRL_CHIP-seq_peak_1    77
## 2     chr1  983476  984526  1051      * BRG1siCTRL_CHIP-seq_peak_2   108
## 3     chr1 1000729 1001179   451      * BRG1siCTRL_CHIP-seq_peak_3    90
## 4     chr1 1001762 1002054   293      * BRG1siCTRL_CHIP-seq_peak_4    40
## 5     chr1 1020914 1021207   294      * BRG1siCTRL_CHIP-seq_peak_5   114
## 6     chr1 1059581 1059989   409      * BRG1siCTRL_CHIP-seq_peak_8    68
##   signalValue   pvalue   qvalue peak
## 1     5.81321 10.22971  7.71662  166
## 2     5.97792 13.59661 10.83201  384
## 3     6.65677 11.62894  9.01201  199
## 4     3.78100  6.17367  4.02109  209
## 5     8.11634 14.29797 11.46427  158
## 6     5.43367  9.31879  6.86163  239
##                                          annotation geneChr geneStart geneEnd
## 1                                            5' UTR       1    975198  982093
## 2                                 Distal Intergenic       1    976121  982117
## 3                                          Promoter       1   1001138 1014540
## 4    Intron (ENST00000624697.4/9636, intron 1 of 2)       1   1001145 1014435
## 5 Intron (ENST00000379370.7/375790, intron 1 of 35)       1   1020123 1056118
## 6                                          Promoter       1   1059687 1066449
##   geneLength geneStrand    geneId      transcriptId distanceToTSS
## 1       6896          2     84808 ENST00000433179.4           915
## 2       5997          2     84808 ENST00000694917.1         -1359
## 3      13403          1      9636 ENST00000624697.4             0
## 4      13291          1      9636 ENST00000624652.1           617
## 5      35996          1    375790 ENST00000620552.4           791
## 6       6763          1 100288175 ENST00000688131.1             0
##           ENSEMBL       SYMBOL                                     GENENAME
## 1 ENSG00000187642        PERM1 PPARGC1 and ESRR induced regulator, muscle 1
## 2 ENSG00000187642        PERM1 PPARGC1 and ESRR induced regulator, muscle 1
## 3 ENSG00000187608        ISG15                ISG15 ubiquitin like modifier
## 4 ENSG00000187608        ISG15                ISG15 ubiquitin like modifier
## 5 ENSG00000188157         AGRN                                        agrin
## 6 ENSG00000291156 LOC100288175                 uncharacterized LOC100288175

All peak information contained in the peak list is retained in the output of annotatePeak. Positions and strand information of the nearest genes are reported. The distance from the peak to the TSS of its nearest gene is also reported. The genomic region where the peak lies is reported in the annotation column. Since some annotation may overlap (e.g. a 5’ UTR is always located in an exon), ChIPseeker adopted the following priority in genomic annotations:

  • Promoter
  • 5’ UTR
  • 3’ UTR
  • Exon
  • Intron
  • Downstream
  • Intergenic
  • Downstream (i.e. downstream of the gene 3’ end).

This hierachy can be customized using the parameter genomicAnnotationPriority.

The annotatePeak function reports detailed information when the annotation is Exon or Intron, for instance “Intron (ENST00000624697.4/9636, intron 1 of 2)” means that the peak is overlapping with the first intron of transcript ENST00000624697.4, whose corresponding Entrez gene ID is 9636 (Transcripts that belong to the same gene ID may differ in splice events).

The “annoDb” parameter is optional. If it is provided, some extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID are added.

Reminder: The TxDb class is a container for storing transcript annotations.

  • Bioconductor provides several packages containing TxDb objects for model organisms sur as Human and mouse. For instance, TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Hsapiens.UCSC.hg19.knownGene for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene and TxDb.Mmusculus.UCSC.mm9.knownGene for mouse genome mm10 and mm9, etc.

  • The user can also prepare their own TxDb by retrieving information from UCSC Genome Bioinformatics and BioMart data resources using the R functions makeTxDbFromBiomart and makeTxDbFromUCSC that are now in the txdbmaker package.

  • One can also create a TxDb objects for any organism using an annotation file in GTF/GFF format using the function makeTxDbFromGFF of the txdbmaker package (GenomicFeatures package in older R versions).

Expand to find Coturnix japonica example

## NOT RUN
## download GTF file
download.file("https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/93934/101/GCF_001577835.2_Coturnix_japonica_2.1/GCF_001577835.2_Coturnix_japonica_2.1_genomic.gtf.gz", "Coturnix_japonica_2.1.annotation.gtf.gz")
## Build TxDb object
library(GenomicFeatures)
txdb = makeTxDbFromGFF("Coturnix_japonica_2.1.annotation.gtf.gz")
## To save the txdb database
library(AnnotationDbi)
saveDb(txdb, 'txdb.Coturnix_japonica_2.1.sqlite')
## load it when needed
library(AnnotationDbi)
txdb = loadDb(file = 'txdb.Coturnix_japonica_2.1.sqlite')

3.2.3 Plot the results of peak annotation

We can vizualize the way our peaks have been annotated relative to genomic features (introns, exons, promoters, distal regions,…) using a pie chart with plotAnnoPie or as a bar chart plotAnnoBar which facilitates the comparison between different sets of peaks.

Show code: Display peak annotations of BRG1 peaks using a pie chart

## distribution of genomic features for BRG1 peaks
# as a pie chart - which is the most widely used representation in publication
plotAnnoPie(peakAnno$BRG1)

Now we generate annotations for SOX10 and MITF peaks.

Show code: Generate annotations for SOX10 and MITF peaks

## Use the annotatePeak function
peakAnno[["SOX10"]] <- annotatePeak(peakfilt$SOX10, 
                                    tssRegion = c(-1000, 100), 
                                    TxDb = txdb, 
                                    annoDb = "org.Hs.eg.db")
## >> preparing features information...      2025-05-29 11:11:16 AM 
## >> identifying nearest features...        2025-05-29 11:11:16 AM 
## >> calculating distance from peak to TSS...   2025-05-29 11:11:17 AM 
## >> assigning genomic annotation...        2025-05-29 11:11:17 AM 
## >> adding gene annotation...          2025-05-29 11:11:20 AM
## >> assigning chromosome lengths           2025-05-29 11:11:20 AM 
## >> done...                    2025-05-29 11:11:20 AM
peakAnno[["MITF"]] <- annotatePeak(peakfilt$MITF, 
                                   tssRegion = c(-1000, 100), 
                                   TxDb = txdb, 
                                   annoDb = "org.Hs.eg.db")
## >> preparing features information...      2025-05-29 11:11:20 AM 
## >> identifying nearest features...        2025-05-29 11:11:20 AM 
## >> calculating distance from peak to TSS...   2025-05-29 11:11:20 AM 
## >> assigning genomic annotation...        2025-05-29 11:11:20 AM 
## >> adding gene annotation...          2025-05-29 11:11:23 AM
## >> assigning chromosome lengths           2025-05-29 11:11:24 AM 
## >> done...                    2025-05-29 11:11:24 AM
## We could also have generated annotations for all factors at the same time using:
# peakAnno <- lapply(peakfilt, annotatePeak,  tssRegion=c(-1000, 100), TxDb=txdb, annoDb="org.Hs.eg.db")

Now display the results as a stacked barplot using peakAnnoBar.

Show code: Generate stacked barplot of peak annotations

## Use the plotAnnoBar function
plotAnnoBar(peakAnno)

However since some annotation overlap, ChIPseeker also provides functions that help evaluate full annotation overlaps.
For example, the upsetplot function which produces an UpSet plot similar to those produced by the library(UpSetR) package.

upsetplot(peakAnno$BRG1)

3.3 Do the peaks of the different factors overlap?

3.3.1 Venn Diagram

We see that the peaks for the 3 factors SOX10, MITF and BRG1 are present to a certain extent in similar regions (although this could be due to the proportions of these regions in the genome). But this doesn’t tell us if the peaks actually overlap. An overlap between the peaks may mean that the factors co-localize, although we won’t know if it’s the case in single cells.

Overlaps between peak datasets can be evaluated using a Venn diagram. Such an approach is implemented in the ChIPpeakAnno package.

After loading the library, we start by computing the overlap between peak sets, keeping the information of all peaks overlapping in each set (see ?findOverlapsOfPeaks for help).

library(ChIPpeakAnno)
ovl <- findOverlapsOfPeaks(peakfilt, connectedPeaks = "keepAll")

ChIPpeakAnno imposes, while plotting the Venn diagram, to compute the significance of the pairwise associations using a hypergeometric test. To this end, we need to estimate the number of all potential binding events which is used by the makeVennDiagram function through the totalTest number. It is used for the hypergeometric sampling that is used to determine if the overlap between two datasets is more than would be expected by chance. This is not a trivial question, the answer is driven by what you know about the binding properties of your factors (eg. sequence specific, mainly intergenic etc). You can find an interesting discussion here. In our case we can refer to the genomic distribution of the peaks that we have plotted previously. We can assume that our TFs have a gene body binding preference. Genes cover roughly 10% of the genome.

# Estimate the average size of the peaks ...
averagePeakWidth <- mean(width(unlist(GRangesList(ovl$peaklist))))
# ... to count how many potential sites could have been bound in coding regions.
tot <- ceiling(3.3e+9 * 0.1 / averagePeakWidth)

TIPS: We can define the colors attributed to each set using the colours() function. Whenever you want to set a color you can use this function.
Here are some useful pages about colors in R and elsewhere:

Now we can plot the overlap of the peaks as a Venn Diagram:

makeVennDiagram(ovl, totalTest = tot, connectedPeaks = "keepAll",
                fill = brewer.pal(3, "Set1"), # circle fill color
                col = brewer.pal(3, "Set1"), #circle border color
                cat.col = brewer.pal(3, "Set1"))

According to the hypergeometric test p-values all pairwise comparisons are highly significant. This has to be taken very carefully as it largely depends on the background estimation that we may have over-estimated. If you are not sure about your estimation, you can use a non-parametric approach based on your peak genomic distribution to estimate randomness. To this end the TxDb.Hsapiens.UCSC.hg38.knownGene is used.

3.3.1.1 Are the overlaps significant?

ChIPpeakAnno provides the preparePool and peakPermTest functions to compute this test. These tests are made by pair, we will look at SOX10/MITF overlap significance as an example.

# Prepare a pool of random peaks following the characteristics of our peak sets
pool <- preparePool(txdb, peaks$SOX10,
                    bindingType = "TSS",
                    featureType = "transcript",
                    seqn = paste0("chr",c(1:22, "X", "Y")))

# Create the permPool object needed for peakPermTest
pool <- new("permPool", grs=pool$grs[1], N=length(peaks$SOX10))
SOX10.MITF <- peakPermTest(peaks$SOX10, peaks$MITF, 
                           pool = pool, 
                           seed = 123, 
                           force.parallel = FALSE)
SOX10.MITF
## $cntOverlaps
## P-value: 0.0099009900990099
## Z-score: 149.1501
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 535
## Evaluation function: cntOverlaps
## Randomization function: randPeaks
## 
## attr(,"class")
## [1] "permTestResultsList"
plot(SOX10.MITF)

Venn diagrams are widely used to represent overlaps, intersections. However, in ChIP-seq analysis, the definition of the peaks is dependent on the peak caller, the FDR thresholds (what about peaks just below the threshold?). The overlap is also difficult to assess, indeed do we call an overlap of 1 nucleotide an intersection ? It is one of the numerous parameters that can be tuned…

Another way to explore the results is to evaluate if different combinations of the factors are associated to specific genomic features. We can use again the annotatePeak function to evaluate this:

coocs <- as(ovl$peaklist, "GRangesList")
# apply annotatePeak to each set of peaks in the list coocs
peakAnnoList <- lapply(coocs, annotatePeak, 
                       TxDb = txdb, annoDb = "org.Hs.eg.db",
                       tssRegion = c(-1000, 100),
                       verbose = FALSE)

And then plot the results:

# Plot peak distribution relative to gene features
plotAnnoBar(peakAnnoList)

# Plot peak distribution relatively to their distance to the TSS
plotDistToTSS(peakAnnoList)

4 Heatmaps and average profiles

Heatmaps and average profiles are widely used representations of ChIP-seq data as they allow simultaneous visualization of read enrichment at various locations. For instance, one may want to represent reads related to a chipped protein in regions spanning +/-5Kb around all TSS of the reference genome. Another objective could be to compare read enrichment at the same locations in many chip-seq datasets.

4.1 Distribution of BRG1 peaks

We want to know if BRG1 is binding large and/or narrow regions, unique and/or tandem etc. So we will study the distribution of the BRG1 peaks that have been called in a region spanning +/-2Kb around the peak centers.

First of all, for calculating the profile of ChIP peaks around the BRG1 peak center, we need to define BRG1 peak centers and extend them 2Kb on each side. We will then select the 10000 best peaks using the q-value to get a lighter matrix. Then, we need to “align” the peaks that are mapped to these regions, and thus generate a matrix with the different regions in row and the position around the peak centers in columns. The resulting matrix will be 2000*2+1 columns and 10000 rows.

Step by step:
- Compute the centers of BRG1 peaks using the GenomicRanges `resize`` function.

Show code: compute the peak centers for BRG1

# Compute the centers of BRG1 peaks
peak_centers_BRG1 <- resize(peakfilt$BRG1, width = 1, fix = "center")

Note that the mid function applied to a GRanges object returns the centers of the ranges but as an integer vector, not a GRanges object.

  • Now select the BRG1 peak centers that correspond to the 10,000 most significant BRG1 peaks (based on qvalue)

Show code: Select peak centers for the top 10k peaks

# For computation and memory efficiency reasons,
# we subset the top 10K peaks according to the qvalue
top10k_BRG1 <- peak_centers_BRG1[order(peak_centers_BRG1$qvalue, 
                                       decreasing = TRUE)][1:10000]
  • Create a GRanges object with the regions corresponding to +/-2kb around these peak centers.

Show code: Create GRanges with +/-2kb around top10k BRG1 peak centers

# Generate GRanges of peak centers +/-2kb for the 10K top peaks
top10k_BRG1_extend2kb <- top10k_BRG1 + 2000
# Note that we could have obtained this GRanges for all peaks directly using:
# BRG1_extend2kb <- resize(peakfilt$BRG1, width = 4001, fix = "center"))

4.1.1 BRG1 heatmap and average profile with ChIPseeker

First, we will use the functions available in the ChIPseeker package to generate the heatmap. Later, we will compare the results to those obtained with the genomation package.

  • ChIPseeker actually requires that we generate the regions of interest (top10k BRG1 peak centers +/-2kb) using the makeBioRegionFromGranges.

Show code: prepare the “BioRegions” using ChIPseeker function

BRG1_extend_2kb <- makeBioRegionFromGranges(top10k_BRG1,
                                            by = "peaks",
                                            type = "start_site",
                                            upstream = 2000,
                                            downstream = 2000)
  • We now have the regions that we need to map the peaks to. Create the matrix using the getTagMatrix function and the parameter windows.

Show code: Create the matrix

## compute the coverage of peaks in these regions
BRG1_tagmat <- getTagMatrix(peakfilt$BRG1, 
                            windows = BRG1_extend_2kb)
## >> preparing start_site regions by peaks... 2025-05-29 10:56:26 AM
## >> preparing tag matrix...  2025-05-29 10:56:26 AM
  • Now display the matrix with tagHeatmap.

Show code: Display the heatmap

## plot the heatmap
tagHeatmap(BRG1_tagmat)

We will see below that the rows have been reordered by their average signal (i.e. what you would get by using rowMeans on the tagMatrix).

Here we used the centers of BRG1 peaks to define the regions of interest, so it is not surprising that peaks are nicely aligned on these centers and, because of the sorting applied to the rows, that wide peaks are at the top and narrower peaks at the bottom of the heatmap.

  • Now let’s see how the distribution of peaks look around the TSS. ChIPseeker provides an easy function to do this: peakHeatmap that uses directly the TxDb object that we used previously.

Show code: Plot BRG1 peak distribution around the TSS

## plot the heatmap of BRG1 peaks around the TSS
peakHeatmap(peakfilt$BRG1, TxDb = txdb, 
            upstream = 2000, downstream = 2000)
## >> preparing promoter regions...  2025-05-29 11:12:09 AM 
## >> preparing tag matrix...        2025-05-29 11:12:09 AM 
## >> preparing start_site regions by gene... 2025-05-29 11:12:10 AM
## >> preparing tag matrix...  2025-05-29 11:12:10 AM 
## >> generating figure...       2025-05-29 11:12:17 AM 
## >> done...            2025-05-29 11:12:20 AM

We can also display the average profile of BRG1 peaks across these regions. This corresponds to calculating the mean signal of each column of the matrix and plotting the result. So these “average profiles” are sometimes plotted above the heatmap.

In ChIPseeker we can do this on the tagMatrix object using the plotAvgProf function.

Show code: Plot the average profile of BRG1 peaks around their centers

## compute the coverage of peaks in these regions
plotAvgProf(BRG1_tagmat,
            xlim = c(-2000, 2000),
            xlab = "Distance to peak center",
            ylab = "Peak Count Frequency")
## >> plotting figure...             2025-05-29 11:14:31 AM

NOTE:: this is especially useful when groups of regions have been defined on the matrix, for example representing different binding profiles of the factor.

4.1.2 BRG1 heatmap with genomation

Now let’s see how to obtain similar results with the genomation package.

Step by step:
- The equivalent of the tagMatrix function is SignalMatrix, but the latter directly accepts a regular GRanges object to specify the windows / regions of interest.

Show code: Prepare the matrix

## Prepare the matrix of BRG1 peaks across BRG1 peak centers +/-2kb
sm_BRG1 <- ScoreMatrix(target = peakfilt$BRG1, 
                       windows = top10k_BRG1_extend2kb)
  • Plot the heatmap with heatMatrix.

Show code: Plot the heatmap

## Plot the heatmap
heatMatrix(sm_BRG1, xcoords = c(-2000, 2000))

You notice that the heatmap doesn’t look like the one generated by the ChIPseeker function tagHeatmap. This is because by default, the heatMatrix function does not reorder the rows. Since we have sorted the rows by decreasing qvalue, we can verify this by using the qvalue as a weight when calculating the coverage of BRG1 peaks over the regions. This is done by using the weight.col argument in ScoreMatrix.

  • Color the coverage by qvalue using the weight.col argument in ScoreMatrix.

Show code: Plot the heatmap with qvalue as weight

## Plot the heatmap using qvalue as weight
heatMatrix(ScoreMatrix(target = peakfilt$BRG1, 
                       windows = top10k_BRG1_extend2kb, 
                       weight.col = "qvalue"), 
           xcoords = c(-2000, 2000))

  • Now use the order argument in heatMatrix to reorder the rows

Show code: Plot the heatmap reordering the rows

## Plot the heatmap with reordered rows
heatMatrix(sm_BRG1, xcoords = c(-2000, 2000), order = TRUE)

Take a look at ?heatMatrix and at the genomation vignette, there are other useful arguments, for example to make groups or cluster rows.

  • Now let’s plot the distribution of BRG1 peaks around the TSS using genomation. To do this, we will use the gene and promoters functions on the TxDb object to extract the coordinates of the regions around the TSS.

Show code: Heatmap of BRG1 peaks around the TSS with genomation

## we also add a confidence interval around the mean
heatMatrix(ScoreMatrix(target = peakfilt$BRG1, 
                       windows = promoters(genes(txdb),
                                           upstream = 2000,
                                           downstream = 2001)), 
           xcoords = c(-2000, 2000),
           order = TRUE)

Again, we notice that ChIPseeker function had filtered out all rows with zero coverage around the TSS, while genomation function does not.

  • genomation has the plotMeta function to create average profiles

Show code: Make an average profile with genomation

## Average profile with genomation
## we also add a confidence interval around the mean
plotMeta(sm_BRG1, xcoords = c(-2000, 2000), dispersion = "se")

4.1.3 Multiple heatmaps: comparing factors across identical regions

Heatmaps are useful to compare the profiles obtained from different factors on the same regions. For examples, we could ask: How is the signal for BRG1 and MITF distributed around SOX10 peaks. genomation has useful functions to analyse a list of several ChIP-seq results, provided as GRanges or RleList objects.

Step by step:
- Get the +/-2kb regions around SOX10 peak centers.

Show code: Get +/-2kb regions around SOX10 peak centers

## Obtain regions of interest (SOX10 peak centers +/- 2kb)
SOX10_extend2kb <- resize(peakfilt$SOX10, width = 4001, fix = "center")
  • Use the ScoreMatrixList function to obtain the profiles of all factors in these regions.

Show code: Get the ScoreMatrix for each factors

## Get the coverage of peaks for all factors around SOX10 peak centers
sm_sox10peaks <-  ScoreMatrixList(target = peakfilt[c("SOX10", "BRG1", "MITF")], 
                                  windows = SOX10_extend2kb)
  • Plot the heatmaps side by side using multiHeatMatrix.

Show code: Plot the 3 heatmaps side by side

## Get the coverage of peaks for all factors around SOX10 peak centers
## We order the rows by the average signal across all matrices and use a common color scale for all heatmaps
multiHeatMatrix(sm_sox10peaks, 
                xcoords = c(-2000, 2000),
                 order = TRUE,
                common.scale = TRUE)

There seems to be little link between SOX10 and MITF binding but there is some link between BRG1 binding and SOX10

4.2 Read enrichment in regions of interest

So far, we’re only looking at heatmaps of peak presence/absence. Even if we sort the regions by their intensity or by qvalue, this gives little quantitative information on the actual links between the factors.
To dig into this, we will look at the coverage of reads, instead of peaks, in specific regions.

4.2.1 Heatmap of BRG1 read coverage

We will first look at the enrichment of reads from the BRG1 ChIP in regions covering +/-5kb around BRG1 peak centers. Because this represent ~10kb regions we will create bins and summarize peak density within these bins.

Step by step:
- Get the +/-5kb regions around BRG1 peak centers (only the top10K peaks).

Show code: Get +/-5kb regions around BRG1 peak centers

## Obtain regions of interest (BRG1 peak centers +/- 5kb)
extended.5K.BRG1 <- top10k_BRG1 + 5000
  • Import the read coverage for BRG1 ChIP.

Show code: Import the bigwig file for BRG1 using rtracklayer import function

## Import BRG1 bigwig file as an RleList
cvg.BRG1 <- import(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw),
                      format = "BigWig",
                      which = extended.5K.BRG1,
                      as = "RleList")
cvg.BRG1
## RleList of length 25
## $chr1
## numeric-Rle of length 248956422 with 1776460 runs
##   Lengths: 975789     11      5     35      2 ...   1288     73     36  22594
##   Values :   0.00   0.32   0.81   1.13   0.97 ...   0.00   0.17   0.33   0.00
## 
## $chr10
## numeric-Rle of length 133797422 with 693950 runs
##   Lengths: 270215     30     64     37      6 ...     90     58    154 272220
##   Values :   0.00   0.65   0.97   0.81   0.65 ...   0.65   0.00   0.33   0.00
## 
## $chr11
## numeric-Rle of length 135086622 with 740549 runs
##   Lengths: 202802     21    117    221     41 ...      9     70    154  74826
##   Values :   0.00   0.49   0.16   0.00   0.16 ...   0.16   0.00   0.16   0.00
## 
## $chr12
## numeric-Rle of length 133275309 with 769359 runs
##   Lengths: 159158    105    616     25    129 ...     24    372    154  89209
##   Values :   0.00   0.16   0.00   0.16   0.49 ...   0.16   0.00   0.16   0.00
## 
## $chr13
## numeric-Rle of length 114364328 with 467206 runs
##   Lengths: 18172785       58        1       21 ...       44      110    40729
##   Values :     0.00     0.16     0.65     0.97 ...     0.49     0.16     0.00
## 
## ...
## <20 more elements>

The read coverage is imported as an RleList where each element represents a chromosome and is a long vector of coverage values along this chromosome, represented as a read-length encoded (Rle) vector.
Here, we used the rtracklayer import function but below we will also illustrate the readBigWig function from the genomation package.

  • Summarize the coverage in 200 bins.

Show code: Use the ScoreMatrixBin function from genomation tu summarize the coverage in 200 bins

## Obtain the matrix with 200 bins
smr_BRG1 <- ScoreMatrixBin(target = cvg.BRG1, 
                           windows = extended.5K.BRG1, 
                           bin.num = 200)
  • Plot the heatmap.

Show code: Use the heatMatrix function to plot the heatmap

## Plot the heatmap, ordering the regions by their average intensity
heatMatrix(smr_BRG1, 
           xcoords = c(-5000, 5000),
           order = TRUE) 

  • A few extreme values in the matrix are generating a color scale that doesn’t allow us to visualize the majority of the data. Use the winsorize argument in heatMatrix to correct the color scale.

Show code: Use the winsorize argument in heatMatrix to adjust the color scale

## Plot the heatmap, ordering the regions by their average intensity
heatMatrix(smr_BRG1, 
           xcoords = c(-5000, 5000),
           winsorize = c(0, 99.9),
           order = TRUE) 

  • Try the scaleScoreMatrix function from genomation to see the effect on the heatmap

Show code: Plot the heatmap of a scaled matrix

## Plot the heatmap for the scaled matrix
heatMatrix(scaleScoreMatrix(smr_BRG1), 
           xcoords = c(-5000, 5000), 
           order = TRUE) 

  • Identify different patterns of read coverage using the clustfunargument

Show code: Use the clustfun argument to group the peaks based on their patterns of read coverage by using k-means clustering (e.g. 5 groups)

## Make 5 groups of peaks based on the pattern of read coverage around the peaks
set.seed(123) # for reproducibility since we're using k-means clustering
heatMatrix(smr_BRG1, 
           xcoords = c(-5000, 5000), 
           winsorize = c(0, 99.9),
           clustfun = function(x) kmeans(x, centers=5)$cluster) 

In practice we would probably prefer to perform the clustering outside of the call to the heatMatrix function. But when the clustfun argument is used, the function returns the clustering result so that we can reuse it for other analysis.

4.2.2 MultiHeatmap to compare read coverage of the 3 factors

Now let’s investigate the read coverage of the 3 factors around SOX10 peaks. We will choose a window of +/-3kb around SOX10 peaks and sort them by decreasing qvalue. We will also summarize the coverages across these regions using 300 bins.

Step by step:
- Get the +/-3kb regions around SOX10 peak centers sorted by decreasing qvalues.

Show code: Get +/-3kb regions around sorted SOX10 peak centers

## Obtain regions of interest (SOX10 peak centers +/- 2kb)
SOX10_extend3kb_sorted <- resize(peakfilt$SOX10[order(peakfilt$SOX10$qvalue, 
                                                      decreasing = TRUE)],
                                 width = 6001,
                                 fix = "center")
  • Import the bigwig files for the 3 factors using the readBigWig function from the genomation package.

Show code: Import read coverage for the 3 factors

## Import read coverage for the 3 factors
cvg <- list()
cvg[["SOX10"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$sox10.bw),
                             windows = SOX10_extend3kb_sorted)
cvg[["BRG1"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw),
                            windows = SOX10_extend3kb_sorted)
cvg[["MITF"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$mitf.bw),
                            windows = SOX10_extend3kb_sorted)
  • Get the ScoreMatrices using the ScoreMatrixList

Show code: Compute score matrices for the 3 factors using 300 bins

## Obtain the score matrices
smr_at_SOX10peaks <- ScoreMatrixList(target = cvg, 
                                     windows = SOX10_extend3kb_sorted, 
                                     bin.num = 300)
  • Plot the heatmaps (use winsorization to adjust the color scale)

Show code: Plot the heatmaps using multiHeatMatrix

## Plot the heatmaps side-by-side
multiHeatMatrix(smr_at_SOX10peaks, 
                xcoords = c(-3000, 3000), 
                winsorize = c(0, 99.9))

Note that if you use the argument order = TRUE, the rows will be sorted by the average of the signal across the 3 matrices, which is why it is important to sort the regions of interest before plotting the heatmaps and to use a function that keeps this row order.

4.2.3 Customized heatmaps with the EnrichedHeatmap package

So far, we have used functions from ChIPseeker and genomation to plot the heatmaps. While these functions offer some level of customization, you may need further customization in certain cases. The EnrichedHeatmap package offers an almost complete level of customization to plot heatmaps and heatmap annotations from functional genomic experiments.
The package is built on the ComplexHeatmap package, which has an extensive documentation and examples.

As an example, let’s plot the enrichment of SOX10 and BRG1 reads around SOX10 peaks.

Step by step:
- First convert the ScoreMatrix objects to normalizedMatrix objects that are used by EnrichedHeatmap. You can create a normalizedMatrix using the normalizeToMatrix function but since we already computed the matrices, we will instead convert them using as.normalizedMatrix

Show code: Convert a ScoreMatrix to a normalizedMatrix

## Convert signal matrices to normalizedMatrix objects
normat <- list()

## For BRG1 signal:
normat[["BRG1"]] <- as.normalizedMatrix(
  as(smr_at_SOX10peaks[["BRG1"]], "matrix"),
  k_upstream = 150,
  k_downstream = 150,
  k_target = 0,
  extend = c(3000, 3000),
  signal_name = "BRG1",
  target_name = "SOX10 peak centers"
)

## For SOX10 signal:
normat[["SOX10"]] <- as.normalizedMatrix(
  as(smr_at_SOX10peaks[["SOX10"]], "matrix"),
  k_upstream = 150,
  k_downstream = 150,
  k_target = 0,
  extend = c(3000, 3000),
  signal_name = "SOX10",
  target_name = "SOX10 peak centers"
)
  • Plot BRG1 heatmap using default parameters

Show code: Plot BRG1 EnrichedHeatmap

## BRG1 EnrichedHeatmap
EnrichedHeatmap(normat$BRG1, name = "BRG1")
We notice that the heatmap is sorted by increasing average signal across the regions.
  • Now use the syntax of EnrichedHeatmap to plot both SOX10 and BRG1 heatmaps

Show code: Plot SOX10 + BRG1 EnrichedHeatmaps

## SOX10 + BRG1 EnrichedHeatmap
EnrichedHeatmap(normat$SOX10, name = "SOX10") +
  EnrichedHeatmap(normat$BRG1, name = "BRG1")