During this training session, we are going to use ChIP-seq and RNA-seq data from Laurette et al. (Laurette et al. 2015) which are available in GEO as GSE61967. Data have already been processed.
Reads were preprocessed in order to remove adapter, polyA and low-quality sequences (Phred quality score below 20). After this preprocessing, reads shorter than 40 bases were discarded for further analysis. These preprocessing steps were performed using cutadapt (Martin 2011) version 1.10.
Reads were mapped onto the hg38 assembly of Homo sapiens genome using STAR (Dobin et al. 2013) version 2.5.3a.
Gene expression quantification was performed from uniquely aligned reads using htseq-count version 0.6.1.p1 (S. Anders, Pyl, and Huber 2015), with annotations from Ensembl version 103 and “union” mode. Only non-ambiguously assigned reads have been retained for further analyses.
Read counts have been normalized across samples with the median-of-ratios method proposed by Anders and Huber (Anders and Huber 2010), to make these counts comparable between samples.
Differential expressed genes between wild type ans shBRG1 cells was
performed using the edgeR
R package (Robinson, McCarthy, and Smyth 2010). We called
significant changes when FDR < 0.01, absolute log fold change over 1
and minimum average log normalized count over 5.
Reads were mapped to Homo sapiens genome (assembly hg38) using Bowtie (Langmead et al. 2009) v1.0.0 with default parameters except for “-p 3 -m 1 –strata –best –chunkmbs 128”. The following table shows the number of reads aligned to the Homo sapiens genome.
Sample ID | Sample name | raw reads | aligned reads | multimapped reads | unmapped reads |
---|---|---|---|---|---|
SRR1594290 | MITF_CHIP-seq | 84856252 | 54369181 (64.07%) | 12823841 (15.11%) | 17663230 (20.82%) |
SRR1594291 | SOX10_CHIP-seq | 38202152 | 29594694 (77.47%) | 6128890 (16.04%) | 2478568 (6.49%) |
SRR1594292 | BRG1siCTRL_CHIP-seq | 50190992 | 40134919 (79.96%) | 6403809 (12.76%) | 3652264 (7.28%) |
SRR1594293 | BRG1siMITF_CHIP-seq | 64765675 | 44480241 (68.68%) | 9357361 (14.45%) | 10928073 (16.87%) |
SRR1594294 | GFPsiCTRL_CHIP-seq | 38148419 | 26273410 (68.87%) | 5722823 (15.00%) | 6152186 (16.13%) |
SRR1594295 | MITF_Input | 29433042 | 19970925 (67.85%) | 4049711 (13.76%) | 5412406 (18.39%) |
SRR1594296 | SOX10_Input | 35449561 | 27381422 (77.24%) | 7045192 (19.87%) | 1022947 (2.89%) |
SRR1596499 | BRG1siSOX10_CHIP-seq | 42745544 | 34418403 (80.52%) | 6855741 (16.04%) | 1471400 (3.44%) |
Prior to peak calling, reads falling into Encode blacklisted regions (“(2014) Mod/Mouse/humanENCODE: Blacklisted Genomic Regions for Functional Genomics Analysis - Anshul Kundaje” n.d.) were removed using bedtools intersect v2.26.0 (Quinlan and Hall 2010). Then peak calling was done with Macs2 v2.1.1 with default parameters.
IP sample | Input sample | No. of peaks |
---|---|---|
BRG1siCTRL_CHIP-seq | GFPsiCTRL_CHIP-seq | 72024 |
BRG1siMITF_CHIP-seq | GFPsiCTRL_CHIP-seq | 33984 |
BRG1siSOX10_CHIP-seq | GFPsiCTRL_CHIP-seq | 73267 |
MITF_CHIP-seq | MITF_Input | 9702 |
SOX10_CHIP-seq | SOX10_Input | 4538 |
Normalized BigWig files were generated using Homer (Heinz et al. 2010) makeUCSCfile v4.11.0 with the following parameter ‘-norm 1e7’ meaning that data were normalized to 10M reads.
Before starting any project, it is a good idea to setup the path to the project as a variable. For example, we could use:
projPath <- "/shared/projects/2514_ebiii_n2/chipseq"
## If you are not using RStudio, you can change the working directory with:
setwd(projPath) ##NOT RUN
If you use an Rmd file and RStudio, when you open the file, your working directory should already be set as the folder where this Rmd file is located. You can check it with:
Here, we will use different files for our analyses. We can create a list containing the absolute paths or names to the different files and folders that we will use during the practice session.
params <- list(
workdir = "/shared/projects/2514_ebiii_n2/chipseq",
mapping.chipseq = "stats_mappingChIPseq.tsv",
peakcalling.chipseq = "stats_peakCalling.tsv",
brg1 = "BRG1siCTRL_CHIP-seq_peaks.narrowPeak",
mitf = "MITF_CHIP-seq_peaks.narrowPeak",
sox10 = "SOX10_CHIP-seq_peaks.narrowPeak",
brg1.bw = "BRG1siCTRL_CHIP-seq.bw",
mitf.bw = "MITF_CHIP-seq.bw",
sox10.bw = "SOX10_CHIP-seq.bw",
rnaseq = "RNAseq_diff_norm.rds",
annot = "annot.rds"
)
Peak files are in narrowPeak format which is of the form:
There are different ways to import such data. We could use a simple read.table
function but we know that it would not convert the 0-based coordinates to the
1-based coordinates used in R/Bioconductor.
The ChIPseeker
package provides a readPeakFile
function that we could use like this:
##NOT RUN
library(ChIPseeker)
peaks <- list()
peaks[["BRG1"]] <- readPeakFile(file.path(params$workdir,"data", "ChIPseq", params$brg1),
as="GRanges")
However, the function is essentially a wrapper to the read.delim
function and
does not assign useful column names to the imported data. So instead we will use
the readNarrowPeak
function from the
genomation
package to import the narrowPeak files obtained for the 3 factors (BRG1, MITF and SOX10):
library(genomation)
peaks <- list()
peaks[["BRG1"]] <- genomation::readNarrowPeak(file.path(params$workdir,
"data", "ChIPseq",
params$brg1))
peaks[["MITF"]] <- readNarrowPeak(file.path(params$workdir,
"data", "ChIPseq",
params$mitf))
peaks[["SOX10"]] <- readNarrowPeak(file.path(params$workdir,
"data", "ChIPseq",
params$sox10))
The package also provides a readGeneric
function that can adapt to different
peak file formats. For example, the following command is equivalent to what we
did above for BRG1:
## NOT RUN
peaks[["BRG1"]] <- readGeneric(file.path(params$workdir,"data", "ChIPseq", params$mitf),
meta.cols = list(name = 4, score = 5, signalValue = 7,
pValue = 8, qValue = 9, peak = 10),
zero.based = TRUE)
Let’s take a look at how the data was imported:
## GRanges object with 72024 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 980402-981178 * | BRG1siCTRL_CHIP-seq_.. 77
## [2] chr1 983476-984526 * | BRG1siCTRL_CHIP-seq_.. 108
## [3] chr1 1000729-1001179 * | BRG1siCTRL_CHIP-seq_.. 90
## [4] chr1 1001762-1002054 * | BRG1siCTRL_CHIP-seq_.. 40
## [5] chr1 1020914-1021207 * | BRG1siCTRL_CHIP-seq_.. 114
## ... ... ... ... . ... ...
## [72020] chrY 18936757-18937116 * | BRG1siCTRL_CHIP-seq_.. 37
## [72021] chrY 19577774-19578316 * | BRG1siCTRL_CHIP-seq_.. 73
## [72022] chrY 19891214-19892094 * | BRG1siCTRL_CHIP-seq_.. 114
## [72023] chrY 19892553-19893192 * | BRG1siCTRL_CHIP-seq_.. 55
## [72024] chrY 21837647-21838039 * | BRG1siCTRL_CHIP-seq_.. 55
## signalValue pvalue qvalue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 5.81321 10.22971 7.71662 166
## [2] 5.97792 13.59661 10.83201 384
## [3] 6.65677 11.62894 9.01201 199
## [4] 3.78100 6.17367 4.02109 209
## [5] 8.11634 14.29797 11.46427 158
## ... ... ... ... ...
## [72020] 4.53175 5.93549 3.79936 167
## [72021] 6.32404 9.87728 7.38650 165
## [72022] 7.86030 14.29797 11.46427 711
## [72023] 5.29987 7.83781 5.50942 390
## [72024] 5.55592 7.83781 5.50942 152
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Peaks are stored as GenomicRanges (or GRanges) objects; this is an R format which looks like the bed format, but is optimized in terms of memory requirements and speed of execution.
NOTE:: when using GRanges it is always a good idea to set the
seqinfo
metadata (info about chromosome names and length essentially). We will do that later below.
We can start by computing some basic statistics on the peak sets.
Compute the number of peaks per dataset. We use here a function from the
apply
family which help to apply recursively a given function on
elements of the object. We will often use these functions along the
course.
The sapply()
function typically takes a list or vector as input and
applies the same function (here length) to each element:
## BRG1 MITF SOX10
## 72024 9702 4538
Note that in this specific case, base R provides a lengths
function (lengths
with an s at the end) that does the same, i.e. gives the length of each
element of a list object:
Make a simple barplot
showing the number of BRG1 peaks.
Show code: barplot
Let’s create a barplot with ggplot2 out of this data.
Step by step:
Do not forget to load the ggplot2 library
Show code: create the data.frame
geom_*
Show code: barplot with ggplot2
Show code: modify axes labels and theme
#make the barplot a bit nicer:
ggplot(peak.lengths, aes(x = IP, y = NbPeaks)) +
geom_col() +
labs(x = "ChIP", y = "Number of peaks") +
theme_bw()
We can customize the plots by adding colors.
There are a number of color palettes that you can use in R and different ways to specify colors. For example, the Rcolorbrewer package is widely used and offers several color palettes:
The ggsci package offers color palettes based on scientific journals.
If you are searching for a specific color, take a look at this site.
Now lets add colors to the barplot.
Step by step:
- Add the new information fill=IP to let ggplot know that colors change
based on chipped protein
Show code: color according to the chipped TF (column IP)
ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) +
geom_col() +
labs(x = "ChIP", y = "Number of peaks") +
theme_bw()
We want to use colors from RColorBrewer
library with the “Set1” color
palette. To set your own colors to fill the plot, several functions are
available (scale_fill_*
), here we use scale_fill_brewer
.
Show code: color according to the chipped TF (column IP) and set RColorBrewer palette to Set1
ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) +
geom_col()+
scale_fill_brewer(palette="Set1") +
labs(x = "ChIP", y = "Number of peaks") +
theme_bw()
And now use one of the ggsci
palette.
The peaks were loaded as a GRanges
objects.
- To manipulate them we load the GenomicRanges
library. Then we use the
appropriate function to retrieve the width of the peaks from the
GRanges
and use a function to summarize the peak length (display the
statistics like minimum, maximum, quartiles, etc.)
Show code: Display peak length statistics for BRG1 only
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 254.0 308.0 427.0 578.7 671.0 9186.0
Create a simple boxplot
of the peak sizes.
Show code: Display peak length statistics as a boxplot
# use a log scale on the y axis to facilitate the comparisons
peak.width <- lapply(peaks, width)
boxplot(peak.width, ylab = "peak width", log = "y")
Now, create a nice looking boxplot with ggplot2
. ggplot
takes a data
frame as input. We can either create a date frame, this is what we have
done when we created the barplot. Here, we are going to use the package
reshape2
which, among all its features, can create a data frame from
other types of data, in particular list objects, via the melt
function:
## value L1
## 1 777 BRG1
## 2 1051 BRG1
## 3 451 BRG1
## 4 293 BRG1
## 5 294 BRG1
## 6 351 BRG1
Create a ggplot object with correct aesthetics to display a boxplot
according the chipped TF (x-axis) and their width (y-axis). Then use
appropriate geometric object geom_*
.
Show code: Display peak length statistics with ggplot2
By default, the background color is grey and often does not allow to
highlight properly the graphs. Many theme
are available defining
different graphics parameters, they are often added with a function
theme_*()
, the function theme()
allows to tune your plot parameter
by parameter. We propose here to use theme_classic()
that changes a
grey background to white background. Some peaks are very long and
squeeze the distribution, one way to make the distribution easier to
visualize is to transform the y axis to log scale. Finally, we want to
remove the x label, change the y label to ‘Peak widths’ and the legend
relative to the fill color from “L1” to “TF”.
Using the ggplot2 documentation
(e.g. tidyverse
or google it,
try to enhance the boxplot.
Show code: Enhance it !
# - theme_classic() change grey background to white background
# - fill=L1 and scale_fill_brewer(palette="Set1") colors boxplots
# based on chipped protein and with colors from RColorBrewer Set1 palette
# - labs changes x and y axis labels and legend title
# - scale_y_log10() set y axis to a log scale so that we can have a nice
# view of the data in small values
ggplot(peak.width.table, aes(x = L1, y = value, fill = L1)) +
geom_boxplot()+
theme_classic()+
scale_fill_brewer(palette = "Set1")+
labs(x = "", y = "Peak sizes", fill = "TF")+
scale_y_log10()
To make sure we keep only high quality data. We are going to select
those peaks having a \(qvalue >= 4\). In the GRanges metadata columns (mcols
)
one of them is the qvalue. So, we are going to set a threshold on this column.
How many peaks are selected ?
Show the code: select high quality peaks for BRG1 and count them
## [1] 60184
Now we want to apply this function to all IPs.
Sometime, peaks may occur more on some chromosomes than others or there can
be regions where peaks are absent (example: repetitive regions that have been
filtered out). We can display the genomic distribution of peaks along the
chromosomes, using the covplot
function from ChIPSeeker
.
Height of peaks is drawn based on the peak scores.
Let’s use this function on SOX10 for example:
Here, the aim is to identify where the peaks tend to be located relative to
typical genome annotations such as promoters, exons, introns, intergenic
regions, etc.
We will assign peaks to genomic features (genes, introns,
exons, promoters, distal regions, etc.) by checking if they overlap with these
features.
We will illustrate this task using functions from the ChIPseeker package.
But first we need to retrieve the annotations
We load ready-to-use annotation objects from Bioconductor:
org.Hs.eg.db
andTxDb.Hsapiens.UCSC.hg38.knownGene
## org.Hs.eg.db is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
library(org.Hs.eg.db)
## Load transcript-oriented annotations
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
Since we now have a TxDb object that contains the seqinfo
metadata, we take
the opportunity to transfer this metadata to our peaks.
This is done using the function annotatePeak
which compares peak
positions with the coordinates of genomic features on the same reference genome.
This function returns a complex object which contains all this
information.
As we did for the peaks coordinates, we store the peak annotations
in a list that we first initialize as an empty list
peakAnno <- list()
.
Here, we define the TSS regions -1000 bp to +100 bp from the TSS itself.
Using the help of the function and annotation objects
we have loaded, try to write the correct command line to annotate BRG1
peaks.
Show code: annotate BRG1 peaks
## Annotate peaks for all datasets and store it in a list
## Here TSS regions are regions -1000Kb/+100b arount TSS positions
## Peak annotations are stored in a list
peakAnno <- list()
peakAnno[["BRG1"]] = annotatePeak(peakfilt$BRG1,
tssRegion = c(-1000, 100),
TxDb = txdb,
annoDb = "org.Hs.eg.db")
## >> preparing features information... 2025-05-29 11:10:43 AM
## >> identifying nearest features... 2025-05-29 11:10:44 AM
## >> calculating distance from peak to TSS... 2025-05-29 11:10:46 AM
## >> assigning genomic annotation... 2025-05-29 11:10:46 AM
## >> adding gene annotation... 2025-05-29 11:11:12 AM
## >> assigning chromosome lengths 2025-05-29 11:11:12 AM
## >> done... 2025-05-29 11:11:12 AM
## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"
## Visualize and export annotation as a data table
# as.data.frame(peakAnno$BRG1)
head(as.data.frame(peakAnno$BRG1))
## seqnames start end width strand name score
## 1 chr1 980402 981178 777 * BRG1siCTRL_CHIP-seq_peak_1 77
## 2 chr1 983476 984526 1051 * BRG1siCTRL_CHIP-seq_peak_2 108
## 3 chr1 1000729 1001179 451 * BRG1siCTRL_CHIP-seq_peak_3 90
## 4 chr1 1001762 1002054 293 * BRG1siCTRL_CHIP-seq_peak_4 40
## 5 chr1 1020914 1021207 294 * BRG1siCTRL_CHIP-seq_peak_5 114
## 6 chr1 1059581 1059989 409 * BRG1siCTRL_CHIP-seq_peak_8 68
## signalValue pvalue qvalue peak
## 1 5.81321 10.22971 7.71662 166
## 2 5.97792 13.59661 10.83201 384
## 3 6.65677 11.62894 9.01201 199
## 4 3.78100 6.17367 4.02109 209
## 5 8.11634 14.29797 11.46427 158
## 6 5.43367 9.31879 6.86163 239
## annotation geneChr geneStart geneEnd
## 1 5' UTR 1 975198 982093
## 2 Distal Intergenic 1 976121 982117
## 3 Promoter 1 1001138 1014540
## 4 Intron (ENST00000624697.4/9636, intron 1 of 2) 1 1001145 1014435
## 5 Intron (ENST00000379370.7/375790, intron 1 of 35) 1 1020123 1056118
## 6 Promoter 1 1059687 1066449
## geneLength geneStrand geneId transcriptId distanceToTSS
## 1 6896 2 84808 ENST00000433179.4 915
## 2 5997 2 84808 ENST00000694917.1 -1359
## 3 13403 1 9636 ENST00000624697.4 0
## 4 13291 1 9636 ENST00000624652.1 617
## 5 35996 1 375790 ENST00000620552.4 791
## 6 6763 1 100288175 ENST00000688131.1 0
## ENSEMBL SYMBOL GENENAME
## 1 ENSG00000187642 PERM1 PPARGC1 and ESRR induced regulator, muscle 1
## 2 ENSG00000187642 PERM1 PPARGC1 and ESRR induced regulator, muscle 1
## 3 ENSG00000187608 ISG15 ISG15 ubiquitin like modifier
## 4 ENSG00000187608 ISG15 ISG15 ubiquitin like modifier
## 5 ENSG00000188157 AGRN agrin
## 6 ENSG00000291156 LOC100288175 uncharacterized LOC100288175
All peak information contained in the peak list is retained in the
output of annotatePeak
. Positions and strand information of the nearest
genes are reported. The distance from the peak to the TSS of its nearest
gene is also reported. The genomic region where the peak lies is reported in
the annotation column. Since some annotation may overlap (e.g. a 5’ UTR is
always located in an exon), ChIPseeker adopted the following priority in
genomic annotations:
This hierachy can be customized using the parameter genomicAnnotationPriority.
The annotatePeak function reports detailed information when the annotation is Exon or Intron, for instance “Intron (ENST00000624697.4/9636, intron 1 of 2)” means that the peak is overlapping with the first intron of transcript ENST00000624697.4, whose corresponding Entrez gene ID is 9636 (Transcripts that belong to the same gene ID may differ in splice events).
The “annoDb” parameter is optional. If it is provided, some extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID are added.
–
Reminder: The TxDb class is a container for storing transcript annotations.
Bioconductor provides several packages containing TxDb objects for
model organisms sur as Human and mouse. For instance,
TxDb.Hsapiens.UCSC.hg38.knownGene
, TxDb.Hsapiens.UCSC.hg19.knownGene
for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene
and TxDb.Mmusculus.UCSC.mm9.knownGene
for mouse genome mm10 and mm9,
etc.
The user can also prepare their own TxDb by retrieving information from
UCSC Genome Bioinformatics and BioMart data resources using the R functions
makeTxDbFromBiomart
and makeTxDbFromUCSC
that are now in the
txdbmaker
package.
One can also create a TxDb objects for any organism using
an annotation file in GTF/GFF format using the function
makeTxDbFromGFF
of the txdbmaker package
(GenomicFeatures package in older R versions).
Expand to find Coturnix japonica example
## NOT RUN
## download GTF file
download.file("https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/93934/101/GCF_001577835.2_Coturnix_japonica_2.1/GCF_001577835.2_Coturnix_japonica_2.1_genomic.gtf.gz", "Coturnix_japonica_2.1.annotation.gtf.gz")
## Build TxDb object
library(GenomicFeatures)
txdb = makeTxDbFromGFF("Coturnix_japonica_2.1.annotation.gtf.gz")
## To save the txdb database
library(AnnotationDbi)
saveDb(txdb, 'txdb.Coturnix_japonica_2.1.sqlite')
## load it when needed
library(AnnotationDbi)
txdb = loadDb(file = 'txdb.Coturnix_japonica_2.1.sqlite')
–
We can vizualize the way our peaks have been annotated relative to genomic
features (introns, exons, promoters, distal regions,…)
using a pie chart with plotAnnoPie
or as a bar chart plotAnnoBar
which
facilitates the comparison between different sets of peaks.
Show code: Display peak annotations of BRG1 peaks using a pie chart
## distribution of genomic features for BRG1 peaks
# as a pie chart - which is the most widely used representation in publication
plotAnnoPie(peakAnno$BRG1)
Now we generate annotations for SOX10 and MITF peaks.
Show code: Generate annotations for SOX10 and MITF peaks
## Use the annotatePeak function
peakAnno[["SOX10"]] <- annotatePeak(peakfilt$SOX10,
tssRegion = c(-1000, 100),
TxDb = txdb,
annoDb = "org.Hs.eg.db")
## >> preparing features information... 2025-05-29 11:11:16 AM
## >> identifying nearest features... 2025-05-29 11:11:16 AM
## >> calculating distance from peak to TSS... 2025-05-29 11:11:17 AM
## >> assigning genomic annotation... 2025-05-29 11:11:17 AM
## >> adding gene annotation... 2025-05-29 11:11:20 AM
## >> assigning chromosome lengths 2025-05-29 11:11:20 AM
## >> done... 2025-05-29 11:11:20 AM
peakAnno[["MITF"]] <- annotatePeak(peakfilt$MITF,
tssRegion = c(-1000, 100),
TxDb = txdb,
annoDb = "org.Hs.eg.db")
## >> preparing features information... 2025-05-29 11:11:20 AM
## >> identifying nearest features... 2025-05-29 11:11:20 AM
## >> calculating distance from peak to TSS... 2025-05-29 11:11:20 AM
## >> assigning genomic annotation... 2025-05-29 11:11:20 AM
## >> adding gene annotation... 2025-05-29 11:11:23 AM
## >> assigning chromosome lengths 2025-05-29 11:11:24 AM
## >> done... 2025-05-29 11:11:24 AM
Now display the results as a stacked barplot using peakAnnoBar
.
Show code: Generate stacked barplot of peak annotations
However since some annotation overlap, ChIPseeker also provides functions
that help evaluate full annotation overlaps.
For example, the upsetplot
function which produces an
UpSet plot similar to those
produced by the library(UpSetR)
package.
We see that the peaks for the 3 factors SOX10, MITF and BRG1 are present to a certain extent in similar regions (although this could be due to the proportions of these regions in the genome). But this doesn’t tell us if the peaks actually overlap. An overlap between the peaks may mean that the factors co-localize, although we won’t know if it’s the case in single cells.
Overlaps between peak datasets can be evaluated using a Venn diagram. Such an approach is implemented in the ChIPpeakAnno package.
After loading the library, we start by computing the overlap between peak sets,
keeping the information of all peaks overlapping in each set
(see ?findOverlapsOfPeaks
for help).
ChIPpeakAnno imposes, while plotting the Venn diagram, to compute the
significance of the pairwise associations using a hypergeometric test.
To this end, we need to estimate the number of all potential binding
events which is used by the makeVennDiagram
function through the
totalTest
number. It is used for the hypergeometric sampling that is
used to determine if the overlap between two datasets is more than would
be expected by chance. This is not a trivial question, the answer is
driven by what you know about the binding properties of your factors
(eg. sequence specific, mainly intergenic etc). You can find an
interesting discussion
here.
In our case we can refer to the genomic distribution of the peaks that
we have plotted previously. We can assume that our TFs have a gene body
binding preference. Genes cover roughly 10% of the genome.
# Estimate the average size of the peaks ...
averagePeakWidth <- mean(width(unlist(GRangesList(ovl$peaklist))))
# ... to count how many potential sites could have been bound in coding regions.
tot <- ceiling(3.3e+9 * 0.1 / averagePeakWidth)
TIPS: We can define the colors attributed to each set using the
colours()
function. Whenever you want to set a color you can use this function.
Here are some useful pages about colors in R and elsewhere:
Now we can plot the overlap of the peaks as a Venn Diagram:
makeVennDiagram(ovl, totalTest = tot, connectedPeaks = "keepAll",
fill = brewer.pal(3, "Set1"), # circle fill color
col = brewer.pal(3, "Set1"), #circle border color
cat.col = brewer.pal(3, "Set1"))
According to the hypergeometric test p-values all pairwise comparisons
are highly significant. This has to be taken very carefully as it
largely depends on the background estimation that we may have
over-estimated. If you are not sure about your estimation, you can use a
non-parametric approach based on your peak genomic
distribution to estimate randomness. To this end the
TxDb.Hsapiens.UCSC.hg38.knownGene
is used.
ChIPpeakAnno provides the preparePool
and peakPermTest
functions to
compute this test. These tests are made by pair, we will look at
SOX10/MITF overlap significance as an example.
# Prepare a pool of random peaks following the characteristics of our peak sets
pool <- preparePool(txdb, peaks$SOX10,
bindingType = "TSS",
featureType = "transcript",
seqn = paste0("chr",c(1:22, "X", "Y")))
# Create the permPool object needed for peakPermTest
pool <- new("permPool", grs=pool$grs[1], N=length(peaks$SOX10))
SOX10.MITF <- peakPermTest(peaks$SOX10, peaks$MITF,
pool = pool,
seed = 123,
force.parallel = FALSE)
SOX10.MITF
## $cntOverlaps
## P-value: 0.0099009900990099
## Z-score: 149.1501
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 535
## Evaluation function: cntOverlaps
## Randomization function: randPeaks
##
## attr(,"class")
## [1] "permTestResultsList"
Venn diagrams are widely used to represent overlaps, intersections. However, in ChIP-seq analysis, the definition of the peaks is dependent on the peak caller, the FDR thresholds (what about peaks just below the threshold?). The overlap is also difficult to assess, indeed do we call an overlap of 1 nucleotide an intersection ? It is one of the numerous parameters that can be tuned…
Another way to explore the results is to evaluate if different combinations of
the factors are associated to specific genomic features. We can use again the
annotatePeak
function to evaluate this:
coocs <- as(ovl$peaklist, "GRangesList")
# apply annotatePeak to each set of peaks in the list coocs
peakAnnoList <- lapply(coocs, annotatePeak,
TxDb = txdb, annoDb = "org.Hs.eg.db",
tssRegion = c(-1000, 100),
verbose = FALSE)
And then plot the results:
Heatmaps and average profiles are widely used representations of ChIP-seq data as they allow simultaneous visualization of read enrichment at various locations. For instance, one may want to represent reads related to a chipped protein in regions spanning +/-5Kb around all TSS of the reference genome. Another objective could be to compare read enrichment at the same locations in many chip-seq datasets.
We want to know if BRG1 is binding large and/or narrow regions, unique and/or tandem etc. So we will study the distribution of the BRG1 peaks that have been called in a region spanning +/-2Kb around the peak centers.
First of all, for calculating the profile of ChIP peaks around the BRG1 peak center, we need to define BRG1 peak centers and extend them 2Kb on each side. We will then select the 10000 best peaks using the q-value to get a lighter matrix. Then, we need to “align” the peaks that are mapped to these regions, and thus generate a matrix with the different regions in row and the position around the peak centers in columns. The resulting matrix will be 2000*2+1 columns and 10000 rows.
Step by step:
- Compute the centers of BRG1 peaks using the GenomicRanges `resize`` function.
Show code: compute the peak centers for BRG1
# Compute the centers of BRG1 peaks
peak_centers_BRG1 <- resize(peakfilt$BRG1, width = 1, fix = "center")
Note that the mid
function applied to a GRanges object returns the centers
of the ranges but as an integer vector, not a GRanges object.
Show code: Select peak centers for the top 10k peaks
GRanges
object with the regions corresponding to +/-2kb around
these peak centers.Show code: Create GRanges with +/-2kb around top10k BRG1 peak centers
First, we will use the functions available in the ChIPseeker package to generate the heatmap. Later, we will compare the results to those obtained with the genomation package.
makeBioRegionFromGranges
.Show code: prepare the “BioRegions” using ChIPseeker function
getTagMatrix
function and the parameter
windows
.Show code: Create the matrix
## compute the coverage of peaks in these regions
BRG1_tagmat <- getTagMatrix(peakfilt$BRG1,
windows = BRG1_extend_2kb)
## >> preparing start_site regions by peaks... 2025-05-29 10:56:26 AM
## >> preparing tag matrix... 2025-05-29 10:56:26 AM
tagHeatmap
.We will see below that the rows have been reordered by their average signal
(i.e. what you would get by using rowMeans
on the tagMatrix).
Here we used the centers of BRG1 peaks to define the regions of interest, so it is not surprising that peaks are nicely aligned on these centers and, because of the sorting applied to the rows, that wide peaks are at the top and narrower peaks at the bottom of the heatmap.
peakHeatmap
that uses directly the TxDb object that we used previously.Show code: Plot BRG1 peak distribution around the TSS
## plot the heatmap of BRG1 peaks around the TSS
peakHeatmap(peakfilt$BRG1, TxDb = txdb,
upstream = 2000, downstream = 2000)
## >> preparing promoter regions... 2025-05-29 11:12:09 AM
## >> preparing tag matrix... 2025-05-29 11:12:09 AM
## >> preparing start_site regions by gene... 2025-05-29 11:12:10 AM
## >> preparing tag matrix... 2025-05-29 11:12:10 AM
## >> generating figure... 2025-05-29 11:12:17 AM
## >> done... 2025-05-29 11:12:20 AM
We can also display the average profile of BRG1 peaks across these regions.
This corresponds to calculating the mean
signal of each column of the
matrix and plotting the result. So these “average profiles” are sometimes
plotted above the heatmap.
In ChIPseeker we can do this on the tagMatrix object using the
plotAvgProf
function.
Show code: Plot the average profile of BRG1 peaks around their centers
## compute the coverage of peaks in these regions
plotAvgProf(BRG1_tagmat,
xlim = c(-2000, 2000),
xlab = "Distance to peak center",
ylab = "Peak Count Frequency")
## >> plotting figure... 2025-05-29 11:14:31 AM
NOTE:: this is especially useful when groups of regions have been defined on the matrix, for example representing different binding profiles of the factor.
Now let’s see how to obtain similar results with the genomation package.
Step by step:
- The equivalent of the tagMatrix
function is SignalMatrix
, but the latter
directly accepts a regular GRanges object to specify the windows / regions
of interest.
Show code: Prepare the matrix
heatMatrix
.You notice that the heatmap doesn’t look like the one generated by the
ChIPseeker function tagHeatmap
. This is because by default, the
heatMatrix
function does not reorder the rows. Since we have sorted the
rows by decreasing qvalue, we can verify this by using the qvalue as a weight
when calculating the coverage of BRG1 peaks over the regions. This is done by
using the weight.col
argument in ScoreMatrix
.
weight.col
argument in ScoreMatrix
.Show code: Plot the heatmap with qvalue as weight
## Plot the heatmap using qvalue as weight
heatMatrix(ScoreMatrix(target = peakfilt$BRG1,
windows = top10k_BRG1_extend2kb,
weight.col = "qvalue"),
xcoords = c(-2000, 2000))
order
argument in heatMatrix
to reorder the rowsShow code: Plot the heatmap reordering the rows
Take a look at ?heatMatrix
and at the genomation vignette, there are other
useful arguments, for example to make groups or cluster rows.
gene
and promoters
functions
on the TxDb object to extract the coordinates of the regions around the TSS.Show code: Heatmap of BRG1 peaks around the TSS with genomation
## we also add a confidence interval around the mean
heatMatrix(ScoreMatrix(target = peakfilt$BRG1,
windows = promoters(genes(txdb),
upstream = 2000,
downstream = 2001)),
xcoords = c(-2000, 2000),
order = TRUE)
Again, we notice that ChIPseeker function had filtered out all rows with
zero coverage around the TSS, while genomation function does not.
plotMeta
function to create average profilesHeatmaps are useful to compare the profiles obtained from different factors on the same regions. For examples, we could ask: How is the signal for BRG1 and MITF distributed around SOX10 peaks. genomation has useful functions to analyse a list of several ChIP-seq results, provided as GRanges or RleList objects.
Step by step:
- Get the +/-2kb regions around SOX10 peak centers.
Show code: Get +/-2kb regions around SOX10 peak centers
ScoreMatrixList
function to obtain the profiles of all factors in
these regions.Show code: Get the ScoreMatrix for each factors
multiHeatMatrix
.Show code: Plot the 3 heatmaps side by side
## Get the coverage of peaks for all factors around SOX10 peak centers
## We order the rows by the average signal across all matrices and use a common color scale for all heatmaps
multiHeatMatrix(sm_sox10peaks,
xcoords = c(-2000, 2000),
order = TRUE,
common.scale = TRUE)
There seems to be little link between SOX10 and MITF binding but there is some link between BRG1 binding and SOX10
So far, we’re only looking at heatmaps of peak presence/absence. Even if we
sort the regions by their intensity or by qvalue, this gives little
quantitative information on the actual links between the factors.
To dig into this, we will look at the coverage of reads, instead of peaks,
in specific regions.
We will first look at the enrichment of reads from the BRG1 ChIP in regions covering +/-5kb around BRG1 peak centers. Because this represent ~10kb regions we will create bins and summarize peak density within these bins.
Step by step:
- Get the +/-5kb regions around BRG1 peak centers (only the top10K peaks).
Show code: Get +/-5kb regions around BRG1 peak centers
Show code: Import the bigwig file for BRG1 using rtracklayer import
function
## Import BRG1 bigwig file as an RleList
cvg.BRG1 <- import(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw),
format = "BigWig",
which = extended.5K.BRG1,
as = "RleList")
cvg.BRG1
## RleList of length 25
## $chr1
## numeric-Rle of length 248956422 with 1776460 runs
## Lengths: 975789 11 5 35 2 ... 1288 73 36 22594
## Values : 0.00 0.32 0.81 1.13 0.97 ... 0.00 0.17 0.33 0.00
##
## $chr10
## numeric-Rle of length 133797422 with 693950 runs
## Lengths: 270215 30 64 37 6 ... 90 58 154 272220
## Values : 0.00 0.65 0.97 0.81 0.65 ... 0.65 0.00 0.33 0.00
##
## $chr11
## numeric-Rle of length 135086622 with 740549 runs
## Lengths: 202802 21 117 221 41 ... 9 70 154 74826
## Values : 0.00 0.49 0.16 0.00 0.16 ... 0.16 0.00 0.16 0.00
##
## $chr12
## numeric-Rle of length 133275309 with 769359 runs
## Lengths: 159158 105 616 25 129 ... 24 372 154 89209
## Values : 0.00 0.16 0.00 0.16 0.49 ... 0.16 0.00 0.16 0.00
##
## $chr13
## numeric-Rle of length 114364328 with 467206 runs
## Lengths: 18172785 58 1 21 ... 44 110 40729
## Values : 0.00 0.16 0.65 0.97 ... 0.49 0.16 0.00
##
## ...
## <20 more elements>
The read coverage is imported as an RleList
where each element represents a
chromosome and is a long vector of coverage values along this chromosome,
represented as a read-length encoded (Rle
) vector.
Here, we used the rtracklayer import
function but below we will also
illustrate the readBigWig
function from the genomation package.
Show code: Use the ScoreMatrixBin
function from genomation tu summarize the
coverage in 200 bins
Show code: Use the heatMatrix
function to plot the heatmap
## Plot the heatmap, ordering the regions by their average intensity
heatMatrix(smr_BRG1,
xcoords = c(-5000, 5000),
order = TRUE)
winsorize
argument
in heatMatrix
to correct the color scale.Show code: Use the winsorize
argument in heatMatrix
to adjust the color scale
## Plot the heatmap, ordering the regions by their average intensity
heatMatrix(smr_BRG1,
xcoords = c(-5000, 5000),
winsorize = c(0, 99.9),
order = TRUE)
scaleScoreMatrix
function from genomation to see the effect on
the heatmapShow code: Plot the heatmap of a scaled matrix
## Plot the heatmap for the scaled matrix
heatMatrix(scaleScoreMatrix(smr_BRG1),
xcoords = c(-5000, 5000),
order = TRUE)
clustfun
argumentShow code: Use the clustfun
argument to group the peaks based on their
patterns of read coverage by using k-means clustering (e.g. 5 groups)
## Make 5 groups of peaks based on the pattern of read coverage around the peaks
set.seed(123) # for reproducibility since we're using k-means clustering
heatMatrix(smr_BRG1,
xcoords = c(-5000, 5000),
winsorize = c(0, 99.9),
clustfun = function(x) kmeans(x, centers=5)$cluster)
In practice we would probably prefer to perform the clustering outside of the
call to the
heatMatrix
function. But when the clustfun
argument is used,
the function returns the clustering result so that we can reuse it for other
analysis.
Now let’s investigate the read coverage of the 3 factors around SOX10 peaks. We will choose a window of +/-3kb around SOX10 peaks and sort them by decreasing qvalue. We will also summarize the coverages across these regions using 300 bins.
Step by step:
- Get the +/-3kb regions around SOX10 peak centers sorted by decreasing qvalues.
Show code: Get +/-3kb regions around sorted SOX10 peak centers
readBigWig
function from
the genomation package.Show code: Import read coverage for the 3 factors
## Import read coverage for the 3 factors
cvg <- list()
cvg[["SOX10"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$sox10.bw),
windows = SOX10_extend3kb_sorted)
cvg[["BRG1"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw),
windows = SOX10_extend3kb_sorted)
cvg[["MITF"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$mitf.bw),
windows = SOX10_extend3kb_sorted)
ScoreMatrixList
Show code: Compute score matrices for the 3 factors using 300 bins
Show code: Plot the heatmaps using multiHeatMatrix
## Plot the heatmaps side-by-side
multiHeatMatrix(smr_at_SOX10peaks,
xcoords = c(-3000, 3000),
winsorize = c(0, 99.9))
Note that if you use the argument
order = TRUE
, the rows will be sorted by
the average of the signal across the 3 matrices, which is why it is important to
sort the regions of interest before plotting the heatmaps and to use a function
that keeps this row order.
So far, we have used functions from ChIPseeker and genomation to plot the
heatmaps. While these functions offer some level of customization, you may need
further customization in certain cases. The
EnrichedHeatmap
package offers an almost complete level of customization to plot heatmaps and
heatmap annotations from functional genomic experiments.
The package is built on the
ComplexHeatmap package, which
has an extensive documentation and examples.
As an example, let’s plot the enrichment of SOX10 and BRG1 reads around SOX10 peaks.
Step by step:
- First convert the ScoreMatrix objects to normalizedMatrix objects that
are used by EnrichedHeatmap. You can create a normalizedMatrix using the
normalizeToMatrix
function but since we already computed the matrices, we
will instead convert them using as.normalizedMatrix
Show code: Convert a ScoreMatrix to a normalizedMatrix
## Convert signal matrices to normalizedMatrix objects
normat <- list()
## For BRG1 signal:
normat[["BRG1"]] <- as.normalizedMatrix(
as(smr_at_SOX10peaks[["BRG1"]], "matrix"),
k_upstream = 150,
k_downstream = 150,
k_target = 0,
extend = c(3000, 3000),
signal_name = "BRG1",
target_name = "SOX10 peak centers"
)
## For SOX10 signal:
normat[["SOX10"]] <- as.normalizedMatrix(
as(smr_at_SOX10peaks[["SOX10"]], "matrix"),
k_upstream = 150,
k_downstream = 150,
k_target = 0,
extend = c(3000, 3000),
signal_name = "SOX10",
target_name = "SOX10 peak centers"
)
Show code: Plot BRG1 EnrichedHeatmap
Show code: Plot SOX10 + BRG1 EnrichedHeatmaps
## SOX10 + BRG1 EnrichedHeatmap
EnrichedHeatmap(normat$SOX10, name = "SOX10") +
EnrichedHeatmap(normat$BRG1, name = "BRG1")
The first heatmap (SOX10) is ordered by increasing average signal and the second
heatmap (BRG1) follows the same order as the first one so that we can compare
the signal of the 2 factors in the same regions.
Show code: cluster BRG1 signal into 8 groups
Show code: Based on the quantiles of the signal matrices create color scales for
the heatmaps. The color scales are created with the colorRamp2
function from the circlize
package which linearly interpolates colors according to break values.
## Look at the quantiles of the BRG1 matrix
quantile(as(smr_at_SOX10peaks[["BRG1"]], "matrix"), seq(0.95,1,0.01))
## 95% 96% 97% 98% 99% 100%
## 8.8985 9.7420 10.8640 12.4920 15.5355 105.4010
## 95% 96% 97% 98% 99% 100%
## 3.436025 4.040000 4.770000 5.814500 7.652500 162.255500
Show code: EnrichedHeatmaps of SOX10 and BRG1
## Create a vector of the colors that will be used for the clusters
clustcolors <- brewer.pal(5,"Set1")
names(clustcolors) <- 1:5
## Create a legend object for the partition
lgd <- Legend(at = paste0("cluster", 1:5),
title = "Clusters",
type = "lines", legend_gp = gpar(col = clustcolors))
## In the plot we want:
## 1) the partition in cluster 1 to 8
## 2) the heatmap of BRG1 with the average profiles for the different clusters
## 3) the heatmap of SOX10 with the average profiles for the different clusters
## Define the heatmap list with these 3 objects:
ht_list <- Heatmap(part5clust,
col = clustcolors,
cluster_rows = FALSE,
name = " ",
width = unit(3, "mm"),
show_heatmap_legend = FALSE) +
EnrichedHeatmap(normat$BRG1,
name = "BRG1",
col=col_brg1,
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = clustcolors))),
column_title = "BRG1",
axis_name = c("-3kb", "SOX10\npeak center", "+3kb")) +
EnrichedHeatmap(normat$SOX10,
name = "SOX10",
col=col_sox10,
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = clustcolors))),
column_title = "SOX10",
axis_name = c("-3kb", "SOX10\npeak center", "+3kb"))
## Draw the heatmap while spliting the rows based on the clusters:
draw(ht_list,
split = part5clust,
annotation_legend_list = list(lgd),
ht_gap = unit(c(2, 8, 8), "mm"))
In this paragraph, we will include gene expression data from RNA-seq in our
analyses.
Let’s import the processed data.
# load RNA-seq normalized expression data
RNAseq <- readRDS(file.path(params$workdir,
"data", "RNAseq",
params$rnaseq)) # RNAseq
head(RNAseq)
## Ensembl.Gene.ID WT.norm shBGR1.norm logCPM logFC FDR
## 1 ENSG00000000003 722.38046 119.10483 4.387012 2.5748948 2.233826e-21
## 2 ENSG00000000419 1481.30280 2514.73393 6.689308 -0.6994493 1.391964e-13
## 3 ENSG00000000457 259.21973 314.27718 5.364018 -0.3167911 3.091766e-05
## 4 ENSG00000000460 333.22470 380.48495 5.562143 -0.1954216 2.875151e-03
## 5 ENSG00000000971 13.42921 13.59959 0.341927 0.6570620 2.536438e-02
## 6 ENSG00000001036 1157.52492 1521.75139 5.567526 -0.3587066 3.117933e-06
## signif
## 1 stable
## 2 stable
## 3 stable
## 4 stable
## 5 stable
## 6 stable
RNA-seq columns:
Reads from RNA-seq were assigned to genes annotated in ensEMBL and thus are identified with ensEMBL ID. However, in the TxDb package that we loaded, the genes are annotated with ENTREZ IDs. Thus, we need to map the identifiers to each other.
NOTE:: it is generally a good idea to stick to a single annotation file to analyze all your data. But sometimes, you get data from other sources (a colleague, a public database, etc.) so you don’t have the choice.
We use the OrgDB
object for the human genome to do this.
The available information are displayed using columns()
.
We then use the function mapIds
from the AnnotationDbi
package with the
keys ENSEMBL
and ENTREZID
. You can see that other identifiers are
available from this object (e.g: SYMBOL, REFSEQ, UNIPROT etc).
Show code: map ENSEMBL IDs to ENTREZ IDs
conv <- mapIds(x = org.Hs.eg.db,
keys = RNAseq$Ensembl.Gene.ID, # what we want to map
keytype = "ENSEMBL", # what type of IDs are the keys
column = "ENTREZID") # which type of ID we want
## 'select()' returned 1:many mapping between keys and columns
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000971
## "7105" "8813" "57147" "55732" "3075"
## ENSG00000001036
## "2519"
The warning indicates that there is not a perfect one-to-one match between ENSEMBL ids and ENTREZ ids.
Let’s take a closer look at the mappings.
Show code: Explore the conv object
## [1] 19978
## [1] 4218
## [1] 15760
length(unique(conv[!is.na(conv)])) # Number of unique ENTREZ IDs that corresponded to at least one ENSEMBL ID
## [1] 15749
table(conv[!is.na(conv)])[table(conv[!is.na(conv)]) > 1] # ENTREZ IDs that were mapped to more than one ENSEMBL IDS
##
## 150763 283440 29994 4102 414245 51326 51463 5554 5664 643836 883
## 2 2 2 2 2 2 2 2 2 2 2
The names of the conv
object should be in the same order as in the column
Ensembl.Gene.ID
of the RNAsseq
table. We verify it and then assign the
conv
vector as a new ENTREZID
column in the RNAseq
dataset
Let’s take a closer look at the mappings.
Show code: Create an ENTREZID column in the RNASeq data frame
## [1] TRUE
We will now focus on the TSS of the genes and take a look at the binding of the factors around the TSS.
First, we extract a GRanges
object with all the TSS.
Show code: Create a GRanges object with the TSS for all genes
## 2162 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
Then, we filter this tss
object to keep only the TSS for which we have gene
expression data.
Let’s sort the TSS by average gene expression in the cell lines (logCPM
column).
We use the match
function to find a match between the names of tss
and the
corresponding row in the RNAseq
dataset
## Add a metadata column with the logCPM from the RNAseq dataset
mcols(tss)$logCPM <- RNAseq$logCPM[match(names(tss), RNAseq$ENTREZID)]
## sort tss by decreasing logCPM
tss <- tss[order(tss$logCPM, decreasing = TRUE)]
Import the read coverage for the 3 factors at the TSS
Show code: Import the read coverage for SOX10, BRG1 and MITF at the TSS
cvg_tss <- list()
cvg_tss[["SOX10"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$sox10.bw),
windows = tss + 4000)
cvg_tss[["BRG1"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw),
windows = tss + 4000)
cvg_tss[["MITF"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$mitf.bw),
windows = tss + 4000)
Now, we use genomation to produce the heatmaps of the different factors in regions covering +/-4kb around the TSS
Show code: Plot a heatmap of the signal for the 3 factors at the TSS
#Get the score matrices around the TSS
smr_at_TSS <- ScoreMatrixList(target = cvg_tss,
windows = tss + 4000,
bin.num = 400)
#plot the heatmap
multiHeatMatrix(smr_at_TSS,
xcoords = c(-4000, 4000),
winsorize = c(0, 99.95))
There doesn’t seem to be a very clear association between the binding of the
3 factors and the gene expression level as measured by logCPM
. Note that
logCPM
is an imperfect measure of the relative gene expression when comparing
different genes because it doesn’t take into account the length of the
transcripts.
Anyway, there also seems to be different patterns of BRG1 binding around the
TSS. So we are going to use kmeans to cluster the patterns of BRG1 signal
into 4 groups.
Show code: Partition de TSS into 4 groups based on BRG1 signal
Take a look at the partition that was made.
Show code: Plot a heatmap of the signal for the 3 factors for the partitioned TSS
#plot the heatmap
multiHeatMatrix(smr_at_TSS,
xcoords = c(-4000, 4000),
winsorize = c(0, 99.95),
group = factor(part_km4_tss))
The 4 groups roughly correspond to
We would like to know if there is some link between gene expression in WT
controls and shBRG1 cells and this partition of the BRG1 binding at the TSS.
To do this, we can add the WT.norm
and shBRG1
columns to the metadata of
the tss
object. We will also add the signif
columns that we will need
later.
## first get the match between the names of tss and the ENTREZIDs in the RNAseq table
mm <- match(names(tss), RNAseq$ENTREZID)
## then add the columns to the metadata of tss
mcols(tss)$WT.norm <- RNAseq$WT.norm[mm]
mcols(tss)$shBGR1.norm <- RNAseq$shBGR1.norm[mm]
mcols(tss)$signif <- RNAseq$signif[mm]
## also add the info about the clusters
mcols(tss)$clusters <- paste0("cluster", part_km4_tss)
Now we can extract a data frame from the tss
object and reformat it to make
a boxplot with ggplot2.
Show code: Reformat the mcols
of the tss object to prepare it for ggplot2
exprdf <- tidyr::pivot_longer(as.data.frame(mcols(tss))[, c("WT.norm", "shBGR1.norm", "clusters")],
cols = c("WT.norm", "shBGR1.norm"),
names_to = "condition",
values_to = "expression")
head(exprdf)
## # A tibble: 6 × 3
## clusters condition expression
## <chr> <chr> <dbl>
## 1 cluster2 WT.norm 491811.
## 2 cluster2 shBGR1.norm 61552.
## 3 cluster2 WT.norm 164202.
## 4 cluster2 shBGR1.norm 10699.
## 5 cluster2 WT.norm 64235.
## 6 cluster2 shBGR1.norm 31308.
And make the boxplot with ggplot2.
Show code: Make a boxplot with ggplot2
ggplot(exprdf, aes(x = clusters, y = log2(expression+1), color = condition)) +
geom_boxplot() +
scale_color_npg() +
theme_bw() +
theme(legend.position = "top")
We see that:
Let’s see if the distribution of “significantly” differentially expressed genes differs among clusters.
Show code: Make a barplot for the signif
variable for each cluster
ggplot(as.data.frame(mcols(tss)), aes(x = clusters, fill = signif)) +
geom_bar(position = "fill", color="black") +
scale_fill_manual(values=c(up="red",down="blue",stable="white")) +
theme_bw()
We confirm that cluster2 has more significantly upregulated genes than the other clusters.
We want to know if the genes in cluster 2, which are characterized by a diffuse binding pattern of BRG1 at their TSS and a tendency to be upregulated upon BRG1 knockdown, are enriched for specific Gene Ontology (GO) Biological Process annotations.
The clusterprofiler
package provides very useful functions to perform either classical enrichment
analysis of functional categories in gene lists or gene set enrichment analyses.
Here we will use the classical approach, based on the hypergeometric
distribution, called “enrichment analysis” or “GO over-representation analysis”
in the terminology of clusterprofiler documentation.
We can easily extract the IDs of the genes in cluster2 using for example:
names(tss)[tss$clusters == "cluster2"]
The main question that we need to answer is what is the gene universe? In general, you should think as the “gene universe” as the set of genes that could potentially have been selected in your gene list. Since we did different filters, e.g. by selecting genes present in the RNAseq dataset or that had a match between the ENSEMBLID and the ENTREZID, our gene universe does NOT encompass all the genes in the genome. In our case, a good definition of the gene universe would be all the genes in the tss object, since we performed kmeans on these genes to identify genes in cluster 2.
So now, let’s use this info to compute the GO over-representation statistics:
# load the library
library(clusterProfiler)
# Run the enrichment analysis
ego <- enrichGO(gene = names(tss)[tss$clusters == "cluster2"],
universe = names(tss),
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
And take a look at the results:
## ID Description GeneRatio
## GO:0042438 GO:0042438 melanin biosynthetic process 10/1041
## GO:0046189 GO:0046189 phenol-containing compound biosynthetic process 12/1041
## GO:0006582 GO:0006582 melanin metabolic process 10/1041
## GO:0044550 GO:0044550 secondary metabolite biosynthetic process 10/1041
## GO:0002385 GO:0002385 mucosal immune response 8/1041
## GO:0032922 GO:0032922 circadian regulation of gene expression 17/1041
## BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust
## GO:0042438 21/11891 0.4761905 5.439367 6.306766 3.642124e-06 0.005906749
## GO:0046189 31/11891 0.3870968 4.421679 5.908526 5.578739e-06 0.005906749
## GO:0006582 22/11891 0.4545455 5.192123 6.095926 6.154427e-06 0.005906749
## GO:0044550 22/11891 0.4545455 5.192123 6.095926 6.154427e-06 0.005906749
## GO:0002385 14/11891 0.5714286 6.527240 6.409445 6.247561e-06 0.005906749
## GO:0032922 59/11891 0.2881356 3.291278 5.464823 7.365024e-06 0.005906749
## qvalue
## GO:0042438 0.005694326
## GO:0046189 0.005694326
## GO:0006582 0.005694326
## GO:0044550 0.005694326
## GO:0002385 0.005694326
## GO:0032922 0.005694326
## geneID
## GO:0042438 PMEL/TYRP1/DCT/TYR/MFSD12/RAB38/ZEB2/GIPC1/SLC7A11/MC1R
## GO:0046189 PMEL/TYRP1/DCT/TYR/MFSD12/RAB38/ZEB2/SNCA/GIPC1/SLC7A11/NR4A2/MC1R
## GO:0006582 PMEL/TYRP1/DCT/TYR/MFSD12/RAB38/ZEB2/GIPC1/SLC7A11/MC1R
## GO:0044550 PMEL/TYRP1/DCT/TYR/MFSD12/RAB38/ZEB2/GIPC1/SLC7A11/MC1R
## GO:0002385 OTUD7B/RAB17/H2BC21/H2BC8/H2BC12/H2BC11/H2BC4/H2BC7
## GO:0032922 ATF4/PPP1CB/PPP1CC/HDAC2/BHLHE41/CSNK1E/PER1/TOP1/GFPT1/BHLHE40/RAI1/MYCBP2/NPAS2/NCOA2/ID2/NOCT/NR1D1
## Count
## GO:0042438 10
## GO:0046189 12
## GO:0006582 10
## GO:0044550 10
## GO:0002385 8
## GO:0032922 17
Now let’s perform the same type of analysis but focusing on the differentially
expressed (DE) genes present in each of the 4 clusters.
Here, we’re asking if DE genes that have different BRG1 binding patterns at
their TSS are enriched in specific KEGG pathways.
The clusterProfiler package provides the function compareCluster
that
allows to analyze and compare enrichment in different group of genes. This
function recognizes ENTREZ identifiers.
# Select genes whose gene expression is significantly changing
# between conditions
geneList <- tss[tss$signif != "stable",]
# The function split can split a vector to a list following categories
# in an other vector (take a look at the geneList object after running split).
geneList <- split(names(geneList), geneList$clusters)
# Now perform a KEGG pathway over-representation analysis
compKEGG <- compareCluster(geneCluster = geneList,
fun = "enrichKEGG",
pvalueCutoff = 0.05,
pAdjustMethod = "BH")
# Plot the results
dotplot(compKEGG,
showCategory = 15,
title = "KEGG Pathway Enrichment Analysis")
Let’s focus on cluster 4 and visualize which genes of the the KEGG
pathway “Apoptosis” are both differentially expressed and
bound by BRG1 and MITF.
We obtain the identifier of the pathway by looking at the compKEGG
object:
compKEGG@compareClusterResult
So the ID of the KEGG pathway “Apoptosis” is hsa04210.
We use the pathview
package which maps differential gene expression values
to KEGG pathway maps and create a PNG file in your current directory that you
get with getwd()
.
# load the library
library(pathview)
# Get the names of the genes in cluster 4
EntrezID.cl4 <- names(tss)[tss$clusters == "cluster4"]
# match these EntrezIDs to the RNAseq dataset
mmid <- match(EntrezID.cl4, RNAseq$ENTREZID)
# Get the log fold-change for these genes from the RNAseq table
logFC.cl4 <- RNAseq$logFC[mmid]
names(logFC.cl4) <- RNAseq$ENTREZID[mmid]
# Create an image of the pathway with colors for differential gene expression
pathview(logFC.cl4, pathway.id = "hsa04210", species = "hsa")
Often, one of the task associated with ChIP-seq experiments consists in
analyzing the sequences that underlie the peaks. The aim could be to identify
the binding motif of the factor that was immunoprecipitated, or to evaluate if
other factors could bind the same regions.
Here we will illustrate some basic tasks associated with these aims.
First, load the Biostrings library and the BSgenome library corresponding to the hg38 human genome.
## Load the Biostrings library and the BSgenome library corresponding to hg38
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg38) ## A "Hsapiens" object allows to access the genome sequence
Here, we will search for motifs enriched under the SOX10 peaks. Since we do not start from a pre-defined list of motifs, e.g. available in a database like JASPAR or CIS-BP, this is called a de novo search. We start by sorting SOX10 peaks by qvalue and keep the sequences corresponding to the peak centers +/- 150bp:
sox10peaks_300bp <- resize(peakfilt[["SOX10"]][order(peakfilt[["SOX10"]]$qvalue,
decreasing = TRUE)],
width = 300,
fix = "center")
Next, we extract the underlying sequences and obtain a DNAStringSet
object:
Most of the time, you will want to export these sequences as a fasta file and run an external motif discovery program such as:
Note that the TFBStools
package has a runMEME
function that is a wrapper to MEME.
To export your the sequences you can use:
## Define the path to the fasta file:
myfastafile = file.path("data", "sox10peaks_300bp.fa")
## NOT RUN (already exported): export sequences in fasta format
writeXStringSet(sox10_seq, myfastafile)
Here, we will use the BCRank
package which can also perform a de novo search for motifs within R.
The method takes a list of DNA sequences sorted by some measure of the
probability that a relevant motif is present. Here, the qvalue plays the role
of this measure, which is why we sorted the peaks by qvalue.
BCrank
takes a long time to run, so we will not run the commands below.
##NOT RUN (too long...)
set.seed(123) #for reproducibility
BCRANKout <- bcrank(myfastafile,
restarts = 25,
silent = TRUE,
use.P1 = TRUE,
use.P2 = TRUE)
saveRDS(BCRANKout, file.path("data", "BCRANKout.rds"))
Instead, we ran the commands for you and saved the object as an rds file. So we can just import this file and take a look at it:
##
## An object of class "BCRANKresult"
##
## Top 25 DNA motifs predicted by BCRANK:
##
## Consensus Score
## 1 GCWTTGT 122.34709
## 2 VBRVCCTGCTTT 84.48699
## 3 GSRKCANTGTT 75.15736
## 4 VCTGTGAGB 71.66138
## 5 ATKGCACCWDD 69.77417
## 6 GVVTTGRGAB 69.57777
## 7 TYCAMAKCC 65.85154
## 8 TTSCCVAAG 65.51148
## 9 ASWTTTTGCHC 65.39492
## 10 SNCAASHCGG 64.43600
## 11 MCGGVYCAGC 64.12572
## 12 ACAAHKCAVGY 62.69532
## 13 TGGGGCSDDG 62.05447
## 14 TTTTGDAGSTCT 61.36914
## 15 TCGAKASC 59.74472
## 16 WTBCAATRATA 59.70302
## 17 CRKGTCDG 59.32284
## 18 AHAATDBGRCM 58.34722
## 19 SCYTCACBNGY 58.12427
## 20 CYAKTGNNVAC 56.36697
## 21 DTGTVHATTY 56.26696
## 22 HHTARTCTGTCY 55.40342
## 23 BCMVCGNGCA 55.34975
## 24 KCWSWTCCBTT 51.94605
## 25 CBCNVCGAGD 51.21762
##
##
## Use methods toptable(object) and fname(object) to access object slots.
We can extract the top motif found by BCRANK with:
and extract the Position Weight Matrix (PWM) for this motif and plot the sequence logo with:
pwmnorm_topmotif <- BCRANK::pwm(topMotif, normalize = TRUE) #with normalize=TRUE the columns will sum to 1 (PFM)
library(seqLogo)
seqLogo(reverseComplement(pwmnorm_topmotif))
You can go to the JASPAR website and search “SOX10”. You will see that the first motif found by BCRANK is very similar to the motifs that have been found for SOX10. We will actually retrieve and take a look at one of these motifs below.
Another approach consists in searching for known motifs in the ChIP-seq peaks. Since the SOX10 has been extensively studied, there are already SOX10 binding motifs available in the databases.
Search for SOX10 motifs in JASPAR databases
sox10_motifs = MotifDb::query(MotifDb, c("hsapiens", "jaspar", "sox10"))
## We keep the motif from JASPAR 2022
Sox10JASP22 <- sox10_motifs[["Hsapiens-jaspar2022-SOX10-MA0442.2"]]
seqLogo(Sox10JASP22)
MotifDb returns a position frequency matrix (PFM) with columns summing to 1. The number of sequences used to build the PFM are available in the elementMetadata or mcols of the object. We use this information to obtain a PWM:
nseq <- as.integer(mcols(sox10_motifs["Hsapiens-jaspar2022-SOX10-MA0442.2"])$sequenceCount)
sox10pfm <- apply(round(nseq * Sox10JASP22,0), 2, as.integer)
rownames(sox10pfm) <- rownames(Sox10JASP22)
Sox10JASP22_PWM <- PWM(sox10pfm)
And then use this PWM to screen the whole genome. Again, this takes some time so we saved the rds object for you
## NOT RUN (too long...)
SOX10_hg38 <- matchPWM(Sox10JASP22_PWM,
Hsapiens,
min.score="90%")
saveRDS(SOX10_hg38, file.path("data", "SOX10_hg38.rds"))
We obtain a GRanges with the location of all the motifs:
## GRanges object with 3127286 ranges and 2 metadata columns:
## seqnames ranges strand | score string
## <Rle> <IRanges> <Rle> | <numeric> <DNAStringSet>
## [1] chr1 11631-11641 + | 0.900030 TCCACAAAGTG
## [2] chr1 13572-13582 + | 0.911933 GCTACAAAGGT
## [3] chr1 16770-16780 + | 0.935700 AACACAAAGTG
## [4] chr1 19982-19992 + | 0.948358 GGGACAAAGGC
## [5] chr1 20010-20020 + | 0.930367 CCCACAAAGGA
## ... ... ... ... . ... ...
## [3127282] chrX_MU273397v1_alt 314005-314015 - | 0.929517 TGGACAAAGCT
## [3127283] chrX_MU273397v1_alt 319703-319713 - | 0.952440 AATACAAAGGG
## [3127284] chrX_MU273397v1_alt 319894-319904 - | 0.909040 AAAACAATGAA
## [3127285] chrX_MU273397v1_alt 324082-324092 - | 0.955583 CCAACAAAGGC
## [3127286] chrX_MU273397v1_alt 327059-327069 - | 0.960064 TACACAAAGAA
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
We can now overlap these motifs with our ChIPseq peaks:
## We set a minimum overlap of 11bp (size of the motif) to count only
## the motifs that are fully within the peaks
count_sox10 <- countOverlaps(sox10peaks_300bp,
SOX10_hg38,
minoverlap = 11)
## Number of peaks that have 0, 1, 2, etc. SOX10 binding sites
table(count_sox10)
## count_sox10
## 0 1 2 3 4 5
## 1734 1724 611 123 20 2
## [1] 58.85145
We see that more than 58.9% of the peaks
contain at least one of these motifs.
Assessing whether our peaks are significantly enriched for SOX10 motifs would
probably require to extract random sequences that share common characteristics
with SOX10 ChIP-seq peaks in terms of size and genomic distribution and evaluate
their enrichment in SOX10 binding sites.
Here as a substitute, let’s evaluate the frequency of SOX10 peaks in a random
selection of promoters
set.seed(123) # for reproducibility
count_sox10_tss <- countOverlaps(promoters(genes(txdb), upstream=300, downstream=0)[sample.int(length(sox10peaks_300bp))],
SOX10_hg38,
minoverlap = 11)
## 2162 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## [1] 10.08543
This suggests that the enrichment of SOX10 binding sites in our ChIP-seq peaks is very significant.
There are multiple ways to access public / published datasets and multiple
sources that you can mine in order to compare your own data with what other
groups have observed.
The following databases are only some examples:
-The (mod)ENCODE project
- The UCSC
genome browser
- The GEO database and the SRA
- The ENA database (which also
replicates GEO) and the ArrayExpress database
- The TCGA
- The Cistrome data browser
- The ReMap database
- The ChIP atlas
- etc…
ChIPSeeker
There Gene Expression Omnibus (GEO) contains a vast number of ChIP-seq datasets.
We can compare our own dataset to those deposited in
GEO to search for significant overlap data. Significant overlap of ChIP
seq data by different binding proteins may be used to infer cooperative
regulation and thus can be used to generate hypotheses.
The ChIPseeker package (Wang et al. 2022) package provides useful functions to download and use GEO datasets in R / Bioconductor.
We can collect >100,000 bed files deposited in GEO. You can use
getGEOspecies
to get the number of files available for different species.
## species Freq
## 1 Actinobacillus pleuropneumoniae 1
## 2 Actinobacillus pleuropneumoniae serovar 3 str. JL03 1
## 3 Aedes aegypti 11
## 4 Anabaena 6
## 5 Anolis carolinensis 7
## 6 Anopheles gambiae 2
## 7 Apis mellifera 17
## 8 Apis mellifera scutellata 1
## 9 Arabidopsis 17
## 10 Arabidopsis lyrata 4
## 11 Arabidopsis thaliana 2473
## 12 Atelerix albiventris 2
## 13 Bombus terrestris 8
## 14 Bos taurus 257
## 15 Brachypodium distachyon 10
## 16 Branchiostoma lanceolatum 93
## 17 Brassica rapa 12
## 18 Caenorhabditis elegans 532
## 19 Callithrix jacchus 48
## 20 Candida albicans 34
## 21 Candida dubliniensis 20
## 22 Canis lupus 1
## 23 Canis lupus familiaris 35
## 24 Capsaspora owczarzaki 21
## 25 Carica papaya 1
You can also use the getGEOgenomeVersion
to get the same information but
based on genome version used to analyze the data.
You can access the detailed information using the getGEOInfo
function,
for each genome version.
## series_id gsm organism title
## 16488 GSE58207 GSM1403308 Homo sapiens Lactimidomycin treated HCT116
## 16489 GSE58207 GSM1403308 Homo sapiens Lactimidomycin treated HCT116
## 16490 GSE58207 GSM1403307 Homo sapiens Cycloheximide treated HCT116
## 16491 GSE58207 GSM1403307 Homo sapiens Cycloheximide treated HCT116
## 20345 GSE67978 GSM1660032 Homo sapiens H3K27ac_Human_Brain_WhiteMatter_HS2
## 20351 GSE67978 GSM1660029 Homo sapiens H3K27ac_Human_Brain_OccipitalPole_HS2
## supplementary_file
## 16488 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403308/suppl/GSM1403308_HCT116_LTM_sense.bedGraph.gz
## 16489 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403308/suppl/GSM1403308_HCT116_LTM_anti-sense.bedGraph.gz
## 16490 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403307/suppl/GSM1403307_HCT116_CHX_anti-sense.bedGraph.gz
## 16491 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1403nnn/GSM1403307/suppl/GSM1403307_HCT116_CHX_sense.bedGraph.gz
## 20345 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1660nnn/GSM1660032/suppl/GSM1660032_H3K27ac_Human_Brain_WhiteMatter_HS2_hg38_peaks.narrowPeak.gz
## 20351 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1660nnn/GSM1660029/suppl/GSM1660029_H3K27ac_Human_Brain_OccipitalPole_HS2_hg38_peaks.narrowPeak.gz
## genomeVersion pubmed_id
## 16488 hg38 <NA>
## 16489 hg38 <NA>
## 16490 hg38 <NA>
## 16491 hg38 <NA>
## 20345 hg38 <NA>
## 20351 hg38 <NA>
ChIPseeker provides the function downloadGEObedFiles
to download all
the bed files of a particular genome.
Or a vector of GSM accession number by downloadGSMbedFiles
.
After the download of bed files from GEO, we can pass them to
enrichPeakOverlap
for testing the significance of the overlaps. Parameter
targetPeak can be the folder, e.g. hg38, that containing bed files.
enrichPeakOverlap
will parse the folder and compare all the bed files.
It is possible to test the overlap with bed files that are mapped to
different genome or different genome versions: enrichPeakOverlap
provides a parameter chainFile
that can pass a chain file and liftOver
the targetPeak
to the genome version consistent with queryPeak.
Significant overlaps can be used to generate hypotheses on cooperative
regulation. By mining the data deposited in GEO, we can identify putative
complexes or interacting regulators of gene expression or chromatin remodeling
for further validation.
AnnotationHub
The AnnotationHub is a web resource that provides a central location where
genomic files in different formats (e.g. vcf, bed, wig, etc.) and other resources
coming from different locations (e.g. UCSC, ENSEMBL, ENCODE, etc.) can be mined
and downloaded to enrich your analyses. Importantly, the resource that are
available include metadata with e.g. description of the experiments, tags
and date of modification which are important to figure out where exactly the
data comes from and if it fits with the aim of your analyses.
The AnnotationHub
package is a client for R/Bioconductor to search these resources and download
the data in appropriate formats for their analysis in the R/Bioconductor
environment.
The client creates and manages a local cache of files that are retrieved by the
user, which allows quick and reproducible access to the data.
First we need to create an AnnotationHub
instance. This will create a local
cache on your machine so that repeat queries for the same information (even in
different R sessions) will be very fast. By default, the cache will be created
in your home folder, which may not be appropriate in a shared server if you have
a limited quota of space in your home directory. You can change the location of
the cache with:
## NOT RUN: example of how to set the path to the cache
setAnnotationHubOption("CACHE") <- "/shared/projects/projectname/AnnotationHub"
## to check that the path was set correctly:
getAnnotationHubOption("CACHE")
Then we create the AnnotationHub
instance with:
##NOT RUN (too long...)
ah <- AnnotationHub()
# |=============================================================================| 100%
#
# snapshotDate(): 2024-10-28
This will take some time the first time you run it since there are a lot of
resources available. The ah
object is quite big but it only contains pointers
to online data, not the data itself which would be way too big to download all
at once. The information content is changing frequently, which is why there is
a snapshotDate
.
Let’s take a look at the ah
object:
##NOT RUN
ah
# AnnotationHub with 72100 records
# # snapshotDate(): 2024-10-28
# # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,ENSEMBL,CELLPHONEDB,...
# # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norvegicus, Bos taurus, Pan troglodytes, Danio rerio, Gallus gallus, Sus scrofa, Mon...
# # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFile, ChainFile, TxDb, Inparanoid8Db
# # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, rdatapath,
# # sourceurl, sourcetype
# # retrieve records with, e.g., 'object[["AH5012"]]'
#
# title
# AH5012 | Chromosome Band
# AH5013 | STS Markers
# AH5014 | FISH Clones
# AH5015 | Recomb Rate
# AH5016 | ENCODE Pilot
# ... ...
# AH119506 | Ensembl 113 EnsDb for Zonotrichia albicollis
# AH119507 | Ensembl 113 EnsDb for Zalophus californianus
# AH119508 | Ensembl 113 EnsDb for Zosterops lateralis melanops
# AH119518 | MassBank CompDb for release 2024.06
# AH119519 | MassBank CompDb for release 2024.11
As of 2024-10-28
, the AnnotationHub contains 72,100 records.
The ah
object is organized as a vector (so its length
is 72100 in my case).
You can get information about a single resource by indexing with a single [
;
using two brackets ([[
) downloads the object:
##NOT RUN
ah[1]
# AnnotationHub with 1 record
# # snapshotDate(): 2024-10-28
# # names(): AH5012
# # $dataprovider: UCSC
# # $species: Homo sapiens
# # $rdataclass: GRanges
# # $rdatadateadded: 2013-03-26
# # $title: Chromosome Band
# # $description: GRanges object from UCSC track 'Chromosome Band'
# # $taxonomyid: 9606
# # $genome: hg19
# # $sourcetype: UCSC track
# # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cytoBand
# # $sourcesize: NA
# # $tags: c("cytoBand", "UCSC", "track", "Gene", "Transcript", "Annotation")
# # retrieve record with 'object[["AH5012"]]'
Now, we can use various tools to narrow down the dataset or datasets that are
actually of interest for our analyses.
Let’s say that we are looking for a dataset describing regions in the human
genome that are potential distal enhancers. The ENCODE project has typically
defined such regions using the data they have produced on e.g. histone marks
H3K27ac, H3K4me3 or DNAseI accessibility.
Let’s take a look at the data providers:
##NOT RUN
unique(ah$dataprovider)
# [1] "UCSC"
# [2] "Ensembl"
# [3] "RefNet"
# [4] "Inparanoid8"
# ...
# [72] "ENCODE"
# ...
# [83] "MGI"
# [84] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
Let’s take a look at the available data from encode:
##NOT RUN
## Number of datasets available from ENCODE (note that there are more since in
## "dataprovider" there is also e.g. "ENCODE Project" or "ENCODE SCREEN v3"
sum(ah$dataprovider == "ENCODE")
# [1] 20
# So there are 20 datasets from the dataprovider "ENCODE"
## Take a look at the titles of these datasets
ah[ah$dataprovider == "ENCODE"]$title
# [1] "hg38.Kundaje.GRCh38_unified_Excludable" "hg38.Bernstein.Mint_Excludable_GRCh38"
# [3] "hg38.Kundaje.GRCh38.Excludable" "hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38"
# [5] "hg38.Wold.hg38mitoExcludable" "hg38.Yeo.eCLIP_Excludableregions.hg38liftover.bed.fixed"
# [7] "hg19.Bernstein.Mint_Excludable_hg19" "hg19.Birney.wgEncodeDacMapabilityConsensusExcludable"
# [9] "hg19.Crawford.wgEncodeDukeMapabilityRegionsExcludable" "hg19.Wold.hg19mitoExcludable"
# [11] "hg19.Yeo.eCLIP_Excludableregions.hg19" "mm10.Hardison.Excludable.full"
# [13] "mm10.Hardison.psuExcludable.mm10" "mm10.Kundaje.anshul.Excludable.mm10"
# [15] "mm10.Kundaje.mm10.Excludable" "mm10.Wold.mm10mitoExcludable"
# [17] "mm9.Wold.mm9mitoExcludable" "ENCODE_dELS_regions"
# [19] "ENCODE_pELS_regions" "ENCODE_PLS_regions"
I wonder what “PLS_regions” are 🤔 and I want to know more. So let’s take a look at the object description:
##NOT RUN
ah[ah$dataprovider == "ENCODE" & ah$title == "ENCODE_PLS_regions"]$description
# [1] "A GRanges object containing regions of candidate cis-regulatory elements with promoter-like signatures
# as identified by the ENCODE SCREEN project. These consist of regions with high H3K4me3 and DNase signal, and
# located within 200 bp of a GENCODEtranscription start site."
OK, this looks very close to what we are looking for and I suppose that pELS and dELS refer to proximal and distal enhancer-like signatures. Let’s verify:
##NOT RUN
ah[ah$dataprovider == "ENCODE" & ah$title == "ENCODE_dELS_regions"]$description
# [1] "A GRanges object containing regions of candidate cis-regulatory elements with distal enhancer-like
# signatures as identified by the ENCODE SCREEN project. These consist of regions with high H3K27ac and
# DNase signal, but low H3K4me3 signal, and located more than 2kb from GENCODE transcription start sites."
So now we can get the names of the object, download it and use it in R (it’s a GRanges object):
##NOT RUN
## Get the name / ID of the dataset
names(ah[ah$dataprovider == "ENCODE" & ah$title == "ENCODE_dELS_regions"])
# [1] "AH116727"
## Download the dataset (use double brackets `[[`)
dELS <- ah[["AH116727"]]
# downloading 1 resources
# retrieving 1 resource
# |===========================================================================| 100%
#
# loading from cache
# require(“GenomicRanges”)
## Take a look at the object:
dELS
# GRanges object with 786756 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 271227-271468 *
# [2] chr1 274330-274481 *
# [3] chr1 605331-605668 *
# [4] chr1 727122-727350 *
# [5] chr1 807737-807916 *
# ... ... ... ...
# [786752] chrY 26319537-26319816 *
# [786753] chrY 26319848-26320155 *
# [786754] chrY 26363541-26363721 *
# [786755] chrY 26670949-26671287 *
# [786756] chrY 26671293-26671478 *
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
## Since seqinfo has not been set, you may wonder which genome version was used.
## You can get this info from the metadata of the ah object:
ah['AH116727']$genome
# [1] "hg38"
In another example, we will use some useful function to search the AnnotationHub:
subset
(or [
) to do a specific subseting operation.query
to do a command-line search over the metadata of the hub.display
to get a Shiny interface in a browser, so you can browse the object.We would like to know if there are any other dataset available for MITF in human.
We first filter the hub to keep only the datasets obtained in human, or the
datasets where the species is unspecified (NA
), in case some metadata is
missing.
Then we query the metadata for MITF:
##NOT RUN
query(ah, "MITF")
# AnnotationHub with 1 record
# # snapshotDate(): 2024-10-28
# # names(): AH22280
# # $dataprovider: Pazar
# # $species: NA
# # $rdataclass: GRanges
# # $rdatadateadded: 2015-03-09
# # $title: pazar_MITF_Strub_20120522.csv
# # $description: TF - Target Gene file from pazar_MITF_Strub_20120522
# # $taxonomyid: NA
# # $genome: NA
# # $sourcetype: CSV
# # $sourceurl: http://www.pazar.info/tftargets/pazar_MITF_Strub_20120522.csv
# # $sourcesize: 34964
# # $tags: c("Pazar", "Transcription Factors")
# # retrieve record with 'object[["AH22280"]]'
Indeed this dataset does not contain all the metadata but there is a link to the original source file and searching the web for “MITF, Strub” quickly gets you to this publication where all the info is available to evaluate if this dataset is relevant for us.
Finally, another way to mine the AnnotationHub is to use a browser with:
Note how we assign the call to the display
function to an R object (oudata
)
so that we can actually capture the selection of datasets that we will make from
the browser into our R session.
In previous annotation steps, we have annotated peaks using annotation from UCSC’s knownGene. The database chosen for annotation can have an impact on subsequent results including data integration such as comparison between genes associated to peaks and gene expression using RNA-seq data for instance.
Let’s look at the overlap between genes expressed in RNA-seq data with genes associated to peaks (let’s remind that peaks are associated to closest genes if no other evidences are used).
## Download GTF files
library(GenomicFeatures)
## Import BRG1 peaks annotated with Ensembl and Refseq annotation
annot <- readRDS(file.path(params$workdir, "data", params$annot))
## Here is how annotation data were downloaded
# txdb.ensembl = makeTxDbFromGFF("data/ChIPseq/Homo_sapiens.GRCh38.103_UCSC_chr.gtf.gz")
# txdb.refseq = makeTxDbFromUCSC(genome="hg38", tablename="refGene")
## Here is how annotation was performed
# annot <- list()
# annot[["ensembl"]] <- annotatePeak(peaks$BRG1, tssRegion=c(-1000, 100), TxDb=txdb.ensembl, annoDb="org.Hs.eg.db")
# annot[["refseq"]] <- annotatePeak(peaks$BRG1, tssRegion=c(-1000, 100), TxDb=txdb.refseq, annoDb="org.Hs.eg.db")
## Add Gene symbols to RNAseq data
conv.symbol <- mapIds(x = org.Hs.eg.db,
keys = RNAseq$Ensembl.Gene.ID, # what do we want to be mapped
column = c("SYMBOL"), # which type of ID we want
keytype = "ENSEMBL") # what type of ID we give
mEnsRS <- match(RNAseq$Ensembl.Gene.ID,
names(conv.symbol))
RNAseq$SYMBOL <- conv.symbol[mEnsRS]
## Now let's create a venn diagram that compares peak annotation with
## refseq or Ensembl
library(ggVennDiagram)
x <- list(RNAseq = RNAseq[RNAseq$signif=="up",]$SYMBOL,
Ensembl = as.data.frame(annot$ensembl)$SYMBOL,
Refseq = as.data.frame(annot$refseq)$SYMBOL
)
ggVennDiagram(x) + scale_fill_gradient(low="white",high = "blue")
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] knitr_1.49
## [2] MotifDb_1.48.0
## [3] seqLogo_1.72.0
## [4] BCRANK_1.68.0
## [5] BSgenome.Hsapiens.UCSC.hg38_1.4.5
## [6] BSgenome_1.74.0
## [7] BiocIO_1.16.0
## [8] Biostrings_2.74.0
## [9] XVector_0.46.0
## [10] pathview_1.46.0
## [11] clusterProfiler_4.14.3
## [12] org.Hs.eg.db_3.20.0
## [13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
## [14] GenomicFeatures_1.58.0
## [15] AnnotationDbi_1.68.0
## [16] Biobase_2.66.0
## [17] circlize_0.4.16
## [18] EnrichedHeatmap_1.36.0
## [19] ComplexHeatmap_2.22.0
## [20] rtracklayer_1.66.0
## [21] UpSetR_1.4.0
## [22] reshape2_1.4.4
## [23] skimr_2.1.5
## [24] ggsci_3.2.0
## [25] RColorBrewer_1.1-3
## [26] ggplot2_3.5.2
## [27] ChIPpeakAnno_3.40.0
## [28] GenomicRanges_1.58.0
## [29] GenomeInfoDb_1.42.0
## [30] IRanges_2.40.0
## [31] S4Vectors_0.44.0
## [32] BiocGenerics_0.52.0
## [33] genomation_1.38.0
## [34] ChIPseeker_1.42.1
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.6
## [2] ProtGenerics_1.38.0
## [3] matrixStats_1.5.0
## [4] bitops_1.0-9
## [5] enrichplot_1.29.0
## [6] lubridate_1.9.3
## [7] httr_1.4.7
## [8] doParallel_1.0.17
## [9] Rgraphviz_2.50.0
## [10] repr_1.1.7
## [11] InteractionSet_1.34.0
## [12] tools_4.4.2
## [13] R6_2.6.1
## [14] lazyeval_0.2.2
## [15] GetoptLong_1.0.5
## [16] withr_3.0.2
## [17] prettyunits_1.2.0
## [18] gridExtra_2.3
## [19] VennDiagram_1.7.3
## [20] cli_3.6.5
## [21] formatR_1.14
## [22] labeling_0.4.3
## [23] sass_0.4.9
## [24] KEGGgraph_1.66.0
## [25] readr_2.1.5
## [26] Rsamtools_2.22.0
## [27] yulab.utils_0.2.0
## [28] gson_0.1.0
## [29] DOSE_4.0.0
## [30] R.utils_2.12.3
## [31] plotrix_3.8-4
## [32] rstudioapi_0.17.1
## [33] impute_1.80.0
## [34] RSQLite_2.3.9
## [35] generics_0.1.3
## [36] gridGraphics_0.5-1
## [37] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [38] shape_1.4.6.1
## [39] gtools_3.9.5
## [40] dplyr_1.1.4
## [41] GO.db_3.20.0
## [42] Matrix_1.7-1
## [43] futile.logger_1.4.3
## [44] abind_1.4-8
## [45] R.methodsS3_1.8.2
## [46] lifecycle_1.0.4
## [47] yaml_2.3.10
## [48] SummarizedExperiment_1.36.0
## [49] gplots_3.2.0
## [50] qvalue_2.38.0
## [51] SparseArray_1.6.0
## [52] BiocFileCache_2.14.0
## [53] blob_1.2.4
## [54] crayon_1.5.3
## [55] pwalign_1.2.0
## [56] ggtangle_0.0.6
## [57] lattice_0.22-6
## [58] cowplot_1.1.3
## [59] KEGGREST_1.46.0
## [60] pillar_1.10.2
## [61] fgsea_1.32.0
## [62] rjson_0.2.23
## [63] boot_1.3-31
## [64] codetools_0.2-20
## [65] fastmatch_1.1-6
## [66] glue_1.8.0
## [67] ggfun_0.1.8
## [68] data.table_1.17.0
## [69] vctrs_0.6.5
## [70] png_0.1-8
## [71] treeio_1.30.0
## [72] rmdformats_1.0.4
## [73] gtable_0.3.6
## [74] assertthat_0.2.1
## [75] cachem_1.1.0
## [76] xfun_0.49
## [77] S4Arrays_1.6.0
## [78] survival_3.7-0
## [79] iterators_1.0.14
## [80] nlme_3.1-166
## [81] ggtree_3.14.0
## [82] bit64_4.6.0-1
## [83] progress_1.2.3
## [84] filelock_1.0.3
## [85] bslib_0.8.0
## [86] KernSmooth_2.23-24
## [87] splitstackshape_1.4.8
## [88] colorspace_2.1-1
## [89] DBI_1.2.3
## [90] seqPattern_1.38.0
## [91] tidyselect_1.2.1
## [92] bit_4.6.0
## [93] compiler_4.4.2
## [94] curl_6.2.2
## [95] httr2_1.0.6
## [96] graph_1.84.0
## [97] xml2_1.3.8
## [98] DelayedArray_0.32.0
## [99] bookdown_0.41
## [100] scales_1.3.0
## [101] caTools_1.18.3
## [102] RBGL_1.82.0
## [103] rappdirs_0.3.3
## [104] stringr_1.5.1
## [105] digest_0.6.37
## [106] rmarkdown_2.29
## [107] htmltools_0.5.8.1
## [108] pkgconfig_2.0.3
## [109] base64enc_0.1-3
## [110] MatrixGenerics_1.18.0
## [111] dbplyr_2.5.0
## [112] regioneR_1.38.0
## [113] fastmap_1.2.0
## [114] ensembldb_2.30.0
## [115] rlang_1.1.6
## [116] GlobalOptions_0.1.2
## [117] UCSC.utils_1.2.0
## [118] farver_2.1.2
## [119] jquerylib_0.1.4
## [120] jsonlite_2.0.0
## [121] BiocParallel_1.40.0
## [122] GOSemSim_2.32.0
## [123] R.oo_1.27.0
## [124] RCurl_1.98-1.17
## [125] magrittr_2.0.3
## [126] GenomeInfoDbData_1.2.13
## [127] ggplotify_0.1.2
## [128] patchwork_1.3.0
## [129] munsell_0.5.1
## [130] Rcpp_1.0.14
## [131] ape_5.8
## [132] stringi_1.8.7
## [133] zlibbioc_1.52.0
## [134] MASS_7.3-61
## [135] plyr_1.8.9
## [136] parallel_4.4.2
## [137] ggrepel_0.9.6
## [138] splines_4.4.2
## [139] multtest_2.62.0
## [140] hms_1.1.3
## [141] locfit_1.5-9.10
## [142] igraph_2.1.4
## [143] emo_0.0.0.9000
## [144] biomaRt_2.62.0
## [145] futile.options_1.0.1
## [146] XML_3.99-0.18
## [147] evaluate_1.0.1
## [148] universalmotif_1.24.2
## [149] lambda.r_1.2.4
## [150] tzdb_0.4.0
## [151] foreach_1.5.2
## [152] tidyr_1.3.1
## [153] purrr_1.0.4
## [154] clue_0.3-66
## [155] gridBase_0.4-7
## [156] restfulr_0.0.15
## [157] AnnotationFilter_1.30.0
## [158] tidytree_0.4.6
## [159] tibble_3.2.1
## [160] aplot_0.2.5
## [161] memoise_2.0.1
## [162] GenomicAlignments_1.42.0
## [163] cluster_2.1.6
## [164] timechange_0.3.0