--- title: "EBAIIn2 - ChIP-seq Workshop" author: "**Authors**: Pascal Martin, Elodie Darbo, Stephanie Le Gras **Contributors**: Rachel Legendre & Lucie Khamvongsa-Charbonnier" date: '`r format(Sys.Date(), "%B %d, %Y")`' output: rmdformats::material: use_bookdown: true thumbnails: true cards: false params: ensembl: 103 assembly: hg38 organism: Homo sapiens workdir: /shared/projects/2514_ebiii_n2/EBIII_N2/chip-seq # workdir: /DATA/WORK/Enseignement/2025_EBAII_ChIPseq/EBIII_N2/chip-seq mapping.chipseq: stats_mappingChIPseq.tsv peakcalling.chipseq: stats_peakCalling.tsv brg1: BRG1siCTRL_CHIP-seq_peaks.narrowPeak mitf: MITF_CHIP-seq_peaks.narrowPeak sox10: SOX10_CHIP-seq_peaks.narrowPeak brg1.bw: BRG1siCTRL_CHIP-seq.bw mitf.bw: MITF_CHIP-seq.bw sox10.bw: SOX10_CHIP-seq.bw rnaseq: RNAseq_diff_norm.rds annot: annot.rds bibliography: references.bib editor_options: markdown: wrap: 72 --- ```{r setup, include=FALSE} #knitr::opts_knit$set(echo = TRUE, root.dir="~/Documents/EBIII_N2/chip-seq/", fig.path = "images/") knitr::opts_knit$set(echo = TRUE, root.dir="/DATA/WORK/Enseignement/2025_EBAII_ChIPseq/EBIII_N2/chip-seq/", fig.path = "images/") ``` ```{r libLoad, echo=FALSE, warning=FALSE, message=FALSE} library(ChIPseeker) library(genomation) library(ChIPpeakAnno) library(ggplot2) library(RColorBrewer) library(ggsci) #https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html library(GenomicRanges) library(skimr) library(reshape2) library(UpSetR) library(rtracklayer) library(EnrichedHeatmap) library(circlize) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene library(org.Hs.eg.db) library(clusterProfiler) library(pathview) library(Biostrings) library(BSgenome.Hsapiens.UCSC.hg38) library(BCRANK) library(seqLogo) library(MotifDb) ``` ```{r SetParamsHidden, eval=FALSE, echo=FALSE} params <- list(workdir = "/DATA/WORK/Enseignement/2025_EBAII_ChIPseq/EBIII_N2/chip-seq", mapping.chipseq = "stats_mappingChIPseq.tsv", peakcalling.chipseq = "stats_peakCalling.tsv", brg1 = "BRG1siCTRL_CHIP-seq_peaks.narrowPeak", mitf = "MITF_CHIP-seq_peaks.narrowPeak", sox10 = "SOX10_CHIP-seq_peaks.narrowPeak", brg1.bw = "BRG1siCTRL_CHIP-seq.bw", mitf.bw = "MITF_CHIP-seq.bw", sox10.bw = "SOX10_CHIP-seq.bw", rnaseq = "RNAseq_diff_norm.rds", annot = "annot.rds" ) ``` # Introduction During this training session, we are going to use ChIP-seq and RNA-seq data from Laurette et al. [@laurette_transcription_2015] which are available in GEO as [GSE61967](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61967). Data have already been processed. ## Processing of RNA-seq data ### Preprocessing Reads were preprocessed in order to remove adapter, polyA and low-quality sequences (Phred quality score below 20). After this preprocessing, reads shorter than 40 bases were discarded for further analysis. These preprocessing steps were performed using cutadapt [@martin_cutadapt_2011] version 1.10. ### Mapping Reads were mapped onto the hg38 assembly of Homo sapiens genome using STAR [@dobin_star:_2013] version 2.5.3a. ### Quantification Gene expression quantification was performed from uniquely aligned reads using htseq-count version 0.6.1.p1 [@anders_htseqpython_2015], with annotations from Ensembl version 103 and "union" mode. Only non-ambiguously assigned reads have been retained for further analyses. ### Normalization Read counts have been normalized across samples with the median-of-ratios method proposed by Anders and Huber [@AND2010], to make these counts comparable between samples. ### Differential expression analysis Differential expressed genes between wild type ans shBRG1 cells was performed using the `edgeR` R package [@Robinson2010]. We called significant changes when FDR \< 0.01, absolute log fold change over 1 and minimum average log normalized count over 5. ## Processing of ChIP-seq data ### Mapping Reads were mapped to `r params$organism` genome (assembly `r params$assembly`) using Bowtie [@langmead_ultrafast_2009] v1.0.0 with default parameters except for "-p 3 -m 1 --strata --best --chunkmbs 128". The following table shows the number of reads aligned to the `r params$organism` genome. ```{r stats.chipseq, echo=FALSE} align.stat <- read.table(file.path(params$workdir, "data", "ChIPseq", params$mapping.chipseq), sep="\t", quote="", header=T, check.names = F) library(knitr) align.stat = align.stat[order(align.stat$"Sample ID"),] kable(align.stat, caption = "Mapping statistics of ChIP-seq data analyzed during this training session. Column \"Raw reads\" corresponds to the number of input reads. Column \"Aligned reads\" corresponds to the number of reads aligned exactly 1 time. Column \"Multimapped\" corresponds to the number of reads aligned > 1 times. Column \"Unmapped\" corresponds to the number of reads aligned 0 time.", row.names = FALSE) ``` ### Peak Calling Prior to peak calling, reads falling into Encode blacklisted regions [@anshulkundaje_2014] were removed using bedtools intersect v2.26.0 [@quinlan_bedtools:_2010]. Then peak calling was done with Macs2 v2.1.1 with default parameters. ```{r peakcalling, echo=FALSE} peak.stat <- read.table(file.path(params$workdir, "data", "ChIPseq", params$peakcalling.chipseq), sep="\t", quote="", header=T, check.names = F) peak.stat = peak.stat[order(peak.stat$"IP sample"),] kable(peak.stat, caption = "Number of peaks detected", row.names = FALSE) ``` ### Generation of BigWig files Normalized BigWig files were generated using Homer [@heinz_simple_2010] makeUCSCfile v4.11.0 with the following parameter '-norm 1e7' meaning that data were normalized to 10M reads. ## Start a project and define the paths to the data Before starting any project, it is a good idea to setup the path to the project as a variable. For example, we could use: ```{r set_projPath, eval=FALSE, echo=TRUE} projPath <- "/shared/projects/2514_ebiii_n2/chipseq" ## If you are not using RStudio, you can change the working directory with: setwd(projPath) ##NOT RUN ``` If you use an Rmd file and RStudio, when you open the file, your working directory should already be set as the folder where this Rmd file is located. You can check it with: ```{r getworkdir, eval=FALSE, echo=TRUE} getwd() ``` Here, we will use different files for our analyses. We can create a list containing the absolute paths or names to the different files and folders that we will use during the practice session. ```{r setParams, eval=FALSE, echo=TRUE} params <- list( workdir = "/shared/projects/2514_ebiii_n2/chipseq", mapping.chipseq = "stats_mappingChIPseq.tsv", peakcalling.chipseq = "stats_peakCalling.tsv", brg1 = "BRG1siCTRL_CHIP-seq_peaks.narrowPeak", mitf = "MITF_CHIP-seq_peaks.narrowPeak", sox10 = "SOX10_CHIP-seq_peaks.narrowPeak", brg1.bw = "BRG1siCTRL_CHIP-seq.bw", mitf.bw = "MITF_CHIP-seq.bw", sox10.bw = "SOX10_CHIP-seq.bw", rnaseq = "RNAseq_diff_norm.rds", annot = "annot.rds" ) ``` # Importing the peaks and obtaining some basic statistics ## Importing peak data Peak files are in [narrowPeak format](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) which is of the form: 1. chrom - Name of the chromosome (or contig, scaffold, etc.). 2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0 (i.e. 0-based). 3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. 4. name - Name given to a region (preferably unique). Use "." if noname is assigned. 5. score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were "'0"' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000. 6. strand - +/- to denote strand or orientation (whenever applicable). Use "." if no orientation is assigned. 7. signalValue - Measurement of overall (usually, average) enrichment for the region. 8. pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned. 9. qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned. 10. peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called. There are different ways to import such data. We could use a simple `read.table` function but we know that it would not convert the 0-based coordinates to the 1-based coordinates used in R/Bioconductor. The [ChIPseeker](https://www.bioconductor.org/packages/release/bioc/html/ChIPseeker.html) package provides a `readPeakFile` function that we could use like this: ```{r loadData_ChIPseeker, eval=FALSE, echo=TRUE} ##NOT RUN library(ChIPseeker) peaks <- list() peaks[["BRG1"]] <- readPeakFile(file.path(params$workdir,"data", "ChIPseq", params$brg1), as="GRanges") ``` However, the function is essentially a wrapper to the `read.delim` function and does not assign useful column names to the imported data. So instead we will use the `readNarrowPeak` function from the [genomation](https://www.bioconductor.org/packages/release/bioc/html/genomation.html) package to import the *narrowPeak* files obtained for the 3 factors (BRG1, MITF and SOX10): ```{r loadData_genomation, eval=TRUE, cache=TRUE} library(genomation) peaks <- list() peaks[["BRG1"]] <- genomation::readNarrowPeak(file.path(params$workdir, "data", "ChIPseq", params$brg1)) peaks[["MITF"]] <- readNarrowPeak(file.path(params$workdir, "data", "ChIPseq", params$mitf)) peaks[["SOX10"]] <- readNarrowPeak(file.path(params$workdir, "data", "ChIPseq", params$sox10)) ``` The package also provides a `readGeneric` function that can adapt to different peak file formats. For example, the following command is equivalent to what we did above for BRG1: ```{r loadData_readGeneric, eval=FALSE} ## NOT RUN peaks[["BRG1"]] <- readGeneric(file.path(params$workdir,"data", "ChIPseq", params$mitf), meta.cols = list(name = 4, score = 5, signalValue = 7, pValue = 8, qValue = 9, peak = 10), zero.based = TRUE) ``` Let's take a look at how the data was imported: ```{r lookAtImportedPeaks} peaks[["BRG1"]] ``` Peaks are stored as **GenomicRanges** (or **GRanges**) objects; this is an R format which looks like the bed format, but is optimized in terms of memory requirements and speed of execution. > **_NOTE:_**: when using *GRanges* it is always a good idea to set the `seqinfo` > metadata (info about chromosome names and length essentially). We will do that > later below. We can start by computing some basic statistics on the peak sets. ## How many peaks were called? Compute the number of peaks per dataset. We use here a function from the `apply` family which help to apply recursively a given function on elements of the object. We will often use these functions along the course. The `sapply()` function typically takes a list or vector as input and applies the same function (here length) to each element: ```{r peaksNumber1, include=TRUE} sapply(peaks,length) ``` Note that in this specific case, base R provides a `lengths` function (length**s** with an **s** at the end) that does the same, i.e. gives the length of each element of a list object: ```{r peaksNumber2, include = TRUE, eval = FALSE} lengths(peaks) ``` Make a simple `barplot` showing the number of BRG1 peaks.
Show code: barplot ```{r simplebarplot, include=TRUE, out.width="50%", fig.align = 'center'} barplot(lengths(peaks), ylab = "Number of peaks") # or: # barplot(lengths(peaks), ylab = "Number of peaks", log="y") ```
Let's create a barplot with ggplot2 out of this data.\ Step by step:\ *Do not forget to load the ggplot2 library*
Show code: load ggplot2 ```{r loadggplot, include=TRUE} # Load ggplot2 library library(ggplot2) ```
- Create a data.frame with two columns: IP contains names of the chipped TFs and NbPeaks contains the number of peaks.
Show code: create the data.frame ```{r peakLengths, include=TRUE} # create a table with the data to display peak.lengths <- data.frame(IP = names(peaks), NbPeaks = lengths(peaks)) ```
- use ggplot2 with the appropriate geometric object `geom_*`
Show code: barplot with ggplot2 ```{r enhancedbarplot, include=TRUE, out.width="50%",fig.align = 'center'} # make the barplot ggplot(peak.lengths, aes(x = IP, y = NbPeaks)) + geom_col() ```
- Improve the barplot by changing axis labels and the theme
Show code: modify axes labels and theme ```{r enhancedbarplot2, include=TRUE, out.width="50%",fig.align = 'center'} #make the barplot a bit nicer: ggplot(peak.lengths, aes(x = IP, y = NbPeaks)) + geom_col() + labs(x = "ChIP", y = "Number of peaks") + theme_bw() ```
We can customize the plots by adding colors. There are a number of color palettes that you can use in R and different ways to specify colors. For example, the [Rcolorbrewer](https://cran.r-project.org/web/packages/RColorBrewer/index.html) package is widely used and offers several color palettes: ```{r RcolorBrewer, include=TRUE} # See: library(RColorBrewer) par(mar = c(3,4,2,2)) display.brewer.all() ``` The [ggsci](https://cran.r-project.org/web/packages/ggsci/index.html) package offers color palettes based on scientific journals. If you are searching for a specific color, take a look at [this site](https://www.w3schools.com/cssref/css_colors.php). Now lets add colors to the barplot.\ Step by step:\ - Add the new information fill=IP to let ggplot know that colors change based on chipped protein
Show code: color according to the chipped TF (column IP) ```{r enhancedbarplotfollow1, include=TRUE, fig.show="hold", out.width="50%", fig.align = 'center'} ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) + geom_col() + labs(x = "ChIP", y = "Number of peaks") + theme_bw() ```
We want to use colors from `RColorBrewer` library with the "Set1" color palette. To set your own colors to fill the plot, several functions are available (`scale_fill_*`), here we use `scale_fill_brewer`.
Show code: color according to the chipped TF (column IP) and set RColorBrewer palette to Set1 ```{r enhancedbarplotfollow2, include=TRUE, fig.show="hold", out.width="50%", fig.align = 'center'} ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) + geom_col()+ scale_fill_brewer(palette="Set1") + labs(x = "ChIP", y = "Number of peaks") + theme_bw() ```
And now use one of the `ggsci` palette.
Show code: use a ggsci palette ```{r enhancedbarplotfollow3, include=TRUE, fig.show="hold", out.width="50%", fig.align = 'center'} ## Load ggsci library library(ggsci) ## Make the plot with the NPG palette ggplot(peak.lengths, aes(x = IP, y = NbPeaks, fill = IP)) + geom_col()+ scale_fill_npg() + labs(x = "ChIP", y = "Number of peaks") + theme_bw() ```
## How large are these peaks? The peaks were loaded as a `GRanges` objects.\ - To manipulate them we load the `GenomicRanges` library. Then we use the appropriate function to retrieve the **width** of the peaks from the `GRanges` and use a function to summarize the peak length (display the statistics like minimum, maximum, quartiles, etc.)
Show code: Display peak length statistics for BRG1 only ```{r peakWidth_BRG1, message=FALSE, include=TRUE} ## we use the function width() from GenomicRanges library(GenomicRanges) summary(width(peaks$BRG1)) # you can also try with the skim function from the skimr package: # skimr::skim(width(peaks$BRG1)) ```
Create a simple `boxplot` of the peak sizes.
Show code: Display peak length statistics as a boxplot ```{r simpleboxplot, message=FALSE, include=TRUE, out.width="50%", fig.align = 'center'} # use a log scale on the y axis to facilitate the comparisons peak.width <- lapply(peaks, width) boxplot(peak.width, ylab = "peak width", log = "y") ```
Now, create a nice looking boxplot with `ggplot2`. `ggplot` takes a data frame as input. We can either create a date frame, this is what we have done when we created the barplot. Here, we are going to use the package `reshape2` which, among all its features, can create a data frame from other types of data, in particular list objects, via the `melt` function: ```{r reshape2, message=FALSE, include=TRUE} # Load the package library(reshape2) peak.width.table <- melt(peak.width) head(peak.width.table) ``` Create a ggplot object with correct aesthetics to display a boxplot according the chipped TF (x-axis) and their width (y-axis). Then use appropriate geometric object `geom_*`.
Show code: Display peak length statistics with ggplot2 ```{r boxplot, message=FALSE, include=TRUE, out.width="50%", fig.align = 'center'} # create boxplot ggplot(peak.width.table, aes(x = L1, y = value)) + geom_boxplot() ```
By default, the background color is grey and often does not allow to highlight properly the graphs. Many `theme` are available defining different graphics parameters, they are often added with a function `theme_*()`, the function `theme()` allows to tune your plot parameter by parameter. We propose here to use `theme_classic()` that changes a grey background to white background. Some peaks are very long and squeeze the distribution, one way to make the distribution easier to visualize is to transform the y axis to log scale. Finally, we want to remove the x label, change the y label to 'Peak widths' and the legend relative to the fill color from "L1" to "TF".\ Using the ggplot2 documentation (*e.g.* [tidyverse](https://ggplot2.tidyverse.org/) or [google it](https://www.letmegooglethat.com/?q=ggplot+change+label), try to enhance the boxplot.
Show code: Enhance it ! ```{r enhancedboxplot2, message=FALSE, include=TRUE, out.width="50%", fig.align = 'center'} # - theme_classic() change grey background to white background # - fill=L1 and scale_fill_brewer(palette="Set1") colors boxplots # based on chipped protein and with colors from RColorBrewer Set1 palette # - labs changes x and y axis labels and legend title # - scale_y_log10() set y axis to a log scale so that we can have a nice # view of the data in small values ggplot(peak.width.table, aes(x = L1, y = value, fill = L1)) + geom_boxplot()+ theme_classic()+ scale_fill_brewer(palette = "Set1")+ labs(x = "", y = "Peak sizes", fill = "TF")+ scale_y_log10() ```
## Peak filtering To make sure we keep only high quality data. We are going to select those peaks having a $qvalue >= 4$. In the *GRanges* metadata columns (`mcols`) one of them is the qvalue. So, we are going to set a threshold on this column. How many peaks are selected ?
Show the code: select high quality peaks for BRG1 and count them ```{r peakFilteringBRG1, message=FALSE, include=TRUE} ## Select BRG1 peaks with qvalue>=4 and count them length( peaks$BRG1[peaks$BRG1$qvalue >= 4] ) ```
Now we want to apply this function to all IPs.
Show the code: filter the peaks for all IPs ```{r peakFiltering, message=FALSE, include=TRUE} ## Select BRG1 peaks with qvalue>=4 and count them peakfilt <- lapply(peaks, function(x) { x[x$qvalue >= 4] }) lengths(peakfilt) ```
# Peak annotation relative to genomic features and peak overlaps ## Where are the peaks located across the whole genome? Sometime, peaks may occur more on some chromosomes than others or there can be regions where peaks are absent (example: repetitive regions that have been filtered out). We can display the genomic distribution of peaks along the chromosomes, using the `covplot` function from `ChIPSeeker`. Height of peaks is drawn based on the peak scores. Let's use this function on *SOX10* for example: ```{r peakGenomeDistribution, cache=TRUE, include=FALSE, fig.align = 'center'} # genome wide distribution of SOX10 peaks covplot(peakfilt$SOX10, weightCol = "score") # chromosome wide distribution of SOX10 peaks covplot(peakfilt$SOX10, chrs = c("chr1", "chr2"), weightCol = "score") ``` ## Where are the peaks located relative to genomic features? Here, the aim is to identify where the peaks tend to be located relative to typical genome annotations such as promoters, exons, introns, intergenic regions, etc. We will assign peaks to genomic features (genes, introns, exons, promoters, distal regions, etc.) by checking if they overlap with these features. We will illustrate this task using functions from the *ChIPseeker* package. But first we need to retrieve the annotations ### Retrieve genome annotation (genes, transcripts, exons, etc.) We load ready-to-use annotation objects from Bioconductor: - a OrgDB object: `org.Hs.eg.db` and - a TxDB object: `TxDb.Hsapiens.UCSC.hg38.knownGene` ```{r LoadOrgDbTxDb, eval=TRUE, message=FALSE, include=TRUE} ## org.Hs.eg.db is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers. library(org.Hs.eg.db) ## Load transcript-oriented annotations library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene ``` Since we now have a *TxDb* object that contains the `seqinfo` metadata, we take the opportunity to transfer this metadata to our peaks. ```{r seqinfo_peaks, include = TRUE, eval=TRUE} peaks <- lapply(peaks, function(x) {seqinfo(x) <- seqinfo(txdb); return(x)}) peakfilt <- lapply(peakfilt, function(x) {seqinfo(x) <- seqinfo(txdb); return(x)}) ``` ### Annotate the peaks relative to genomic features This is done using the function `annotatePeak` which compares peak positions with the coordinates of genomic features on the same reference genome. This function returns a complex object which contains all this information. As we did for the peaks coordinates, we store the peak annotations in a list that we first initialize as an empty list `peakAnno <- list()`. Here, we define the TSS regions -1000 bp to +100 bp from the TSS itself. Using the help of the function and annotation objects we have loaded, try to write the correct command line to annotate BRG1 peaks.
Show code: annotate BRG1 peaks ```{r annotatePeaksBRG1, eval=TRUE, message=FALSE, include=TRUE, cache=TRUE} ## Annotate peaks for all datasets and store it in a list ## Here TSS regions are regions -1000Kb/+100b arount TSS positions ## Peak annotations are stored in a list peakAnno <- list() peakAnno[["BRG1"]] = annotatePeak(peakfilt$BRG1, tssRegion = c(-1000, 100), TxDb = txdb, annoDb = "org.Hs.eg.db") class(peakAnno$BRG1) ## Visualize and export annotation as a data table # as.data.frame(peakAnno$BRG1) head(as.data.frame(peakAnno$BRG1)) ```
All peak information contained in the peak list is retained in the output of `annotatePeak`. Positions and strand information of the nearest genes are reported. The distance from the peak to the TSS of its **nearest gene** is also reported. The genomic region where the peak lies is reported in the *annotation* column. Since some annotation may overlap (e.g. a 5' UTR is always located in an exon), ChIPseeker adopted the following priority in genomic annotations: - Promoter - 5' UTR - 3' UTR - Exon - Intron - Downstream - Intergenic - Downstream (i.e. downstream of the gene 3' end). This hierachy can be customized using the parameter *genomicAnnotationPriority*. The annotatePeak function reports detailed information when the annotation is Exon or Intron, for instance *"Intron (ENST00000624697.4/9636, intron 1 of 2)"* means that the peak is overlapping with the first intron of transcript *ENST00000624697.4*, whose corresponding Entrez gene ID is *9636* (Transcripts that belong to the same gene ID may differ in splice events). The "annoDb" parameter is optional. If it is provided, some extra columns including *SYMBOL*, *GENENAME*, *ENSEMBL/ENTREZID* are added. -- Reminder: The TxDb class is a container for storing transcript annotations. - Bioconductor provides several packages containing *TxDb* objects for model organisms sur as Human and mouse. For instance, `TxDb.Hsapiens.UCSC.hg38.knownGene`, `TxDb.Hsapiens.UCSC.hg19.knownGene` for human genome hg38 and hg19, `TxDb.Mmusculus.UCSC.mm10.knownGene` and `TxDb.Mmusculus.UCSC.mm9.knownGene` for mouse genome mm10 and mm9, etc. - The user can also prepare their own *TxDb* by retrieving information from UCSC Genome Bioinformatics and BioMart data resources using the R functions `makeTxDbFromBiomart` and `makeTxDbFromUCSC` that are now in the [txdbmaker](https://bioconductor.org/packages/release//bioc/html/txdbmaker.html) package. - One can also create a *TxDb* objects for any organism using an annotation file in GTF/GFF format using the function `makeTxDbFromGFF` of the *txdbmaker* package (*GenomicFeatures* package in older R versions).
Expand to find *Coturnix japonica* example ```{r createTxDbGTF, eval=FALSE} ## NOT RUN ## download GTF file download.file("https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/93934/101/GCF_001577835.2_Coturnix_japonica_2.1/GCF_001577835.2_Coturnix_japonica_2.1_genomic.gtf.gz", "Coturnix_japonica_2.1.annotation.gtf.gz") ## Build TxDb object library(GenomicFeatures) txdb = makeTxDbFromGFF("Coturnix_japonica_2.1.annotation.gtf.gz") ## To save the txdb database library(AnnotationDbi) saveDb(txdb, 'txdb.Coturnix_japonica_2.1.sqlite') ## load it when needed library(AnnotationDbi) txdb = loadDb(file = 'txdb.Coturnix_japonica_2.1.sqlite') ```
-- ### Plot the results of peak annotation We can vizualize the way our peaks have been annotated relative to genomic features (introns, exons, promoters, distal regions,...) using a pie chart with `plotAnnoPie` or as a bar chart `plotAnnoBar` which facilitates the comparison between different sets of peaks.
Show code: Display peak annotations of BRG1 peaks using a pie chart ```{r plotAnnoPie_BRG1, include=TRUE, eval=TRUE, fig.align = 'center'} ## distribution of genomic features for BRG1 peaks # as a pie chart - which is the most widely used representation in publication plotAnnoPie(peakAnno$BRG1) ```
Now we generate annotations for SOX10 and MITF peaks.
Show code: Generate annotations for SOX10 and MITF peaks ```{r annotatePeak_SOX10_MITF, include=TRUE, eval=TRUE, warning=FALSE, message=FALSE, cache=TRUE} ## Use the annotatePeak function peakAnno[["SOX10"]] <- annotatePeak(peakfilt$SOX10, tssRegion = c(-1000, 100), TxDb = txdb, annoDb = "org.Hs.eg.db") peakAnno[["MITF"]] <- annotatePeak(peakfilt$MITF, tssRegion = c(-1000, 100), TxDb = txdb, annoDb = "org.Hs.eg.db") ## We could also have generated annotations for all factors at the same time using: # peakAnno <- lapply(peakfilt, annotatePeak, tssRegion=c(-1000, 100), TxDb=txdb, annoDb="org.Hs.eg.db") ```
Now display the results as a stacked barplot using `peakAnnoBar`.
Show code: Generate stacked barplot of peak annotations ```{r plotAnnoBar, include=TRUE, eval=TRUE} ## Use the plotAnnoBar function plotAnnoBar(peakAnno) ```
However since some annotation overlap, ChIPseeker also provides functions that help evaluate full annotation overlaps. For example, the `upsetplot` function which produces an [UpSet plot](https://en.wikipedia.org/wiki/UpSet_plot) similar to those produced by the `library(UpSetR)` package. ```{r upsetplot, include=TRUE, eval=TRUE, warning=FALSE, fig.align = 'center', cache=TRUE} upsetplot(peakAnno$BRG1) ``` ## Do the peaks of the different factors overlap? ### Venn Diagram We see that the peaks for the 3 factors SOX10, MITF and BRG1 are present to a certain extent in similar regions (although this could be due to the proportions of these regions in the genome). But this doesn't tell us if the peaks actually overlap. An overlap between the peaks may mean that the factors co-localize, although we won't know if it's the case in single cells. Overlaps between peak datasets can be evaluated using a Venn diagram. Such an approach is implemented in the [ChIPpeakAnno](https://www.bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html) package. After loading the library, we start by computing the overlap between peak sets, keeping the information of all peaks overlapping in each set (see `?findOverlapsOfPeaks` for help). ```{r ovlPeaks, include=TRUE, eval=TRUE, warning=FALSE, message=FALSE, cache=TRUE} library(ChIPpeakAnno) ovl <- findOverlapsOfPeaks(peakfilt, connectedPeaks = "keepAll") ``` *ChIPpeakAnno* imposes, while plotting the Venn diagram, to compute the significance of the pairwise associations using a hypergeometric test. To this end, we need to estimate the number of all potential binding events which is used by the `makeVennDiagram` function through the `totalTest` number. It is used for the hypergeometric sampling that is used to determine if the overlap between two datasets is more than would be expected by chance. This is not a trivial question, the answer is driven by what you know about the binding properties of your factors (eg. sequence specific, mainly intergenic etc). You can find an interesting discussion [here](https://stat.ethz.ch/pipermail/bioconductor/2010-November/036540.html). In our case we can refer to the genomic distribution of the peaks that we have plotted previously. We can assume that our TFs have a gene body binding preference. Genes cover roughly 10% of the genome. ```{r defineHyperTot, include=TRUE, eval=TRUE, warning=FALSE, message=FALSE} # Estimate the average size of the peaks ... averagePeakWidth <- mean(width(unlist(GRangesList(ovl$peaklist)))) # ... to count how many potential sites could have been bound in coding regions. tot <- ceiling(3.3e+9 * 0.1 / averagePeakWidth) ``` **TIPS**: We can define the colors attributed to each set using the `colours()` function. Whenever you want to set a color you can use this function. Here are some useful pages about colors in R and elsewhere: - [colorchart](https://github.com/EarlGlynn/colorchart/wiki/Color-Chart-in-R) - [color palettes](https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf) - [viridis palette](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) - [CSS colors](https://www.w3schools.com/cssref/css_colors.php) Now we can plot the overlap of the peaks as a Venn Diagram: ```{r VennAnno, include=TRUE, fig.width=7, fig.height=7, warning=FALSE, message=FALSE, results='hide', fig.align = 'center', cache=TRUE} makeVennDiagram(ovl, totalTest = tot, connectedPeaks = "keepAll", fill = brewer.pal(3, "Set1"), # circle fill color col = brewer.pal(3, "Set1"), #circle border color cat.col = brewer.pal(3, "Set1")) ``` According to the hypergeometric test p-values all pairwise comparisons are highly significant. This has to be taken very carefully as it largely depends on the background estimation that we may have over-estimated. If you are not sure about your estimation, you can use a non-parametric approach based on your peak genomic distribution to estimate randomness. To this end the `TxDb.Hsapiens.UCSC.hg38.knownGene` is used. #### Are the overlaps significant? ChIPpeakAnno provides the `preparePool` and `peakPermTest` functions to compute this test. These tests are made by pair, we will look at SOX10/MITF overlap significance as an example. ```{r permPeaks, include=TRUE, warning=FALSE, message=FALSE, cache=TRUE, fig.align = 'center', cache=TRUE} # Prepare a pool of random peaks following the characteristics of our peak sets pool <- preparePool(txdb, peaks$SOX10, bindingType = "TSS", featureType = "transcript", seqn = paste0("chr",c(1:22, "X", "Y"))) # Create the permPool object needed for peakPermTest pool <- new("permPool", grs=pool$grs[1], N=length(peaks$SOX10)) SOX10.MITF <- peakPermTest(peaks$SOX10, peaks$MITF, pool = pool, seed = 123, force.parallel = FALSE) SOX10.MITF plot(SOX10.MITF) ``` Venn diagrams are widely used to represent overlaps, intersections. However, in ChIP-seq analysis, the definition of the peaks is dependent on the peak caller, the FDR thresholds (what about peaks just below the threshold?). The overlap is also difficult to assess, indeed do we call an overlap of 1 nucleotide an intersection ? It is one of the numerous parameters that can be tuned... Another way to explore the results is to evaluate if different combinations of the factors are associated to specific genomic features. We can use again the `annotatePeak` function to evaluate this: ```{r annotPeakVennSets, include=TRUE, message = FALSE, warning=FALSE, cache=TRUE} coocs <- as(ovl$peaklist, "GRangesList") # apply annotatePeak to each set of peaks in the list coocs peakAnnoList <- lapply(coocs, annotatePeak, TxDb = txdb, annoDb = "org.Hs.eg.db", tssRegion = c(-1000, 100), verbose = FALSE) ``` And then plot the results: ```{r plotannotPeakVennSets, include=TRUE, cache=TRUE, fig.align='center'} # Plot peak distribution relative to gene features plotAnnoBar(peakAnnoList) # Plot peak distribution relatively to their distance to the TSS plotDistToTSS(peakAnnoList) ``` # Heatmaps and average profiles Heatmaps and average profiles are widely used representations of ChIP-seq data as they allow simultaneous visualization of read enrichment at various locations. For instance, one may want to represent reads related to a chipped protein in regions spanning +/-5Kb around all TSS of the reference genome. Another objective could be to compare read enrichment at the same locations in many chip-seq datasets. ## Distribution of BRG1 peaks We want to know if BRG1 is binding large and/or narrow regions, unique and/or tandem etc. So we will study the distribution of the BRG1 peaks that have been called in a region spanning +/-2Kb around the peak centers. First of all, for calculating the profile of ChIP peaks around the BRG1 peak center, we need to define BRG1 peak centers and extend them 2Kb on each side. We will then select the 10000 best peaks using the q-value to get a lighter matrix. Then, we need to "align" the peaks that are mapped to these regions, and thus generate a matrix with the different regions in row and the position around the peak centers in columns. The resulting matrix will be 2000\*2+1 columns and 10000 rows. Step by step:\ - Compute the centers of BRG1 peaks using the GenomicRanges `resize`` function.
Show code: compute the peak centers for BRG1 ```{r BRG1peakcenters, message=FALSE, eval=TRUE, cache=TRUE} # Compute the centers of BRG1 peaks peak_centers_BRG1 <- resize(peakfilt$BRG1, width = 1, fix = "center") ``` Note that the `mid` function applied to a *GRanges* object returns the centers of the ranges but as an integer vector, not a *GRanges* object.
- Now select the BRG1 peak centers that correspond to the 10,000 most significant BRG1 peaks (based on qvalue)
Show code: Select peak centers for the top 10k peaks ```{r BRG1peaktop10K, message=FALSE, eval=TRUE, cache=TRUE} # For computation and memory efficiency reasons, # we subset the top 10K peaks according to the qvalue top10k_BRG1 <- peak_centers_BRG1[order(peak_centers_BRG1$qvalue, decreasing = TRUE)][1:10000] ```
- Create a `GRanges` object with the regions corresponding to +/-2kb around these peak centers.
Show code: Create GRanges with +/-2kb around top10k BRG1 peak centers ```{r BRG1peakcenterGR, message=FALSE, eval=TRUE, cache=TRUE} # Generate GRanges of peak centers +/-2kb for the 10K top peaks top10k_BRG1_extend2kb <- top10k_BRG1 + 2000 # Note that we could have obtained this GRanges for all peaks directly using: # BRG1_extend2kb <- resize(peakfilt$BRG1, width = 4001, fix = "center")) ```
### BRG1 heatmap and average profile with ChIPseeker First, we will use the functions available in the *ChIPseeker* package to generate the heatmap. Later, we will compare the results to those obtained with the *genomation* package. - *ChIPseeker* actually requires that we generate the regions of interest (top10k BRG1 peak centers +/-2kb) using the `makeBioRegionFromGranges`.
Show code: prepare the "BioRegions" using ChIPseeker function ```{r top10kBRG1BioRegions, message=FALSE, eval=TRUE, cache=TRUE} BRG1_extend_2kb <- makeBioRegionFromGranges(top10k_BRG1, by = "peaks", type = "start_site", upstream = 2000, downstream = 2000) ```
- We now have the regions that we need to map the peaks to. Create the matrix using the `getTagMatrix` function and the parameter `windows`.
Show code: Create the matrix ```{r tagMatrixAroundBRG1center, message=FALSE, warning=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## compute the coverage of peaks in these regions BRG1_tagmat <- getTagMatrix(peakfilt$BRG1, windows = BRG1_extend_2kb) ```
- Now display the matrix with `tagHeatmap`.
Show code: Display the heatmap ```{r tagHeatmap_BRG1, warning = FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## plot the heatmap tagHeatmap(BRG1_tagmat) ```
We will see below that the rows have been reordered by their average signal (i.e. what you would get by using `rowMeans` on the tagMatrix). Here we used the centers of BRG1 peaks to define the regions of interest, so it is not surprising that peaks are nicely aligned on these centers and, because of the sorting applied to the rows, that wide peaks are at the top and narrower peaks at the bottom of the heatmap. - Now let's see how the distribution of peaks look around the TSS. *ChIPseeker* provides an easy function to do this: `peakHeatmap` that uses directly the *TxDb* object that we used previously.
Show code: Plot BRG1 peak distribution around the TSS ```{r plotHeatmap_BRG1_TSS, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## plot the heatmap of BRG1 peaks around the TSS peakHeatmap(peakfilt$BRG1, TxDb = txdb, upstream = 2000, downstream = 2000) ```
We can also display the average profile of BRG1 peaks across these regions. This corresponds to calculating the `mean` signal of each column of the matrix and plotting the result. So these "average profiles" are sometimes plotted above the heatmap. In *ChIPseeker* we can do this on the *tagMatrix* object using the `plotAvgProf` function.
Show code: Plot the average profile of BRG1 peaks around their centers ```{r plotAvgProf_BRG1, message=FALSE, eval=TRUE, fig.align = 'center', cache=TRUE} ## compute the coverage of peaks in these regions plotAvgProf(BRG1_tagmat, xlim = c(-2000, 2000), xlab = "Distance to peak center", ylab = "Peak Count Frequency") ``` > **_NOTE:_**: this is especially useful when groups of regions have been > defined on the matrix, for example representing different binding profiles > of the factor.
### BRG1 heatmap with genomation Now let's see how to obtain similar results with the *genomation* package. Step by step:\ - The equivalent of the `tagMatrix` function is `SignalMatrix`, but the latter directly accepts a regular *GRanges* object to specify the windows / regions of interest.
Show code: Prepare the matrix ```{r ScoreMatrix_BRG1_peaks, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Prepare the matrix of BRG1 peaks across BRG1 peak centers +/-2kb sm_BRG1 <- ScoreMatrix(target = peakfilt$BRG1, windows = top10k_BRG1_extend2kb) ```
- Plot the heatmap with `heatMatrix`.
Show code: Plot the heatmap ```{r heatMatrix_BRG1, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Plot the heatmap heatMatrix(sm_BRG1, xcoords = c(-2000, 2000)) ```
You notice that the heatmap doesn't look like the one generated by the *ChIPseeker* function `tagHeatmap`. This is because by default, the `heatMatrix` function does not reorder the rows. Since we have sorted the rows by decreasing qvalue, we can verify this by using the qvalue as a weight when calculating the coverage of BRG1 peaks over the regions. This is done by using the `weight.col` argument in `ScoreMatrix`. - Color the coverage by qvalue using the `weight.col` argument in `ScoreMatrix`.
Show code: Plot the heatmap with qvalue as weight ```{r heatMatrix_BRG1_qval, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Plot the heatmap using qvalue as weight heatMatrix(ScoreMatrix(target = peakfilt$BRG1, windows = top10k_BRG1_extend2kb, weight.col = "qvalue"), xcoords = c(-2000, 2000)) ```
- Now use the `order` argument in `heatMatrix` to reorder the rows
Show code: Plot the heatmap reordering the rows ```{r heatMatrix_BRG1_order, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Plot the heatmap with reordered rows heatMatrix(sm_BRG1, xcoords = c(-2000, 2000), order = TRUE) ```
Take a look at `?heatMatrix` and at the *genomation* vignette, there are other useful arguments, for example to make groups or cluster rows. - Now let's plot the distribution of BRG1 peaks around the TSS using *genomation*. To do this, we will use the `gene` and `promoters` functions on the *TxDb* object to extract the coordinates of the regions around the TSS.
Show code: Heatmap of BRG1 peaks around the TSS with *genomation* ```{r heatMatrix_BRG1_TSS, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## we also add a confidence interval around the mean heatMatrix(ScoreMatrix(target = peakfilt$BRG1, windows = promoters(genes(txdb), upstream = 2000, downstream = 2001)), xcoords = c(-2000, 2000), order = TRUE) ``` Again, we notice that *ChIPseeker* function had filtered out all rows with zero coverage around the TSS, while *genomation* function does not.
- *genomation* has the `plotMeta` function to create average profiles
Show code: Make an average profile with genomation ```{r plotMeta_BRG1, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Average profile with genomation ## we also add a confidence interval around the mean plotMeta(sm_BRG1, xcoords = c(-2000, 2000), dispersion = "se") ```
### Multiple heatmaps: comparing factors across identical regions Heatmaps are useful to compare the profiles obtained from different factors on the same regions. For examples, we could ask: How is the signal for BRG1 and MITF distributed around SOX10 peaks. *genomation* has useful functions to analyse a list of several ChIP-seq results, provided as *GRanges* or *RleList* objects. Step by step:\ - Get the +/-2kb regions around SOX10 peak centers.
Show code: Get +/-2kb regions around SOX10 peak centers ```{r SOX10_extend2kb, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Obtain regions of interest (SOX10 peak centers +/- 2kb) SOX10_extend2kb <- resize(peakfilt$SOX10, width = 4001, fix = "center") ```
- Use the `ScoreMatrixList` function to obtain the profiles of all factors in these regions.
Show code: Get the ScoreMatrix for each factors ```{r ScoreMatrixList_SOX10_centers, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Get the coverage of peaks for all factors around SOX10 peak centers sm_sox10peaks <- ScoreMatrixList(target = peakfilt[c("SOX10", "BRG1", "MITF")], windows = SOX10_extend2kb) ```
- Plot the heatmaps side by side using `multiHeatMatrix`.
Show code: Plot the 3 heatmaps side by side ```{r multiHeatMatrix_SOX10_centers, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Get the coverage of peaks for all factors around SOX10 peak centers ## We order the rows by the average signal across all matrices and use a common color scale for all heatmaps multiHeatMatrix(sm_sox10peaks, xcoords = c(-2000, 2000), order = TRUE, common.scale = TRUE) ```
There seems to be little link between SOX10 and MITF binding but there is some link between BRG1 binding and SOX10 ## Read enrichment in regions of interest So far, we're only looking at heatmaps of peak presence/absence. Even if we sort the regions by their intensity or by qvalue, this gives little quantitative information on the actual links between the factors. To dig into this, we will look at the coverage of reads, instead of peaks, in specific regions. ### Heatmap of BRG1 read coverage We will first look at the enrichment of reads from the BRG1 ChIP in regions covering +/-5kb around BRG1 peak centers. Because this represent ~10kb regions we will create bins and summarize peak density within these bins. Step by step:\ - Get the +/-5kb regions around BRG1 peak centers (only the top10K peaks).
Show code: Get +/-5kb regions around BRG1 peak centers ```{r extended.5K.BRG1, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Obtain regions of interest (BRG1 peak centers +/- 5kb) extended.5K.BRG1 <- top10k_BRG1 + 5000 ```
- Import the read coverage for BRG1 ChIP.
Show code: Import the bigwig file for BRG1 using *rtracklayer* `import` function ```{r cvg.BRG1, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Import BRG1 bigwig file as an RleList cvg.BRG1 <- import(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw), format = "BigWig", which = extended.5K.BRG1, as = "RleList") cvg.BRG1 ``` The read coverage is imported as an `RleList` where each element represents a chromosome and is a long vector of coverage values along this chromosome, represented as a read-length encoded (`Rle`) vector. Here, we used the *rtracklayer* `import` function but below we will also illustrate the `readBigWig` function from the *genomation* package.
- Summarize the coverage in 200 bins.
Show code: Use the `ScoreMatrixBin` function from *genomation* tu summarize the coverage in 200 bins ```{r smr_BRG1, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Obtain the matrix with 200 bins smr_BRG1 <- ScoreMatrixBin(target = cvg.BRG1, windows = extended.5K.BRG1, bin.num = 200) ```
- Plot the heatmap.
Show code: Use the `heatMatrix` function to plot the heatmap ```{r smr_BRG1_heatMatrix, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Plot the heatmap, ordering the regions by their average intensity heatMatrix(smr_BRG1, xcoords = c(-5000, 5000), order = TRUE) ```
- A few extreme values in the matrix are generating a color scale that doesn't allow us to visualize the majority of the data. Use the `winsorize` argument in `heatMatrix` to correct the color scale.
Show code: Use the `winsorize` argument in `heatMatrix` to adjust the color scale ```{r smr_BRG1_heatMatrix_winsorized, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Plot the heatmap, ordering the regions by their average intensity heatMatrix(smr_BRG1, xcoords = c(-5000, 5000), winsorize = c(0, 99.9), order = TRUE) ```
- Try the `scaleScoreMatrix` function from *genomation* to see the effect on the heatmap
Show code: Plot the heatmap of a scaled matrix ```{r smr_BRG1_heatMatrix_scaled, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Plot the heatmap for the scaled matrix heatMatrix(scaleScoreMatrix(smr_BRG1), xcoords = c(-5000, 5000), order = TRUE) ```
- Identify different patterns of read coverage using the `clustfun`argument
Show code: Use the `clustfun` argument to group the peaks based on their patterns of read coverage by using k-means clustering (e.g. 5 groups) ```{r smr_BRG1_heatMatrix_kmeans, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Make 5 groups of peaks based on the pattern of read coverage around the peaks set.seed(123) # for reproducibility since we're using k-means clustering heatMatrix(smr_BRG1, xcoords = c(-5000, 5000), winsorize = c(0, 99.9), clustfun = function(x) kmeans(x, centers=5)$cluster) ``` In practice we would probably prefer to perform the clustering outside of the call to the `heatMatrix` function. But when the `clustfun` argument is used, the function returns the clustering result so that we can reuse it for other analysis.
### MultiHeatmap to compare read coverage of the 3 factors Now let's investigate the read coverage of the 3 factors around SOX10 peaks. We will choose a window of +/-3kb around SOX10 peaks and sort them by decreasing qvalue. We will also summarize the coverages across these regions using 300 bins. Step by step:\ - Get the +/-3kb regions around SOX10 peak centers sorted by decreasing qvalues.
Show code: Get +/-3kb regions around sorted SOX10 peak centers ```{r SOX10_extend3kb_sorted, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Obtain regions of interest (SOX10 peak centers +/- 2kb) SOX10_extend3kb_sorted <- resize(peakfilt$SOX10[order(peakfilt$SOX10$qvalue, decreasing = TRUE)], width = 6001, fix = "center") ```
- Import the bigwig files for the 3 factors using the `readBigWig` function from the *genomation* package.
Show code: Import read coverage for the 3 factors ```{r import_readBigWig, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Import read coverage for the 3 factors cvg <- list() cvg[["SOX10"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$sox10.bw), windows = SOX10_extend3kb_sorted) cvg[["BRG1"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw), windows = SOX10_extend3kb_sorted) cvg[["MITF"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$mitf.bw), windows = SOX10_extend3kb_sorted) ```
- Get the ScoreMatrices using the `ScoreMatrixList`
Show code: Compute score matrices for the 3 factors using 300 bins ```{r smr_at_SOX10peaks, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Obtain the score matrices smr_at_SOX10peaks <- ScoreMatrixList(target = cvg, windows = SOX10_extend3kb_sorted, bin.num = 300) ```
- Plot the heatmaps (use winsorization to adjust the color scale)
Show code: Plot the heatmaps using `multiHeatMatrix` ```{r multiHeatMatrix_SOX10peaks, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Plot the heatmaps side-by-side multiHeatMatrix(smr_at_SOX10peaks, xcoords = c(-3000, 3000), winsorize = c(0, 99.9)) ``` Note that if you use the argument `order = TRUE`, the rows will be sorted by the average of the signal across the 3 matrices, which is why it is important to sort the regions of interest before plotting the heatmaps and to use a function that keeps this row order.
### Customized heatmaps with the EnrichedHeatmap package So far, we have used functions from *ChIPseeker* and *genomation* to plot the heatmaps. While these functions offer some level of customization, you may need further customization in certain cases. The [EnrichedHeatmap](https://github.com/jokergoo/EnrichedHeatmap) package offers an almost complete level of customization to plot heatmaps and heatmap annotations from functional genomic experiments. The package is built on the [ComplexHeatmap](https://github.com/jokergoo/ComplexHeatmap) package, which has an extensive documentation and examples. As an example, let's plot the enrichment of SOX10 and BRG1 reads around SOX10 peaks. Step by step:\ - First convert the *ScoreMatrix* objects to *normalizedMatrix* objects that are used by *EnrichedHeatmap*. You can create a normalizedMatrix using the `normalizeToMatrix` function but since we already computed the matrices, we will instead convert them using `as.normalizedMatrix`
Show code: Convert a ScoreMatrix to a normalizedMatrix ```{r normalizedMatrix, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Convert signal matrices to normalizedMatrix objects normat <- list() ## For BRG1 signal: normat[["BRG1"]] <- as.normalizedMatrix( as(smr_at_SOX10peaks[["BRG1"]], "matrix"), k_upstream = 150, k_downstream = 150, k_target = 0, extend = c(3000, 3000), signal_name = "BRG1", target_name = "SOX10 peak centers" ) ## For SOX10 signal: normat[["SOX10"]] <- as.normalizedMatrix( as(smr_at_SOX10peaks[["SOX10"]], "matrix"), k_upstream = 150, k_downstream = 150, k_target = 0, extend = c(3000, 3000), signal_name = "SOX10", target_name = "SOX10 peak centers" ) ```
- Plot BRG1 heatmap using default parameters
Show code: Plot BRG1 EnrichedHeatmap ```{r EnrichedHetmap_defaults_BRG1, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## BRG1 EnrichedHeatmap EnrichedHeatmap(normat$BRG1, name = "BRG1") ``` We notice that the heatmap is sorted by increasing average signal across the regions.
- Now use the syntax of *EnrichedHeatmap* to plot both SOX10 and BRG1 heatmaps
Show code: Plot SOX10 + BRG1 EnrichedHeatmaps ```{r EnrichedHetmap_defaults_SOX10_BRG1, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## SOX10 + BRG1 EnrichedHeatmap EnrichedHeatmap(normat$SOX10, name = "SOX10") + EnrichedHeatmap(normat$BRG1, name = "BRG1") ``` The first heatmap (SOX10) is ordered by increasing average signal and the second heatmap (BRG1) follows the same order as the first one so that we can compare the signal of the 2 factors in the same regions.
- Now let's cluster the BRG1 signal into 5 groups using kmeans
Show code: cluster BRG1 signal into 8 groups ```{r kmeans_BRG1, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## make 5 clusters of row using kmeans set.seed(1234) #for reproducibility km5 <- kmeans(as.matrix(normat$BRG1), centers=5) # keep the asignment of the rows to clusters part5clust = km5$cluster ```
- Take a look at the quantiles of the 2 signal matrices to choose the color scales
Show code: Based on the quantiles of the signal matrices create color scales for the heatmaps. The color scales are created with the `colorRamp2` function from the [circlize](https://github.com/jokergoo/circlize) package which linearly interpolates colors according to break values. ```{r colorRamp2_BRG1 and SOX10, warning=FALSE, message=FALSE, eval=TRUE, cache=TRUE} ## Look at the quantiles of the BRG1 matrix quantile(as(smr_at_SOX10peaks[["BRG1"]], "matrix"), seq(0.95,1,0.01)) ## and for SOX10 quantile(as(smr_at_SOX10peaks[["SOX10"]], "matrix"), seq(0.95,1,0.01)) ## We pick 15 as the max color value (~99th quantile of the matrix) for BRG1 col_brg1 <- colorRamp2(breaks = c(0,10,15), colors = c("white", "blue","black")) ## and 8 for SOX10 col_sox10 <- colorRamp2(breaks = c(0,4,8), colors = c("white", "orange","red")) ```
- Plot the heatmaps of the partitioned SOX10 and BRG1 signals at SOX10 binding sites using EnrichedHeatmap
Show code: EnrichedHeatmaps of SOX10 and BRG1 ```{r SOX10_BRG1_kmeans_Hmap, warning=FALSE, message=FALSE, eval=TRUE, fig.align='center', cache=TRUE} ## Create a vector of the colors that will be used for the clusters clustcolors <- brewer.pal(5,"Set1") names(clustcolors) <- 1:5 ## Create a legend object for the partition lgd <- Legend(at = paste0("cluster", 1:5), title = "Clusters", type = "lines", legend_gp = gpar(col = clustcolors)) ## In the plot we want: ## 1) the partition in cluster 1 to 8 ## 2) the heatmap of BRG1 with the average profiles for the different clusters ## 3) the heatmap of SOX10 with the average profiles for the different clusters ## Define the heatmap list with these 3 objects: ht_list <- Heatmap(part5clust, col = clustcolors, cluster_rows = FALSE, name = " ", width = unit(3, "mm"), show_heatmap_legend = FALSE) + EnrichedHeatmap(normat$BRG1, name = "BRG1", col=col_brg1, top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = clustcolors))), column_title = "BRG1", axis_name = c("-3kb", "SOX10\npeak center", "+3kb")) + EnrichedHeatmap(normat$SOX10, name = "SOX10", col=col_sox10, top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = clustcolors))), column_title = "SOX10", axis_name = c("-3kb", "SOX10\npeak center", "+3kb")) ## Draw the heatmap while spliting the rows based on the clusters: draw(ht_list, split = part5clust, annotation_legend_list = list(lgd), ht_gap = unit(c(2, 8, 8), "mm")) ```
# Analyzing ChIP-seq data together with RNA-seq data In this paragraph, we will include gene expression data from RNA-seq in our analyses. Let's import the processed data. ```{r loadRNaseq_results, include=TRUE, eval=TRUE, cache=TRUE} # load RNA-seq normalized expression data RNAseq <- readRDS(file.path(params$workdir, "data", "RNAseq", params$rnaseq)) # RNAseq head(RNAseq) ``` RNA-seq columns: - Ensembl.Gene.ID: Gene idenfier from ensEMBL (ENSG\*\*\*\*\*) - WT.norm: normalized expression for wild type cell lines - shBGR1.norm: normalized expression for shBRG1 cell lines - logCPM: average log transformed overall expression - logFC: log fold-change between the two conditions - FDR: multiple-testing corrected p-value from edgeR statistical test - signif (up, down, stable): significance was set as: $FDR < 0.01$, $logCPM > 5$, $abs(logFC) > 1$ Reads from RNA-seq were assigned to genes annotated in ensEMBL and thus are identified with ensEMBL ID. However, in the TxDb package that we loaded, the genes are annotated with ENTREZ IDs. Thus, we need to map the identifiers to each other. > **_NOTE:_**: it is generally a good idea to stick to a single annotation file > to analyze all your data. But sometimes, you get data from other sources (a > colleague, a public database, etc.) so you don't have the choice. We use the `OrgDB` object for the human genome to do this. The available information are displayed using `columns()`. ```{r, include=FALSE, eval=FALSE} columns(org.Hs.eg.db) ``` We then use the function `mapIds` from the `AnnotationDbi` package with the keys `ENSEMBL` and `ENTREZID`. You can see that other identifiers are available from this object (e.g: SYMBOL, REFSEQ, UNIPROT etc).
Show code: map ENSEMBL IDs to ENTREZ IDs ```{r mapENSEMBLid2ENTREZid, include=TRUE, eval=TRUE, cache=TRUE} conv <- mapIds(x = org.Hs.eg.db, keys = RNAseq$Ensembl.Gene.ID, # what we want to map keytype = "ENSEMBL", # what type of IDs are the keys column = "ENTREZID") # which type of ID we want head(conv) # names of the vector are ENSEMBL IDs and elements are ENTREZ IDs ``` The warning indicates that there is not a perfect one-to-one match between ENSEMBL ids and ENTREZ ids.
Let's take a closer look at the mappings.
Show code: Explore the conv object ```{r checkIDmappings, eval=TRUE, cache=TRUE} length(conv) #number of ENSEMBL ids that were mapped (all the rows of RNAseq) sum(is.na(conv)) #number of ENSEMBL ids that could not be mapped to an ENTREZ ID length(conv[!is.na(conv)]) # Number of ENSEMBL ids that were mapped to an ENTREZ ID length(unique(conv[!is.na(conv)])) # Number of unique ENTREZ IDs that corresponded to at least one ENSEMBL ID table(conv[!is.na(conv)])[table(conv[!is.na(conv)]) > 1] # ENTREZ IDs that were mapped to more than one ENSEMBL IDS ```
The names of the `conv` object should be in the same order as in the column `Ensembl.Gene.ID` of the `RNAsseq` table. We verify it and then assign the `conv` vector as a new `ENTREZID` column in the `RNAseq` dataset Let's take a closer look at the mappings.
Show code: Create an *ENTREZID* column in the *RNASeq* data frame ```{r AddENTREZIDcolumn, eval=TRUE, cache=TRUE} identical(names(conv), RNAseq$Ensembl.Gene.ID) RNAseq$ENTREZID <- conv ```
We will now focus on the TSS of the genes and take a look at the binding of the factors around the TSS. First, we extract a `GRanges` object with all the TSS.
Show code: Create a *GRanges* object with the TSS for all genes ```{r extractTSS, eval=TRUE, cache=TRUE} tss <- promoters(genes(txdb), upstream = 0, downstream = 1) ```
Then, we filter this `tss` object to keep only the TSS for which we have gene expression data. ```{r filter TSS, eval=TRUE, cache=TRUE} tss <- tss[names(tss) %in% RNAseq$ENTREZID[!is.na(RNAseq$ENTREZID)]] ``` Let's sort the TSS by average gene expression in the cell lines (`logCPM` column). We use the `match` function to find a match between the names of `tss` and the corresponding row in the `RNAseq` dataset ```{r addlogCPM_to_TSS, eval=TRUE, cache=TRUE} ## Add a metadata column with the logCPM from the RNAseq dataset mcols(tss)$logCPM <- RNAseq$logCPM[match(names(tss), RNAseq$ENTREZID)] ## sort tss by decreasing logCPM tss <- tss[order(tss$logCPM, decreasing = TRUE)] ``` Import the read coverage for the 3 factors at the TSS
Show code: Import the read coverage for SOX10, BRG1 and MITF at the TSS ```{r cvg_tss, eval=TRUE, cache=TRUE} cvg_tss <- list() cvg_tss[["SOX10"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$sox10.bw), windows = tss + 4000) cvg_tss[["BRG1"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$brg1.bw), windows = tss + 4000) cvg_tss[["MITF"]] <- readBigWig(file.path(params$workdir, "data", "ChIPseq", params$mitf.bw), windows = tss + 4000) ```
Now, we use *genomation* to produce the heatmaps of the different factors in regions covering +/-4kb around the TSS
Show code: Plot a heatmap of the signal for the 3 factors at the TSS ```{r getTSSscoreMatrices, eval=TRUE, warning=FALSE, message=FALSE, fig.align='center', cache=TRUE} #Get the score matrices around the TSS smr_at_TSS <- ScoreMatrixList(target = cvg_tss, windows = tss + 4000, bin.num = 400) #plot the heatmap multiHeatMatrix(smr_at_TSS, xcoords = c(-4000, 4000), winsorize = c(0, 99.95)) ```
There doesn't seem to be a very clear association between the binding of the 3 factors and the gene expression level as measured by `logCPM`. Note that `logCPM` is an imperfect measure of the relative gene expression when comparing different genes because it doesn't take into account the length of the transcripts. Anyway, there also seems to be different patterns of BRG1 binding around the TSS. So we are going to use kmeans to cluster the patterns of BRG1 signal into 4 groups.
Show code: Partition de TSS into 4 groups based on BRG1 signal ```{r km5_brg1_tss, eval=TRUE, cache=TRUE} set.seed(123) km4_tss <- kmeans(as(smr_at_TSS[["BRG1"]], "matrix"), centers = 4) part_km4_tss <- km4_tss$cluster ```
Take a look at the partition that was made.
Show code: Plot a heatmap of the signal for the 3 factors for the partitioned TSS ```{r multiHeatmap_TSS_km4, eval=TRUE, cache=TRUE, fig.align='center'} #plot the heatmap multiHeatMatrix(smr_at_TSS, xcoords = c(-4000, 4000), winsorize = c(0, 99.95), group = factor(part_km4_tss)) ```
The 4 groups roughly correspond to 1. high and focused BRG1 binding at the TSS 2. diffuse binding around the TSS 3. focused binding at the TSS 4. no binding at the TSS We would like to know if there is some link between gene expression in WT controls and shBRG1 cells and this partition of the BRG1 binding at the TSS. To do this, we can add the `WT.norm` and `shBRG1` columns to the metadata of the `tss` object. We will also add the `signif` columns that we will need later. ```{r addcols_to_tss, eval=TRUE, cache=TRUE} ## first get the match between the names of tss and the ENTREZIDs in the RNAseq table mm <- match(names(tss), RNAseq$ENTREZID) ## then add the columns to the metadata of tss mcols(tss)$WT.norm <- RNAseq$WT.norm[mm] mcols(tss)$shBGR1.norm <- RNAseq$shBGR1.norm[mm] mcols(tss)$signif <- RNAseq$signif[mm] ## also add the info about the clusters mcols(tss)$clusters <- paste0("cluster", part_km4_tss) ``` Now we can extract a data frame from the `tss` object and reformat it to make a boxplot with ggplot2.
Show code: Reformat the `mcols` of the *tss* object to prepare it for ggplot2 ```{r boxplot_expr, eval=TRUE, cache=TRUE} exprdf <- tidyr::pivot_longer(as.data.frame(mcols(tss))[, c("WT.norm", "shBGR1.norm", "clusters")], cols = c("WT.norm", "shBGR1.norm"), names_to = "condition", values_to = "expression") head(exprdf) ```
And make the boxplot with ggplot2.
Show code: Make a boxplot with ggplot2 ```{r boxplot_exprdf, eval=TRUE, cache=TRUE, fig.align='center'} ggplot(exprdf, aes(x = clusters, y = log2(expression+1), color = condition)) + geom_boxplot() + scale_color_npg() + theme_bw() + theme(legend.position = "top") ```
We see that: - cluster 3, which has low or no BRG1 binding corresponds to genes with low expression - genes in cluster 2 which have diffused BRG1 binding at the TSS tend to have slightly higher expression upon BRG1 knockdown Let's see if the distribution of "significantly" differentially expressed genes differs among clusters.
Show code: Make a barplot for the `signif` variable for each cluster ```{r barplot_exprdf, eval=TRUE, cache=TRUE, fig.align='center'} ggplot(as.data.frame(mcols(tss)), aes(x = clusters, fill = signif)) + geom_bar(position = "fill", color="black") + scale_fill_manual(values=c(up="red",down="blue",stable="white")) + theme_bw() ```
We confirm that cluster2 has more significantly upregulated genes than the other clusters. # Enrichment of functional categories and pathway analyses ## GO enrichment analysis of genes in cluster 2 We want to know if the genes in cluster 2, which are characterized by a diffuse binding pattern of BRG1 at their TSS and a tendency to be upregulated upon BRG1 knockdown, are enriched for specific Gene Ontology ([GO](https://geneontology.org/)) Biological Process annotations. The [clusterprofiler](https://www.bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package provides very useful functions to perform either classical enrichment analysis of functional categories in gene lists or gene set enrichment analyses. Here we will use the classical approach, based on the hypergeometric distribution, called "enrichment analysis" or "GO over-representation analysis" in the terminology of *clusterprofiler* documentation. \ ![Représentation simplifiée graphique du test hypergéometrique](images/hypergeometric.png) We can easily extract the IDs of the genes in cluster2 using for example: - `names(tss)[tss$clusters == "cluster2"]` The main question that we need to answer is what is the gene universe? In general, you should think as the "gene universe" as the set of genes that could potentially have been selected in your gene list. Since we did different filters, e.g. by selecting genes present in the RNAseq dataset or that had a match between the ENSEMBLID and the ENTREZID, our gene universe does NOT encompass all the genes in the genome. In our case, a good definition of the gene universe would be all the genes in the tss object, since we performed kmeans on these genes to identify genes in cluster 2. So now, let's use this info to compute the GO over-representation statistics: ```{r GOenrich_cluster2, eval=TRUE, cache=TRUE} # load the library library(clusterProfiler) # Run the enrichment analysis ego <- enrichGO(gene = names(tss)[tss$clusters == "cluster2"], universe = names(tss), OrgDb = org.Hs.eg.db, ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05, readable = TRUE) ``` And take a look at the results: ```{r GOenrich_cluster2_res, eval=TRUE, cache=TRUE} head(ego) ``` ## KEGG enrichment analysis in DE genes from the different clusters Now let's perform the same type of analysis but focusing on the differentially expressed (DE) genes present in each of the 4 clusters. Here, we're asking if DE genes that have different BRG1 binding patterns at their TSS are enriched in specific KEGG pathways. The *clusterProfiler* package provides the function `compareCluster` that allows to analyze and compare enrichment in different group of genes. This function recognizes ENTREZ identifiers. ```{r funcEnr, include=TRUE, eval=TRUE, message=FALSE, warning=FALSE, cache=TRUE} # Select genes whose gene expression is significantly changing # between conditions geneList <- tss[tss$signif != "stable",] # The function split can split a vector to a list following categories # in an other vector (take a look at the geneList object after running split). geneList <- split(names(geneList), geneList$clusters) # Now perform a KEGG pathway over-representation analysis compKEGG <- compareCluster(geneCluster = geneList, fun = "enrichKEGG", pvalueCutoff = 0.05, pAdjustMethod = "BH") # Plot the results dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis") ``` Let's focus on cluster 4 and visualize which genes of the the KEGG pathway "Apoptosis" are both differentially expressed and bound by BRG1 and MITF. We obtain the identifier of the pathway by looking at the `compKEGG` object: - `compKEGG@compareClusterResult` So the ID of the KEGG pathway "Apoptosis" is *hsa04210*. We use the `pathview` package which maps differential gene expression values to KEGG pathway maps and create a PNG file in your current directory that you get with `getwd()`. ```{r funcEnrcl4, eval=FALSE, message=FALSE, warning=FALSE} # load the library library(pathview) # Get the names of the genes in cluster 4 EntrezID.cl4 <- names(tss)[tss$clusters == "cluster4"] # match these EntrezIDs to the RNAseq dataset mmid <- match(EntrezID.cl4, RNAseq$ENTREZID) # Get the log fold-change for these genes from the RNAseq table logFC.cl4 <- RNAseq$logFC[mmid] names(logFC.cl4) <- RNAseq$ENTREZID[mmid] # Create an image of the pathway with colors for differential gene expression pathview(logFC.cl4, pathway.id = "hsa04210", species = "hsa") ``` ![Pathview](images/hsa04210.pathview.png) # Motif analysis Often, one of the task associated with ChIP-seq experiments consists in analyzing the sequences that underlie the peaks. The aim could be to identify the binding motif of the factor that was immunoprecipitated, or to evaluate if other factors could bind the same regions. Here we will illustrate some basic tasks associated with these aims. First, load the *Biostrings* library and the *BSgenome* library corresponding to the hg38 human genome. ```{r loadLibs_BSgenome, eval=FALSE} ## Load the Biostrings library and the BSgenome library corresponding to hg38 library(Biostrings) library(BSgenome.Hsapiens.UCSC.hg38) ## A "Hsapiens" object allows to access the genome sequence ``` ## De novo motif discovery Here, we will search for motifs enriched under the SOX10 peaks. Since we do not start from a pre-defined list of motifs, e.g. available in a database like [JASPAR](https://jaspar.elixir.no/) or [CIS-BP](https://cisbp.ccbr.utoronto.ca/), this is called a *de novo* search. We start by sorting SOX10 peaks by qvalue and keep the sequences corresponding to the peak centers +/- 150bp: ```{r sox10peaks_300bp, eval=TRUE, cache=TRUE} sox10peaks_300bp <- resize(peakfilt[["SOX10"]][order(peakfilt[["SOX10"]]$qvalue, decreasing = TRUE)], width = 300, fix = "center") ``` Next, we extract the underlying sequences and obtain a `DNAStringSet` object: ```{r sox10seq, eval=TRUE, cache=TRUE} sox10_seq <- getSeq(Hsapiens, sox10peaks_300bp) ``` Most of the time, you will want to export these sequences as a fasta file and run an external motif discovery program such as: - [MEME](https://meme-suite.org/meme/) or - [RSAT](https://rsat.france-bioinformatique.fr/) Note that the [TFBStools](https://www.bioconductor.org/packages/release/bioc/html/TFBSTools.html) package has a `runMEME` function that is a wrapper to MEME. To export your the sequences you can use: ```{r export_sox10_seq, eval=FALSE} ## Define the path to the fasta file: myfastafile = file.path("data", "sox10peaks_300bp.fa") ## NOT RUN (already exported): export sequences in fasta format writeXStringSet(sox10_seq, myfastafile) ``` Here, we will use the [BCRank](https://www.bioconductor.org/packages/release/bioc/html/BCRANK.html) package which can also perform a de novo search for motifs within R. The method takes a list of DNA sequences sorted by some measure of the probability that a relevant motif is present. Here, the qvalue plays the role of this measure, which is why we sorted the peaks by qvalue. `BCrank` takes a long time to run, so we will not run the commands below. ```{r bcrank, eval=FALSE} ##NOT RUN (too long...) set.seed(123) #for reproducibility BCRANKout <- bcrank(myfastafile, restarts = 25, silent = TRUE, use.P1 = TRUE, use.P2 = TRUE) saveRDS(BCRANKout, file.path("data", "BCRANKout.rds")) ``` Instead, we ran the commands for you and saved the object as an rds file. So we can just import this file and take a look at it: ```{r bcrankout_import, eval=TRUE, cache=TRUE} BCRANKout <- readRDS(file.path("data", "BCRANKout.rds")) BCRANKout ``` We can extract the top motif found by BCRANK with: ```{r bcrank_topmotif, eval=TRUE, cache=TRUE} topMotif <- toptable(BCRANKout, 1) ``` and extract the Position Weight Matrix (PWM) for this motif and plot the sequence logo with: ```{r pwm_topmotif_bcrank, eval=TRUE, fig.align='center', cache=TRUE} pwmnorm_topmotif <- BCRANK::pwm(topMotif, normalize = TRUE) #with normalize=TRUE the columns will sum to 1 (PFM) library(seqLogo) seqLogo(reverseComplement(pwmnorm_topmotif)) ``` You can go to the [JASPAR](https://jaspar.elixir.no/) website and search "SOX10". You will see that the first motif found by BCRANK is very similar to the motifs that have been found for SOX10. We will actually retrieve and take a look at one of these motifs below. ## Searching for known motifs Another approach consists in searching for known motifs in the ChIP-seq peaks. Since the SOX10 has been extensively studied, there are already SOX10 binding motifs available in the databases. ```{r loadMotifDb, message=FALSE, warning=FALSE, eval=FALSE} library(MotifDb) ``` Search for SOX10 motifs in JASPAR databases ```{r motifdb, eval=TRUE, fig.align='center',cache=TRUE} sox10_motifs = MotifDb::query(MotifDb, c("hsapiens", "jaspar", "sox10")) ## We keep the motif from JASPAR 2022 Sox10JASP22 <- sox10_motifs[["Hsapiens-jaspar2022-SOX10-MA0442.2"]] seqLogo(Sox10JASP22) ``` MotifDb returns a position frequency matrix (PFM) with columns summing to 1. The number of sequences used to build the PFM are available in the elementMetadata or mcols of the object. We use this information to obtain a PWM: ```{r Sox10_PWM, eval=TRUE, cache=TRUE} nseq <- as.integer(mcols(sox10_motifs["Hsapiens-jaspar2022-SOX10-MA0442.2"])$sequenceCount) sox10pfm <- apply(round(nseq * Sox10JASP22,0), 2, as.integer) rownames(sox10pfm) <- rownames(Sox10JASP22) Sox10JASP22_PWM <- PWM(sox10pfm) ``` And then use this PWM to screen the whole genome. Again, this takes some time so we saved the rds object for you ```{r matchPWM_SOX10_Hsapiens, eval=FALSE} ## NOT RUN (too long...) SOX10_hg38 <- matchPWM(Sox10JASP22_PWM, Hsapiens, min.score="90%") saveRDS(SOX10_hg38, file.path("data", "SOX10_hg38.rds")) ``` We obtain a GRanges with the location of all the motifs: ```{r SOX10_hg38_import, eval=TRUE, cache=TRUE} SOX10_hg38 <- readRDS(file.path("data", "SOX10_hg38.rds")) SOX10_hg38 ``` We can now overlap these motifs with our ChIPseq peaks: ```{r ovl_SOX10hg38_peaks, eval=TRUE, cache=TRUE} ## We set a minimum overlap of 11bp (size of the motif) to count only ## the motifs that are fully within the peaks count_sox10 <- countOverlaps(sox10peaks_300bp, SOX10_hg38, minoverlap = 11) ## Number of peaks that have 0, 1, 2, etc. SOX10 binding sites table(count_sox10) ## Percentage of our peaks that have at least one SOX10 binding site: 100*mean(count_sox10>=1) ``` We see that more than `r round(100*mean(count_sox10>=1), 1)`% of the peaks contain at least one of these motifs. Assessing whether our peaks are significantly enriched for SOX10 motifs would probably require to extract random sequences that share common characteristics with SOX10 ChIP-seq peaks in terms of size and genomic distribution and evaluate their enrichment in SOX10 binding sites. Here as a substitute, let's evaluate the frequency of SOX10 peaks in a random selection of promoters ```{r count_sox10_tss, eval=TRUE, cache=TRUE} set.seed(123) # for reproducibility count_sox10_tss <- countOverlaps(promoters(genes(txdb), upstream=300, downstream=0)[sample.int(length(sox10peaks_300bp))], SOX10_hg38, minoverlap = 11) ## Percentage of 300bp promoters that contain SOX10 peaks: 100*mean(count_sox10_tss>=1) ``` This suggests that the enrichment of SOX10 binding sites in our ChIP-seq peaks is very significant. # Enriching your analyses with published data There are multiple ways to access public / published datasets and multiple sources that you can mine in order to compare your own data with what other groups have observed. The following databases are only some examples: -The [(mod)ENCODE](https://www.encodeproject.org/chip-seq-matrix/?type=Experiment&replicates.library.biosample.donor.organism.scientific_name=Homo%20sapiens&assay_title=TF%20ChIP-seq&status=released) project - The [UCSC](https://genome-euro.ucsc.edu/cgi-bin/hgGateway?redirect=manual&source=genome.ucsc.edu) genome browser - The [GEO](https://www.ncbi.nlm.nih.gov/geo/) database and the [SRA](https://www.ncbi.nlm.nih.gov/sra) - The [ENA](https://www.ebi.ac.uk/ena/browser/home) database (which also replicates GEO) and the [ArrayExpress](https://www.ebi.ac.uk/biostudies/arrayexpress) database - The [TCGA](https://www.cancer.gov/ccg/research/genome-sequencing/tcga) - The [Cistrome](https://cistrome.org/db/) data browser - The [ReMap](https://remap.univ-amu.fr/) database - The [ChIP atlas](https://chip-atlas.org/) - etc... ## Mining GEO ChIP-seq data with `ChIPSeeker` There Gene Expression Omnibus (GEO) contains a vast number of ChIP-seq datasets. We can compare our own dataset to those deposited in GEO to search for significant overlap data. Significant overlap of ChIP seq data by different binding proteins may be used to infer cooperative regulation and thus can be used to generate hypotheses. The [ChIPseeker](https://www.bioconductor.org/packages/release/bioc/html/ChIPseeker.html) package [@pmid36286622] package provides useful functions to download and use GEO datasets in R / Bioconductor. We can collect >100,000 bed files deposited in GEO. You can use `getGEOspecies` to get the number of files available for different species. ```{r geospecies, eval=TRUE, message=FALSE, warning=FALSE} getGEOspecies()[1:25,] ``` You can also use the `getGEOgenomeVersion` to get the same information but based on genome version used to analyze the data. You can access the detailed information using the `getGEOInfo` function, for each genome version. ```{r, eval=TRUE, cache=TRUE} hg38 <- getGEOInfo(genome = "hg38", simplify = TRUE) head(hg38) ``` ChIPseeker provides the function `downloadGEObedFiles` to download all the bed files of a particular genome. ```{r, eval=FALSE} downloadGEObedFiles(genome = "hg38", destDir = "hg38") ``` Or a vector of GSM accession number by `downloadGSMbedFiles`. ```{r, eval=FALSE} gsm <- hg38$gsm[sample(nrow(hg38), 10)] downloadGSMbedFiles(gsm, destDir = "hg38") ``` After the download of bed files from GEO, we can pass them to `enrichPeakOverlap` for testing the significance of the overlaps. Parameter targetPeak can be the folder, e.g. hg38, that containing bed files. `enrichPeakOverlap` will parse the folder and compare all the bed files. It is possible to test the overlap with bed files that are mapped to different genome or different genome versions: `enrichPeakOverlap` provides a parameter `chainFile` that can pass a chain file and liftOver the `targetPeak` to the genome version consistent with `queryPeak.` Significant overlaps can be used to generate hypotheses on cooperative regulation. By mining the data deposited in GEO, we can identify putative complexes or interacting regulators of gene expression or chromatin remodeling for further validation. ## The `AnnotationHub` The *AnnotationHub* is a web resource that provides a central location where genomic files in different formats (e.g. vcf, bed, wig, etc.) and other resources coming from different locations (e.g. UCSC, ENSEMBL, ENCODE, etc.) can be mined and downloaded to enrich your analyses. Importantly, the resource that are available include **metadata** with e.g. description of the experiments, tags and date of modification which are important to figure out where exactly the data comes from and if it fits with the aim of your analyses. The [AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html) package is a client for R/Bioconductor to search these resources and download the data in appropriate formats for their analysis in the R/Bioconductor environment. The client creates and manages a local cache of files that are retrieved by the user, which allows quick and reproducible access to the data. First we need to create an `AnnotationHub` instance. This will create a local cache on your machine so that repeat queries for the same information (even in different R sessions) will be very fast. By default, the cache will be created in your home folder, which may not be appropriate in a shared server if you have a limited quota of space in your home directory. You can change the location of the cache with: ```{r setAnnotationHubOption_CACHE, eval=FALSE} ## NOT RUN: example of how to set the path to the cache setAnnotationHubOption("CACHE") <- "/shared/projects/projectname/AnnotationHub" ## to check that the path was set correctly: getAnnotationHubOption("CACHE") ``` Then we create the `AnnotationHub` instance with: ```{r AnnotationHub_create, eval=FALSE} ##NOT RUN (too long...) ah <- AnnotationHub() # |=============================================================================| 100% # # snapshotDate(): 2024-10-28 ``` This will take some time the first time you run it since there are a lot of resources available. The `ah` object is quite big but it only contains pointers to online data, not the data itself which would be way too big to download all at once. The information content is changing frequently, which is why there is a `snapshotDate`. Let's take a look at the `ah` object: ```{r ah, eval=FALSE} ##NOT RUN ah # AnnotationHub with 72100 records # # snapshotDate(): 2024-10-28 # # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,ENSEMBL,CELLPHONEDB,... # # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norvegicus, Bos taurus, Pan troglodytes, Danio rerio, Gallus gallus, Sus scrofa, Mon... # # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFile, ChainFile, TxDb, Inparanoid8Db # # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, rdatapath, # # sourceurl, sourcetype # # retrieve records with, e.g., 'object[["AH5012"]]' # # title # AH5012 | Chromosome Band # AH5013 | STS Markers # AH5014 | FISH Clones # AH5015 | Recomb Rate # AH5016 | ENCODE Pilot # ... ... # AH119506 | Ensembl 113 EnsDb for Zonotrichia albicollis # AH119507 | Ensembl 113 EnsDb for Zalophus californianus # AH119508 | Ensembl 113 EnsDb for Zosterops lateralis melanops # AH119518 | MassBank CompDb for release 2024.06 # AH119519 | MassBank CompDb for release 2024.11 ``` As of `2024-10-28`, the AnnotationHub contains **72,100 records**. The `ah` object is organized as a vector (so its `length` is 72100 in my case). You can get information about a single resource by indexing with a single `[`; using two brackets (`[[`) downloads the object: ```{r ah1, eval=FALSE} ##NOT RUN ah[1] # AnnotationHub with 1 record # # snapshotDate(): 2024-10-28 # # names(): AH5012 # # $dataprovider: UCSC # # $species: Homo sapiens # # $rdataclass: GRanges # # $rdatadateadded: 2013-03-26 # # $title: Chromosome Band # # $description: GRanges object from UCSC track 'Chromosome Band' # # $taxonomyid: 9606 # # $genome: hg19 # # $sourcetype: UCSC track # # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cytoBand # # $sourcesize: NA # # $tags: c("cytoBand", "UCSC", "track", "Gene", "Transcript", "Annotation") # # retrieve record with 'object[["AH5012"]]' ``` Now, we can use various tools to narrow down the dataset or datasets that are actually of interest for our analyses. Let's say that we are looking for a dataset describing regions in the human genome that are potential distal enhancers. The ENCODE project has typically defined such regions using the data they have produced on e.g. histone marks H3K27ac, H3K4me3 or DNAseI accessibility. Let's take a look at the data providers: ```{r ah_provider, eval=FALSE} ##NOT RUN unique(ah$dataprovider) # [1] "UCSC" # [2] "Ensembl" # [3] "RefNet" # [4] "Inparanoid8" # ... # [72] "ENCODE" # ... # [83] "MGI" # [84] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/" ``` Let's take a look at the available data from encode: ```{r ah_encode, eval=FALSE} ##NOT RUN ## Number of datasets available from ENCODE (note that there are more since in ## "dataprovider" there is also e.g. "ENCODE Project" or "ENCODE SCREEN v3" sum(ah$dataprovider == "ENCODE") # [1] 20 # So there are 20 datasets from the dataprovider "ENCODE" ## Take a look at the titles of these datasets ah[ah$dataprovider == "ENCODE"]$title # [1] "hg38.Kundaje.GRCh38_unified_Excludable" "hg38.Bernstein.Mint_Excludable_GRCh38" # [3] "hg38.Kundaje.GRCh38.Excludable" "hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38" # [5] "hg38.Wold.hg38mitoExcludable" "hg38.Yeo.eCLIP_Excludableregions.hg38liftover.bed.fixed" # [7] "hg19.Bernstein.Mint_Excludable_hg19" "hg19.Birney.wgEncodeDacMapabilityConsensusExcludable" # [9] "hg19.Crawford.wgEncodeDukeMapabilityRegionsExcludable" "hg19.Wold.hg19mitoExcludable" # [11] "hg19.Yeo.eCLIP_Excludableregions.hg19" "mm10.Hardison.Excludable.full" # [13] "mm10.Hardison.psuExcludable.mm10" "mm10.Kundaje.anshul.Excludable.mm10" # [15] "mm10.Kundaje.mm10.Excludable" "mm10.Wold.mm10mitoExcludable" # [17] "mm9.Wold.mm9mitoExcludable" "ENCODE_dELS_regions" # [19] "ENCODE_pELS_regions" "ENCODE_PLS_regions" ``` I wonder what *"PLS_regions"* are `r emo::ji("thinking_face")` and I want to know more. So let's take a look at the object description: ```{r ah_pls_desc, eval=FALSE} ##NOT RUN ah[ah$dataprovider == "ENCODE" & ah$title == "ENCODE_PLS_regions"]$description # [1] "A GRanges object containing regions of candidate cis-regulatory elements with promoter-like signatures # as identified by the ENCODE SCREEN project. These consist of regions with high H3K4me3 and DNase signal, and # located within 200 bp of a GENCODEtranscription start site." ``` OK, this looks very close to what we are looking for and I suppose that *pELS* and *dELS* refer to *proximal* and *distal* enhancer-like signatures. Let's verify: ```{r ah_dELS_desc, eval=FALSE} ##NOT RUN ah[ah$dataprovider == "ENCODE" & ah$title == "ENCODE_dELS_regions"]$description # [1] "A GRanges object containing regions of candidate cis-regulatory elements with distal enhancer-like # signatures as identified by the ENCODE SCREEN project. These consist of regions with high H3K27ac and # DNase signal, but low H3K4me3 signal, and located more than 2kb from GENCODE transcription start sites." ``` So now we can get the names of the object, download it and use it in R (it's a *GRanges* object): ```{r dwnld_, eval=FALSE} ##NOT RUN ## Get the name / ID of the dataset names(ah[ah$dataprovider == "ENCODE" & ah$title == "ENCODE_dELS_regions"]) # [1] "AH116727" ## Download the dataset (use double brackets `[[`) dELS <- ah[["AH116727"]] # downloading 1 resources # retrieving 1 resource # |===========================================================================| 100% # # loading from cache # require(“GenomicRanges”) ## Take a look at the object: dELS # GRanges object with 786756 ranges and 0 metadata columns: # seqnames ranges strand # # [1] chr1 271227-271468 * # [2] chr1 274330-274481 * # [3] chr1 605331-605668 * # [4] chr1 727122-727350 * # [5] chr1 807737-807916 * # ... ... ... ... # [786752] chrY 26319537-26319816 * # [786753] chrY 26319848-26320155 * # [786754] chrY 26363541-26363721 * # [786755] chrY 26670949-26671287 * # [786756] chrY 26671293-26671478 * # ------- # seqinfo: 24 sequences from an unspecified genome; no seqlengths ## Since seqinfo has not been set, you may wonder which genome version was used. ## You can get this info from the metadata of the ah object: ah['AH116727']$genome # [1] "hg38" ``` In another example, we will use some useful function to search the AnnotationHub: - `subset` (or `[`) to do a specific subseting operation. - `query` to do a command-line search over the metadata of the hub. - `display` to get a Shiny interface in a browser, so you can browse the object. We would like to know if there are any other dataset available for MITF in human. We first filter the hub to keep only the datasets obtained in human, or the datasets where the *species* is unspecified (`NA`), in case some metadata is missing. ```{r ah_subset, eval=FALSE} ##NOT RUN ah <- subset(ah, species %in% c(NA, "Homo sapiens")) ``` Then we query the metadata for MITF: ```{r ah_mitf, eval=FALSE} ##NOT RUN query(ah, "MITF") # AnnotationHub with 1 record # # snapshotDate(): 2024-10-28 # # names(): AH22280 # # $dataprovider: Pazar # # $species: NA # # $rdataclass: GRanges # # $rdatadateadded: 2015-03-09 # # $title: pazar_MITF_Strub_20120522.csv # # $description: TF - Target Gene file from pazar_MITF_Strub_20120522 # # $taxonomyid: NA # # $genome: NA # # $sourcetype: CSV # # $sourceurl: http://www.pazar.info/tftargets/pazar_MITF_Strub_20120522.csv # # $sourcesize: 34964 # # $tags: c("Pazar", "Transcription Factors") # # retrieve record with 'object[["AH22280"]]' ``` Indeed this dataset does not contain all the metadata but there is a link to the original source file and searching the web for "MITF, Strub" quickly gets you to [this publication](https://pubmed.ncbi.nlm.nih.gov/21258399/) where all the info is available to evaluate if this dataset is relevant for us. Finally, another way to mine the AnnotationHub is to use a browser with: ```{r ah_display, eval=FALSE} ##NOT RUN outdata <- display(ah) ``` Note how we assign the call to the `display` function to an R object (`oudata`) so that we can actually capture the selection of datasets that we will make from the browser into our R session. # Importance of the annotation databases In previous annotation steps, we have annotated peaks using annotation from UCSC's knownGene. The database chosen for annotation can have an impact on subsequent results including data integration such as comparison between genes associated to peaks and gene expression using RNA-seq data for instance. Let's look at the overlap between genes expressed in RNA-seq data with genes associated to peaks (let's remind that peaks are associated to closest genes if no other evidences are used). ```{r compareAnnot, eval=TRUE, fig.align = 'center', message=FALSE, cache=TRUE} ## Download GTF files library(GenomicFeatures) ## Import BRG1 peaks annotated with Ensembl and Refseq annotation annot <- readRDS(file.path(params$workdir, "data", params$annot)) ## Here is how annotation data were downloaded # txdb.ensembl = makeTxDbFromGFF("data/ChIPseq/Homo_sapiens.GRCh38.103_UCSC_chr.gtf.gz") # txdb.refseq = makeTxDbFromUCSC(genome="hg38", tablename="refGene") ## Here is how annotation was performed # annot <- list() # annot[["ensembl"]] <- annotatePeak(peaks$BRG1, tssRegion=c(-1000, 100), TxDb=txdb.ensembl, annoDb="org.Hs.eg.db") # annot[["refseq"]] <- annotatePeak(peaks$BRG1, tssRegion=c(-1000, 100), TxDb=txdb.refseq, annoDb="org.Hs.eg.db") ## Add Gene symbols to RNAseq data conv.symbol <- mapIds(x = org.Hs.eg.db, keys = RNAseq$Ensembl.Gene.ID, # what do we want to be mapped column = c("SYMBOL"), # which type of ID we want keytype = "ENSEMBL") # what type of ID we give mEnsRS <- match(RNAseq$Ensembl.Gene.ID, names(conv.symbol)) RNAseq$SYMBOL <- conv.symbol[mEnsRS] ## Now let's create a venn diagram that compares peak annotation with ## refseq or Ensembl library(ggVennDiagram) x <- list(RNAseq = RNAseq[RNAseq$signif=="up",]$SYMBOL, Ensembl = as.data.frame(annot$ensembl)$SYMBOL, Refseq = as.data.frame(annot$refseq)$SYMBOL ) ggVennDiagram(x) + scale_fill_gradient(low="white",high = "blue") ``` # Session info ```{r sessionInfo, include=TRUE, eval=TRUE} sessionInfo() ``` # References