The Biostrings package
is the core package to manipulate sequence files in R.
Here, we will see some basic usage of this package.
First, let’s see how to import a fasta file. For this, we will use a file provided with the Biostrings package that contains the sequences 2000 upstream of the annotated transcription start sites (TSS) for the Drosophila melanogaster genome.
Get the file path:
This file contains DNA sequences so we use the readDNAStringSet
function to import it:
## DNAStringSet object of length 26454:
## width seq names
## [1] 2000 GTTGGTGGCCCACCAGTGCCA...CAAGTTTACCGGTTGCACGGT NM_078863_up_2000...
## [2] 2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201794_up_2...
## [3] 2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201795_up_2...
## [4] 2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201796_up_2...
## [5] 2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201797_up_2...
## ... ... ...
## [26450] 2000 ATTTACAAGACTAATAAAGAT...AAAATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451] 2000 GATATACGAAGGACGACCTGC...TTGTTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452] 2000 GATATACGAAGGACGACCTGC...TTGTTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453] 2000 GATATACGAAGGACGACCTGC...TTGTTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454] 2000 CGTATGTATTAGTTAACTCTG...TCAAAGTGTAAGAACAAATTG NM_001015497_up_2...
The sequences are imported as a DNAStringSet
object.
There are similar functions and containers to work with:
readRNAStringSet
gives an RNAStringSet
readAAStringSet
gives an AAStringSet
readBStringSet
gives a BStringSet
The DNAStringSet
object can be considered as a list of DNAString
objects.
We can thus extract the 3rd sequence using:
## 2000-letter DNAString object
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAGCA...CGAATTAATCGATAGATACGGAAAGTCATCCTCGAT
# or, if we know the name of the sequence:
dm3_upstream[["NM_001201795_up_2000_chr2L_8382455_f chr2L:8382455-8384454"]]
## 2000-letter DNAString object
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAGCA...CGAATTAATCGATAGATACGGAAAGTCATCCTCGAT
Here are 2 different ways to extract a subsequence from a DNAString
:
## 29-letter DNAString object
## seq: TATTTATGTAGGCGCCCGTTCCCGCAGCC
## 29-letter DNAString object
## seq: TATTTATGTAGGCGCCCGTTCCCGCAGCC
But only the subseq
command would work on a DNAStringSet
:
## DNAStringSet object of length 3:
## width seq names
## [1] 11 TTGGTGGCCCA NM_078863_up_2000...
## [2] 11 TATTTATGTAG NM_001201794_up_2...
## [3] 11 TATTTATGTAG NM_001201795_up_2...
It is also possible to extract multiple subsequences from a DNAString
using Views
:
## Views on a 2000-letter DNAString subject
## subject: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAG...AATTAATCGATAGATACGGAAAGTCATCCTCGAT
## views:
## start end width
## [1] 1 10 10 [TTATTTATGT]
## [2] 11 20 10 [AGGCGCCCGT]
## [3] 21 30 10 [TCCCGCAGCC]
Views
objects actually store the original sequence (or subject
) and the intervals or ranges
on this sequence.
Because we were working on a DNAString
object, the Views
function returned an XStringViews
object (See ?XStringViews
for details and methods).
To convert a DNAString
to a character string:
## [1] "TTATTTATGT"
# as.character also works but toString works on DNAStringSt objects too:
toString(subseq(dm3_upstream[1:2], start=1, end=10))
## [1] "GTTGGTGGCC, TTATTTATGT"
Whole genome sequences are stored into BSgenome
objects.
Many genomes are available in Bioconductor.
But you can also build your own using the BSgenomeForge package.
Let’s load and check the dm3 BSgenome package:
What is a bit confusing at the beginning is that loading this package creates an object named Dmelanogaster
:
## | BSgenome object for Fly
## | - organism: Drosophila melanogaster
## | - provider: UCSC
## | - genome: dm3
## | - release date: Apr. 2006
## | - 15 sequence(s):
## | chr2L chr2R chr3L chr3R chr4 chrX chrU
## | chrM chr2LHet chr2RHet chr3LHet chr3RHet chrXHet chrYHet
## | chrUextra
## |
## | Tips: call 'seqnames()' on the object to get all the sequence names, call
## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
## | access a given sequence, see '?BSgenome' for more information.
## [1] "chr2L" "chr2R" "chr3L" "chr3R" "chr4" "chrX"
## [7] "chrU" "chrM" "chr2LHet" "chr2RHet" "chr3LHet" "chr3RHet"
## [13] "chrXHet" "chrYHet" "chrUextra"
## 23011544-letter DNAString object
## seq: CGACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTTGATGAACCCCCCTTTCAAA
The Rsamtools also provides interesting functions to work on large indexed FASTA files (e.g. containing a whole genome sequence).
The short example below illustrates how to extract from a FASTA file a set of sequences defined by a GRanges
:
library(Rsamtools)
indFaEx_path=system.file("extdata","ce2dict1.fa",package="Rsamtools")
indFaEx=FaFile(indFaEx_path) #the whole sequence is actually not imported at this step
getSeq(indFaEx,GRanges(seqnames=Rle(c("pattern01","pattern04")),
ranges=IRanges(start=c(3,10),end=c(10,24)),
strand="*"))
## DNAStringSet object of length 2:
## width seq names
## [1] 8 GAAACTAG pattern01
## [2] 15 TTGTTGCAAATTTGA pattern04
Note that the Biostrings package also has useful functions to extract some information from fasta/fastq files without loading the file in memory. See help(fasta.seqlengths, package = "Biostrings")
for the corresponding help page.
Reverse complement:
## DNAStringSet object of length 2:
## width seq names
## [1] 2000 ACCGTGCAACCGGTAAACTTGCC...TTTGGCACTGGTGGGCCACCAAC NM_078863_up_2000...
## [2] 2000 ATCGAGGATGACTTTCCGTATCT...GGAACGGGCGCCTACATAAATAA NM_001201794_up_2...
Count the occurence of each nucleotide:
## A C G T other
## [1,] 0.323 0.191 0.1875 0.2985 0
## [2,] 0.300 0.207 0.2230 0.2700 0
Get the GC content of a sequence:
## C|G
## 0.41835
We will not go into the details but there are also functions to perform sequence alignment in R (pwalign package):
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: GT-----TGGTGGCCCACCAGTGCCAAAATACACA
## subject: TTATTTATGTAGGCGCCC--GTTCCCGCA---GCC
## score: -107.1475
Sequence motifs or patterns are typically used to represent the DNA regions bound by transcription factors (TFs) and other regulatory proteins.
We distinguish a pattern which is a sequence, possibly using the IUPAC code when there is an ambiguity between nucleotides, from a motif which is a matrix basically indicating what is the probablity to find a given nucleotide at a given position in the sequence.
There are several packages in Bioconductor to search for and identify such patterns in strings and sequences. Pattern matching is frequently used in ChIP-seq experiments to search e.g. for binding sites of transcription factors.
A complete Bioconductor workflow illustrates the use of pattern matching for the identification of transcription factor binding sites.
Here, we illustrate the use of the following packages:
Other packages of interest include:
You can search the MotifDiscovery and MotifAnnotation BioCViews to find other useful packages.
The MotifDb package allows to query a collection of DNA motifs aggregated from different databases.
We search these databases for the response element of the ecdysone receptor (EcR):
## MotifDb object of length 5
## | Created from downloaded public sources, last update: 2022-Mar-04
## | 5 position frequency matrices from 5 sources:
## | FlyFactorSurvey: 1
## | JASPAR_2014: 1
## | jaspar2016: 1
## | jaspar2018: 1
## | jaspar2022: 1
## | 1 organism/s
## | Dmelanogaster: 5
## Dmelanogaster-FlyFactorSurvey-EcR_SANGER_5_FBgn0000546
## Dmelanogaster-JASPAR_2014-EcR::usp-MA0534.1
## Dmelanogaster-jaspar2016-EcR::usp-MA0534.1
## Dmelanogaster-jaspar2018-EcR::usp-MA0534.1
## Dmelanogaster-jaspar2022-EcR::usp-MA0534.1
## 1 2 3 4 5 6 7 8
## A 0.6428571 0 0 1 0.00000000 0 0.0000000 0.0
## C 0.0000000 0 0 0 0.92857143 1 0.1428571 0.5
## G 0.0000000 0 1 0 0.07142857 0 0.0000000 0.0
## T 0.3571429 1 0 0 0.00000000 0 0.8571429 0.5
The motif is given as a Position Frequency Matrix (PFM), which, can be converted to a Position-weight matrix using the information present in the metadata of the object (number of sequences used to define the motif: see elementMetadata(EcRMotifs)
).
The package seqLogo allows to easily plot sequence logos:
##
## Attaching package: 'grid'
## The following object is masked from 'package:Biostrings':
##
## pattern
## Loading required package: motifStack
##
Here is the logo of the first motif:
(#fig:SeqLogo_Ecr)Sequence logo for EcR motif
Or its reverse complement:
(#fig:SeqLogo_Ecr_revcomp)Sequence logo for EcR motif reverse complement
And the logo of the EcR motif retrieved from JASPAR 2022:
(#fig:SeqLogo_EcrUsp)Sequence logo for EcR:Usp heterodimer
The second motif (from JASPAR) represented above is a binding site for the heterodimer composed of the ecdysone receptor (EcR) and its binding partner Ultraspiracle protein (Usp). This imperfect palindromic ecdysone response element is an inverted repeat of the consensus motif AGGTCA (or AGTTCA) separated by 1 nucleotide (IR1 for “inverted repeat separated by 1 nucleotide”).
The package motifStack provides additional plotting functionalities to plot multiple sequence logos, as illustrated below:
#Format the PFMs:
pfms <- as.list(EcRMotifs)
names(pfms) <- gsub("Dmelanogaster-", "", names(pfms))
pfms <- lapply(names(pfms),
function(x) {new("pfm",
mat=pfms[[x]],
name = x)})
#Plot
motifStack(pfms, layout="tree")
## Loading required namespace: Cairo
(#fig:motifStack_Ecr)Sequence logos for EcR motifs using motifStack
For those using ggplot2 for plotting, there is also the excellent ggseqlogo package.
A Position weight matrix (PWM) can be used to scan one or multiple “subject” sequences to search the location of the corresponding motif.
At each position, the “subject” sequence is given a score based on the values present in the PWM.
We select the JASPAR2014 motif:
MotifDb returns a PFM with columns summing to 1. The number of sequences used to build the PFM are available in the elementMetadata
or mcols
of the object. We use this information to obtain a PWM:
nseq <- as.integer(mcols(EcRMotifs[4])$sequenceCount)
ecrpfm <- apply(round(nseq * EcrJASP,0), 2, as.integer)
rownames(ecrpfm) <- rownames(EcrJASP)
EcrJASP <- PWM(ecrpfm)
Here, we use the default background nucleotide frequencies (25% for each A, T, G and C) but this should be adapted depending on the genome studied.
Search for this motif on both strands of chromosome 2L:
EcRJASP_2L=matchPWM(EcrJASP,
Dmelanogaster$chr2L,
min.score='90%')
EcRJASP_2L_rev=matchPWM(reverseComplement(EcrJASP),
Dmelanogaster$chr2L,
min.score='90%')
#One way to merge the results:
EcRJASP_2L_all <- Views(Dmelanogaster$chr2L,
union(ranges(EcRJASP_2L),
ranges(EcRJASP_2L_rev)))
We can use the matchPWM
function to search a motif directly in a BSgenome object.
In this case, the search is automatically done on both strands and the function returns a convenient GRanges object.
## GRanges object with 70653 ranges and 2 metadata columns:
## seqnames ranges strand | score string
## <Rle> <IRanges> <Rle> | <numeric> <DNAStringSet>
## [1] chr2L 8094-8108 + | 0.910677 AAAGTCAGTGAAACC
## [2] chr2L 8746-8760 + | 0.905807 GAGGTCATTAACTTT
## [3] chr2L 17627-17641 + | 0.924109 CCCTTTAATGAACTA
## [4] chr2L 19791-19805 + | 0.902867 AAAATAAATGACCCC
## [5] chr2L 20432-20446 + | 0.924109 CCCTTTAATGAACTA
## ... ... ... ... . ... ...
## [70649] chrUextra 28702900-28702914 - | 0.904921 AAGGACATTGTAATC
## [70650] chrUextra 28715402-28715416 - | 0.942264 ACATTCAATGCACTT
## [70651] chrUextra 28795481-28795495 - | 0.907944 AGGGTTATTGTCTCA
## [70652] chrUextra 28808899-28808913 - | 0.907944 AGGGTTATTGTCTCA
## [70653] chrUextra 28834409-28834423 - | 0.901153 CCAATCATTGCCTTA
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
TFBS stands for Transcription Factor Binding Site.
The TFBSTools package (Tan and Lenhard 2016) includes several features to manipulate and analyze sequence motifs or potential TFBS:
XMatrix
), collections of motifs (XMatrixList
), sets of motifs obtained with MEME (MotifSet
) or from a motifSearch on a sequence (SiteSet
)searchSeq
) and to scan aligned sequences (searchAln
) or genomes (searchPairBSgenome
)We search for the EcR:Usp binding motif in JASPAR2018 database:
Here is its sequence logo:
(#fig:SeqLogo_Ecr2018)Sequence logo for EcR motif (JASPAR2018)
Compare EcR PWM with the PWM for a human nuclear receptor ESR1, the Drosophila CTCF or a randomly permuted PWM:
#hESR1:
ESR1 <- TFBSTools::getMatrixByID(JASPAR2018, "MA0112")
PWMSimilarity(toPWM(EcR2018), toPWM(ESR1), method = "Pearson")
## [1] 0.5350265
#dCTCF
CTCF <- TFBSTools::getMatrixByID(JASPAR2018, "MA0531")
PWMSimilarity(toPWM(EcR2018), toPWM(CTCF), method = "Pearson")
## [1] 0.4987823
#random permutation:
PWMSimilarity(toPWM(EcR2018), toPWM(permuteMatrix(EcR2018)), method = "Pearson")
## [1] 0.2544939
Search for EcR motif on chr2L:
searchPairBSgenome
is another interesting function that searches for a motif in regions that are conserved between 2 genomes given as BSgenome objects.
Depending on how many patterns/motif we have and how many sequences we want to search these patterns in, different functions are available in the Biostrings package:
A pattern or motif can also be represented as a character string possibly using the IUPAC ambiguity code.
Here, for simplicity, we will represent ambiguities as ‘N’.
To perform pattern matching, we define a consensus sequence for the IR1 ecdysone response element:
## [1] "NAGTTCATTGACCTT"
EcR_IR1_cons=substring(EcR_IR1_cons, first = 2) # remove the first N that is useless here
EcR_IR1_cons
## [1] "AGTTCATTGACCTT"
Similarly, we can build consensus sequences for the EcR itself or for its binding partner Ultraspiracle (Usp):
EcR_cons=substring(consensusString(reverseComplement(EcRMotifs[[1]]),
ambiguityMap="N"),
first=2)
Usp_cons=substring(consensusString(MotifDb::query(MotifDb,"Usp")[[143]],
ambiguityMap="N"),
first=1,
last=9)
Now, we search for the EcR consensus in chromosome 2L:
EcR_on_2L = matchPattern(EcR_cons, Dmelanogaster$chr2L)
EcR_on_2L_rev = matchPattern(reverseComplement(DNAString(EcR_cons)),
Dmelanogaster$chr2L)
EcR_on_2L_all = Views(Dmelanogaster$chr2L,
union(ranges(EcR_on_2L),
ranges(EcR_on_2L_rev)))
Note that we have found a perfect palindrome (IR0):
## Views on a 23011544-letter DNAString subject
## subject: CGACAATGCACGACAGAGGAAGCAGAACAGATAT...ATATTTGCAAATTTTGATGAACCCCCCTTTCAAA
## views:
## start end width
## [1] 7208722 7208733 12 [AGGTCATGACCT]
It is also possible to search for a single pattern in several subject sequences using the vmatchPattern
function.
A typical example is searching for a pattern in promoters or ChIP-seq peaks.
Search for the EcR pattern in TSS upstream sequences:
Get the number of matches per subject element:
## nmatch_per_seq
## 0 1 2 3
## 24765 1638 48 3
Let’s take look at one of the upstream sequence with the maximum number of matches:
## Views on a 2000-letter DNAString subject
## subject: AAATATCTATTTTCTTAGTTAAGTCCCATCAAAA...CATTCAAAATAACTGTATTAGTGGACTTTTCTGG
## views:
## start end width
## [1] 1633 1639 7 [AGGTCAT]
## [2] 1809 1815 7 [AGGTCAT]
## [3] 1829 1835 7 [AGGTCAT]
One may also search for multiple patterns in a single subject sequence using matchPDict
Get all PWM matrices available for Drosophila melanogaster:
We obtain 1437 motifs.
Keep only the motifs that are 8bp-long and get their consensus sequences:
motif_ln = sapply(dm_matrices,ncol)
dm_matrices = dm_matrices[motif_ln==8]
dm_motifs = DNAStringSet(sapply(dm_matrices, consensusString, ambiguityMap="N"))
We obtain 223 motifs that are 8-bp long.
Search for all these motifs in chromosome 2L:
mot8_on_2L=matchPDict(dm_motifs,Dmelanogaster$chr2L,fixed=FALSE)
summary(lengths(mot8_on_2L)) #Number of matches
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 284 782 2905 9319 7426 200822
## IRanges object with 6 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## Dmelanogaster-cisbp_1.02-M0111_1.02 118 125 8
## Dmelanogaster-cisbp_1.02-M0111_1.02 196 203 8
## Dmelanogaster-cisbp_1.02-M0111_1.02 576 583 8
## Dmelanogaster-cisbp_1.02-M0111_1.02 654 661 8
## Dmelanogaster-cisbp_1.02-M0111_1.02 1034 1041 8
## Dmelanogaster-cisbp_1.02-M0111_1.02 1112 1119 8
The motif most frequently found on chromosome 2L:
## [1] "Dmelanogaster-cisbp_1.02-M0111_1.02"
## [1] "NTTNATNN"
The motif less frequently found on chromosome 2L:
## [1] "Dmelanogaster-FlyFactorSurvey-Adf1_FlyReg_FBgn0000054"
## [1] "CGACCGCG"
Finally, some functions are available to search for multiple patterns in multiple sequences.
Remove the motifs containing N bases and create a dictionary of motifs (the motifs must have the same length):
Search for the motifs in TSS upstream sequences:
Number of motifs found:
## [1] 721 716 716 721 3063 432 414 321 321 324 324 635 635 460 460
## [16] 2536 706 706 290 6162 2327 542 3063 724 708 5781 1579 1340 702 4384
## [31] 1939 6411 473 823 4560 742 714 952 432 432 7916 6411 414 419 1284
## [46] 1103 321 321 324 324 574 574 6411 2527 3027 2549 469 469 885 1461
## [61] 532 6411 724 1284 2527 885 6411 724 1284 2527 6411 724 1284 2527 885
## [76] 515 6411 724 1284 2527 885 515 6411 724 1284 2527 515
Number of motif1 per upstream sequence:
##
## 0 1 2
## 25760 667 27
Number of motifs per upstream sequence:
Plot the number of upstream sequences as a function of the number of motifs:
nMot8_perSeq = nMot8_perSeq[nMot8_perSeq >= 1]
plot(as.integer(names(table(nMot8_perSeq))),
as.integer(table(nMot8_perSeq)),
pch = 15, type = "b", log = "y", col = "blue",
xlab = "Number of 8bp-long motifs",
ylab = "Number of upstream sequences")
(#fig:fig_Mot8PerUpSeq)Motifs per upstream sequence.
Often, we need to export some of the sequences as fasta files, for example to analyze them in an external software.
Let’s say that we want to extract the promoter sequences that have at least forty 8bp-long motifs.
Get the name of these promoters:
We have 12 such promoter sequences.
Now let’s export them as a fasta file:
So the writeXStringSet
function is the only function you need to remember to export your DNAStringSet
or any XStringSet
object
Enjoy your sequence analyses in R !