1 Introduction

The Biostrings package is the core package to manipulate sequence files in R.
Here, we will see some basic usage of this package.

2 Containers and accessors

2.1 XString and XStringSet containers

First, let’s see how to import a fasta file. For this, we will use a file provided with the Biostrings package that contains the sequences 2000 upstream of the annotated transcription start sites (TSS) for the Drosophila melanogaster genome.

Get the file path:

dm3_upstream_filepath = system.file("extdata", "dm3_upstream2000.fa.gz", package="Biostrings")

This file contains DNA sequences so we use the readDNAStringSet function to import it:

dm3_upstream = readDNAStringSet(dm3_upstream_filepath)
dm3_upstream
## DNAStringSet object of length 26454:
##         width seq                                           names               
##     [1]  2000 GTTGGTGGCCCACCAGTGCCA...CAAGTTTACCGGTTGCACGGT NM_078863_up_2000...
##     [2]  2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201794_up_2...
##     [3]  2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201795_up_2...
##     [4]  2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201796_up_2...
##     [5]  2000 TTATTTATGTAGGCGCCCGTT...ATACGGAAAGTCATCCTCGAT NM_001201797_up_2...
##     ...   ... ...
## [26450]  2000 ATTTACAAGACTAATAAAGAT...AAAATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451]  2000 GATATACGAAGGACGACCTGC...TTGTTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452]  2000 GATATACGAAGGACGACCTGC...TTGTTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453]  2000 GATATACGAAGGACGACCTGC...TTGTTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454]  2000 CGTATGTATTAGTTAACTCTG...TCAAAGTGTAAGAACAAATTG NM_001015497_up_2...

The sequences are imported as a DNAStringSet object.
There are similar functions and containers to work with:

  • RNA sequences (T becomes U): readRNAStringSet gives an RNAStringSet
  • Protein sequences : readAAStringSet gives an AAStringSet
  • Or any type of character strings (e.g. ASCII-encoded sequence quality): readBStringSet gives a BStringSet

The DNAStringSet object can be considered as a list of DNAString objects.
We can thus extract the 3rd sequence using:

dm3_upstream[[3]]
## 2000-letter DNAString object
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAGCA...CGAATTAATCGATAGATACGGAAAGTCATCCTCGAT
# or, if we know the name of the sequence:
dm3_upstream[["NM_001201795_up_2000_chr2L_8382455_f chr2L:8382455-8384454"]]
## 2000-letter DNAString object
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAGCA...CGAATTAATCGATAGATACGGAAAGTCATCCTCGAT

Here are 2 different ways to extract a subsequence from a DNAString:

dm3_upstream[[3]][2:30]
## 29-letter DNAString object
## seq: TATTTATGTAGGCGCCCGTTCCCGCAGCC
subseq(dm3_upstream[[3]], start=2, end=30)
## 29-letter DNAString object
## seq: TATTTATGTAGGCGCCCGTTCCCGCAGCC

But only the subseq command would work on a DNAStringSet:

subseq(dm3_upstream[1:3], start=2, end=12)
## DNAStringSet object of length 3:
##     width seq                                               names               
## [1]    11 TTGGTGGCCCA                                       NM_078863_up_2000...
## [2]    11 TATTTATGTAG                                       NM_001201794_up_2...
## [3]    11 TATTTATGTAG                                       NM_001201795_up_2...

It is also possible to extract multiple subsequences from a DNAString using Views:

Views(dm3_upstream[[3]],start=c(1,11,21),end=c(10,20,30))
## Views on a 2000-letter DNAString subject
## subject: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAG...AATTAATCGATAGATACGGAAAGTCATCCTCGAT
## views:
##       start end width
##   [1]     1  10    10 [TTATTTATGT]
##   [2]    11  20    10 [AGGCGCCCGT]
##   [3]    21  30    10 [TCCCGCAGCC]

Views objects actually store the original sequence (or subject) and the intervals or ranges on this sequence.
Because we were working on a DNAString object, the Views function returned an XStringViews object (See ?XStringViews for details and methods).

To convert a DNAString to a character string:

toString(dm3_upstream[[3]][1:10])
## [1] "TTATTTATGT"
# as.character also works but toString works on DNAStringSt objects too:
toString(subseq(dm3_upstream[1:2], start=1, end=10))
## [1] "GTTGGTGGCC, TTATTTATGT"

2.2 Working with whole genome sequences: BSgenome

Whole genome sequences are stored into BSgenome objects.
Many genomes are available in Bioconductor.
But you can also build your own using the BSgenomeForge package.

Let’s load and check the dm3 BSgenome package:

library(BSgenome.Dmelanogaster.UCSC.dm3)

What is a bit confusing at the beginning is that loading this package creates an object named Dmelanogaster:

Dmelanogaster
## | BSgenome object for Fly
## | - organism: Drosophila melanogaster
## | - provider: UCSC
## | - genome: dm3
## | - release date: Apr. 2006
## | - 15 sequence(s):
## |     chr2L     chr2R     chr3L     chr3R     chr4      chrX      chrU     
## |     chrM      chr2LHet  chr2RHet  chr3LHet  chr3RHet  chrXHet   chrYHet  
## |     chrUextra                                                            
## | 
## | Tips: call 'seqnames()' on the object to get all the sequence names, call
## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
## | access a given sequence, see '?BSgenome' for more information.
names(Dmelanogaster)
##  [1] "chr2L"     "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX"     
##  [7] "chrU"      "chrM"      "chr2LHet"  "chr2RHet"  "chr3LHet"  "chr3RHet" 
## [13] "chrXHet"   "chrYHet"   "chrUextra"
Dmelanogaster[["chr2L"]] #or: Dmelanogaster$chr2L
## 23011544-letter DNAString object
## seq: CGACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTTGATGAACCCCCCTTTCAAA

2.3 Working with large fasta files: Rsamtools

The Rsamtools also provides interesting functions to work on large indexed FASTA files (e.g. containing a whole genome sequence).
The short example below illustrates how to extract from a FASTA file a set of sequences defined by a GRanges:

library(Rsamtools)

indFaEx_path=system.file("extdata","ce2dict1.fa",package="Rsamtools")
indFaEx=FaFile(indFaEx_path) #the whole sequence is actually not imported at this step
getSeq(indFaEx,GRanges(seqnames=Rle(c("pattern01","pattern04")),
                       ranges=IRanges(start=c(3,10),end=c(10,24)),
                       strand="*"))
## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]     8 GAAACTAG                                          pattern01
## [2]    15 TTGTTGCAAATTTGA                                   pattern04

Note that the Biostrings package also has useful functions to extract some information from fasta/fastq files without loading the file in memory. See help(fasta.seqlengths, package = "Biostrings") for the corresponding help page.

2.4 Some useful functions

Reverse complement:

reverseComplement(dm3_upstream[1:2])
## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]  2000 ACCGTGCAACCGGTAAACTTGCC...TTTGGCACTGGTGGGCCACCAAC NM_078863_up_2000...
## [2]  2000 ATCGAGGATGACTTTCCGTATCT...GGAACGGGCGCCTACATAAATAA NM_001201794_up_2...

Count the occurence of each nucleotide:

alphabetFrequency(dm3_upstream[1:2],
                  baseOnly=TRUE,
                  as.prob=TRUE)
##          A     C      G      T other
## [1,] 0.323 0.191 0.1875 0.2985     0
## [2,] 0.300 0.207 0.2230 0.2700     0

Get the GC content of a sequence:

letterFrequency(Dmelanogaster$chr2L, "CG", as.prob=TRUE)
##     C|G 
## 0.41835

We will not go into the details but there are also functions to perform sequence alignment in R (pwalign package):

pwalign::pairwiseAlignment(dm3_upstream[[1]][1:30],
                           dm3_upstream[[2]][1:30],
                           type = 'global')
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: GT-----TGGTGGCCCACCAGTGCCAAAATACACA
## subject: TTATTTATGTAGGCGCCC--GTTCCCGCA---GCC
## score: -107.1475

3 Motifs and pattern matching

Sequence motifs or patterns are typically used to represent the DNA regions bound by transcription factors (TFs) and other regulatory proteins.
We distinguish a pattern which is a sequence, possibly using the IUPAC code when there is an ambiguity between nucleotides, from a motif which is a matrix basically indicating what is the probablity to find a given nucleotide at a given position in the sequence.
There are several packages in Bioconductor to search for and identify such patterns in strings and sequences. Pattern matching is frequently used in ChIP-seq experiments to search e.g. for binding sites of transcription factors.
A complete Bioconductor workflow illustrates the use of pattern matching for the identification of transcription factor binding sites.

Here, we illustrate the use of the following packages:

Other packages of interest include:

You can search the MotifDiscovery and MotifAnnotation BioCViews to find other useful packages.

3.1 Searching and plotting motifs

The MotifDb package allows to query a collection of DNA motifs aggregated from different databases.

We search these databases for the response element of the ecdysone receptor (EcR):

EcRMotifs=MotifDb::query(MotifDb,"EcR")
EcRMotifs
## MotifDb object of length 5
## | Created from downloaded public sources, last update: 2022-Mar-04
## | 5 position frequency matrices from 5 sources:
## |    FlyFactorSurvey:    1
## |        JASPAR_2014:    1
## |         jaspar2016:    1
## |         jaspar2018:    1
## |         jaspar2022:    1
## | 1 organism/s
## |      Dmelanogaster:    5
## Dmelanogaster-FlyFactorSurvey-EcR_SANGER_5_FBgn0000546 
## Dmelanogaster-JASPAR_2014-EcR::usp-MA0534.1 
## Dmelanogaster-jaspar2016-EcR::usp-MA0534.1 
## Dmelanogaster-jaspar2018-EcR::usp-MA0534.1 
## Dmelanogaster-jaspar2022-EcR::usp-MA0534.1
EcRMotifs[[1]]
##           1 2 3 4          5 6         7   8
## A 0.6428571 0 0 1 0.00000000 0 0.0000000 0.0
## C 0.0000000 0 0 0 0.92857143 1 0.1428571 0.5
## G 0.0000000 0 1 0 0.07142857 0 0.0000000 0.0
## T 0.3571429 1 0 0 0.00000000 0 0.8571429 0.5

The motif is given as a Position Frequency Matrix (PFM), which, can be converted to a Position-weight matrix using the information present in the metadata of the object (number of sequences used to define the motif: see elementMetadata(EcRMotifs)).

The package seqLogo allows to easily plot sequence logos:

## 
## Attaching package: 'grid'
## The following object is masked from 'package:Biostrings':
## 
##     pattern
## Loading required package: motifStack
## 

Here is the logo of the first motif:

seqLogo::seqLogo(EcRMotifs[[1]])
Sequence logo for EcR motif

(#fig:SeqLogo_Ecr)Sequence logo for EcR motif

Or its reverse complement:

seqLogo::seqLogo(reverseComplement(EcRMotifs[[1]]))
Sequence logo for EcR motif reverse complement

(#fig:SeqLogo_Ecr_revcomp)Sequence logo for EcR motif reverse complement

And the logo of the EcR motif retrieved from JASPAR 2022:

seqLogo::seqLogo(EcRMotifs[[4]])
Sequence logo for EcR:Usp heterodimer

(#fig:SeqLogo_EcrUsp)Sequence logo for EcR:Usp heterodimer

The second motif (from JASPAR) represented above is a binding site for the heterodimer composed of the ecdysone receptor (EcR) and its binding partner Ultraspiracle protein (Usp). This imperfect palindromic ecdysone response element is an inverted repeat of the consensus motif AGGTCA (or AGTTCA) separated by 1 nucleotide (IR1 for “inverted repeat separated by 1 nucleotide”).

The package motifStack provides additional plotting functionalities to plot multiple sequence logos, as illustrated below:

#Format the PFMs:
pfms <- as.list(EcRMotifs)
names(pfms) <- gsub("Dmelanogaster-", "", names(pfms))
pfms <- lapply(names(pfms),
               function(x) {new("pfm",
                                mat=pfms[[x]],
                                name = x)})

#Plot
motifStack(pfms, layout="tree")
## Loading required namespace: Cairo
Sequence logos for EcR motifs using motifStack

(#fig:motifStack_Ecr)Sequence logos for EcR motifs using motifStack

For those using ggplot2 for plotting, there is also the excellent ggseqlogo package.

3.2 Scanning a sequence with a Position-weight matrix

A Position weight matrix (PWM) can be used to scan one or multiple “subject” sequences to search the location of the corresponding motif.
At each position, the “subject” sequence is given a score based on the values present in the PWM.

Principle of sequence scoring with a PWM
Principle of sequence scoring with a PWM

We select the JASPAR2014 motif:

EcrJASP <- EcRMotifs[[4]]

MotifDb returns a PFM with columns summing to 1. The number of sequences used to build the PFM are available in the elementMetadata or mcols of the object. We use this information to obtain a PWM:

nseq <- as.integer(mcols(EcRMotifs[4])$sequenceCount)
ecrpfm <- apply(round(nseq * EcrJASP,0), 2, as.integer)
rownames(ecrpfm) <- rownames(EcrJASP)
EcrJASP <- PWM(ecrpfm)

Here, we use the default background nucleotide frequencies (25% for each A, T, G and C) but this should be adapted depending on the genome studied.

Search for this motif on both strands of chromosome 2L:

EcRJASP_2L=matchPWM(EcrJASP,
                    Dmelanogaster$chr2L,
                    min.score='90%')
EcRJASP_2L_rev=matchPWM(reverseComplement(EcrJASP),
                        Dmelanogaster$chr2L,
                        min.score='90%')
#One way to merge the results:
EcRJASP_2L_all <- Views(Dmelanogaster$chr2L,
                        union(ranges(EcRJASP_2L),
                              ranges(EcRJASP_2L_rev)))

We can use the matchPWM function to search a motif directly in a BSgenome object.
In this case, the search is automatically done on both strands and the function returns a convenient GRanges object.

EcRJASP_all <- matchPWM(EcrJASP, Dmelanogaster, min.score="90%")
EcRJASP_all
## GRanges object with 70653 ranges and 2 metadata columns:
##            seqnames            ranges strand |     score          string
##               <Rle>         <IRanges>  <Rle> | <numeric>  <DNAStringSet>
##       [1]     chr2L         8094-8108      + |  0.910677 AAAGTCAGTGAAACC
##       [2]     chr2L         8746-8760      + |  0.905807 GAGGTCATTAACTTT
##       [3]     chr2L       17627-17641      + |  0.924109 CCCTTTAATGAACTA
##       [4]     chr2L       19791-19805      + |  0.902867 AAAATAAATGACCCC
##       [5]     chr2L       20432-20446      + |  0.924109 CCCTTTAATGAACTA
##       ...       ...               ...    ... .       ...             ...
##   [70649] chrUextra 28702900-28702914      - |  0.904921 AAGGACATTGTAATC
##   [70650] chrUextra 28715402-28715416      - |  0.942264 ACATTCAATGCACTT
##   [70651] chrUextra 28795481-28795495      - |  0.907944 AGGGTTATTGTCTCA
##   [70652] chrUextra 28808899-28808913      - |  0.907944 AGGGTTATTGTCTCA
##   [70653] chrUextra 28834409-28834423      - |  0.901153 CCAATCATTGCCTTA
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

3.3 TFBSTools

TFBS stands for Transcription Factor Binding Site.
The TFBSTools package (Tan and Lenhard 2016) includes several features to manipulate and analyze sequence motifs or potential TFBS:

  • Containers for motifs (XMatrix), collections of motifs (XMatrixList), sets of motifs obtained with MEME (MotifSet) or from a motifSearch on a sequence (SiteSet)
  • Containers and graphical representations of TFFM (Transcription Factor Flexible Model), another type of motif representation (Mathelier and Wasserman 2013)
  • Functions to convert between PWM (position-weight matrix), PFM (position frequency matrix) and ICM (information content matrix)
  • Functions to compare motifs as matrices (PFMs/PWMs) or with IUPAC strings
  • Functions to scan sequences or whole genome (searchSeq) and to scan aligned sequences (searchAln) or genomes (searchPairBSgenome)
  • Functions to query the JASPAR database, using Bioconductor packages such as JASPAR2024
  • Wrapper function for the de novo motif dicovery tool MEME
Common workflow and classes in TFBSTools (Tan and Lenhard 2016)
Common workflow and classes in TFBSTools (Tan and Lenhard 2016)

We search for the EcR:Usp binding motif in JASPAR2018 database:

EcR2018 <- TFBSTools::getMatrixByID(JASPAR2018, "MA0534")

Here is its sequence logo:

TFBSTools::seqLogo(toICM(EcR2018))
Sequence logo for EcR motif (JASPAR2018)

(#fig:SeqLogo_Ecr2018)Sequence logo for EcR motif (JASPAR2018)

Compare EcR PWM with the PWM for a human nuclear receptor ESR1, the Drosophila CTCF or a randomly permuted PWM:

#hESR1:
ESR1 <- TFBSTools::getMatrixByID(JASPAR2018, "MA0112")
PWMSimilarity(toPWM(EcR2018), toPWM(ESR1), method = "Pearson")
## [1] 0.5350265
#dCTCF
CTCF <- TFBSTools::getMatrixByID(JASPAR2018, "MA0531")
PWMSimilarity(toPWM(EcR2018), toPWM(CTCF), method = "Pearson")
## [1] 0.4987823
#random permutation:
PWMSimilarity(toPWM(EcR2018), toPWM(permuteMatrix(EcR2018)), method = "Pearson")
## [1] 0.2544939

Search for EcR motif on chr2L:

EcR2018_on2L <- searchSeq(toPWM(EcR2018), Dmelanogaster$chr2L, min.score="90%")

searchPairBSgenome is another interesting function that searches for a motif in regions that are conserved between 2 genomes given as BSgenome objects.

3.4 Scanning sequences with strings

Depending on how many patterns/motif we have and how many sequences we want to search these patterns in, different functions are available in the Biostrings package:

Biostrings functions for pattern matching
Biostrings functions for pattern matching

3.4.1 Scanning multiple sequences with one string

A pattern or motif can also be represented as a character string possibly using the IUPAC ambiguity code.
Here, for simplicity, we will represent ambiguities as ‘N’.

To perform pattern matching, we define a consensus sequence for the IR1 ecdysone response element:

EcR_IR1_cons = consensusString(EcRMotifs[[4]], ambiguityMap="N")
EcR_IR1_cons
## [1] "NAGTTCATTGACCTT"
EcR_IR1_cons=substring(EcR_IR1_cons, first = 2) # remove the first N that is useless here
EcR_IR1_cons
## [1] "AGTTCATTGACCTT"

Similarly, we can build consensus sequences for the EcR itself or for its binding partner Ultraspiracle (Usp):

EcR_cons=substring(consensusString(reverseComplement(EcRMotifs[[1]]),
                                   ambiguityMap="N"),
                   first=2)
Usp_cons=substring(consensusString(MotifDb::query(MotifDb,"Usp")[[143]],
                                   ambiguityMap="N"),
                   first=1,
                   last=9)

Now, we search for the EcR consensus in chromosome 2L:

EcR_on_2L = matchPattern(EcR_cons, Dmelanogaster$chr2L)
EcR_on_2L_rev = matchPattern(reverseComplement(DNAString(EcR_cons)),
                             Dmelanogaster$chr2L)
EcR_on_2L_all = Views(Dmelanogaster$chr2L,
                      union(ranges(EcR_on_2L),
                            ranges(EcR_on_2L_rev)))

Note that we have found a perfect palindrome (IR0):

EcR_on_2L_all[width(EcR_on_2L_all)!=7]
## Views on a 23011544-letter DNAString subject
## subject: CGACAATGCACGACAGAGGAAGCAGAACAGATAT...ATATTTGCAAATTTTGATGAACCCCCCTTTCAAA
## views:
##         start     end width
##   [1] 7208722 7208733    12 [AGGTCATGACCT]

3.4.2 Scanning multiple sequences with one string.

It is also possible to search for a single pattern in several subject sequences using the vmatchPattern function.
A typical example is searching for a pattern in promoters or ChIP-seq peaks.

Search for the EcR pattern in TSS upstream sequences:

EcR_on_up=vmatchPattern(EcR_cons, dm3_upstream)

Get the number of matches per subject element:

nmatch_per_seq = elementNROWS(EcR_on_up) #or lengths(EcR_on_up)
table(nmatch_per_seq)
## nmatch_per_seq
##     0     1     2     3 
## 24765  1638    48     3

Let’s take look at one of the upstream sequence with the maximum number of matches:

i0=which.max(nmatch_per_seq)
Views(dm3_upstream[[i0]], EcR_on_up[[i0]])
## Views on a 2000-letter DNAString subject
## subject: AAATATCTATTTTCTTAGTTAAGTCCCATCAAAA...CATTCAAAATAACTGTATTAGTGGACTTTTCTGG
## views:
##       start  end width
##   [1]  1633 1639     7 [AGGTCAT]
##   [2]  1809 1815     7 [AGGTCAT]
##   [3]  1829 1835     7 [AGGTCAT]

3.4.3 Scanning one sequence with multiple strings.

One may also search for multiple patterns in a single subject sequence using matchPDict
Get all PWM matrices available for Drosophila melanogaster:

dm_matrices = MotifDb::query(MotifDb,"dmelanogaster")

We obtain 1437 motifs.

Keep only the motifs that are 8bp-long and get their consensus sequences:

motif_ln = sapply(dm_matrices,ncol)
dm_matrices = dm_matrices[motif_ln==8]
dm_motifs = DNAStringSet(sapply(dm_matrices, consensusString, ambiguityMap="N"))

We obtain 223 motifs that are 8-bp long.

Search for all these motifs in chromosome 2L:

mot8_on_2L=matchPDict(dm_motifs,Dmelanogaster$chr2L,fixed=FALSE)
summary(lengths(mot8_on_2L)) #Number of matches
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     284     782    2905    9319    7426  200822
head(unlist(mot8_on_2L)) #first 6 matches
## IRanges object with 6 ranges and 0 metadata columns:
##                                           start       end     width
##                                       <integer> <integer> <integer>
##   Dmelanogaster-cisbp_1.02-M0111_1.02       118       125         8
##   Dmelanogaster-cisbp_1.02-M0111_1.02       196       203         8
##   Dmelanogaster-cisbp_1.02-M0111_1.02       576       583         8
##   Dmelanogaster-cisbp_1.02-M0111_1.02       654       661         8
##   Dmelanogaster-cisbp_1.02-M0111_1.02      1034      1041         8
##   Dmelanogaster-cisbp_1.02-M0111_1.02      1112      1119         8

The motif most frequently found on chromosome 2L:

names(dm_motifs[which.max(lengths(mot8_on_2L))])
## [1] "Dmelanogaster-cisbp_1.02-M0111_1.02"
toString(dm_motifs[[which.max(lengths(mot8_on_2L))]])
## [1] "NTTNATNN"

The motif less frequently found on chromosome 2L:

names(dm_motifs[which.min(lengths(mot8_on_2L))])
## [1] "Dmelanogaster-FlyFactorSurvey-Adf1_FlyReg_FBgn0000054"
toString(dm_motifs[[which.min(lengths(mot8_on_2L))]])
## [1] "CGACCGCG"

3.4.4 Scanning multiple sequences with multiple strings

Finally, some functions are available to search for multiple patterns in multiple sequences.

Remove the motifs containing N bases and create a dictionary of motifs (the motifs must have the same length):

dm_mot8_dict=PDict(dm_motifs[sapply(dm_motifs, hasOnlyBaseLetters)])

Search for the motifs in TSS upstream sequences:

mot8_count_upstream=vcountPDict(dm_mot8_dict,dm3_upstream)

Number of motifs found:

apply(mot8_count_upstream, 1, sum)
##  [1]  721  716  716  721 3063  432  414  321  321  324  324  635  635  460  460
## [16] 2536  706  706  290 6162 2327  542 3063  724  708 5781 1579 1340  702 4384
## [31] 1939 6411  473  823 4560  742  714  952  432  432 7916 6411  414  419 1284
## [46] 1103  321  321  324  324  574  574 6411 2527 3027 2549  469  469  885 1461
## [61]  532 6411  724 1284 2527  885 6411  724 1284 2527 6411  724 1284 2527  885
## [76]  515 6411  724 1284 2527  885  515 6411  724 1284 2527  515

Number of motif1 per upstream sequence:

table(mot8_count_upstream[1,])
## 
##     0     1     2 
## 25760   667    27

Number of motifs per upstream sequence:

nMot8_perSeq = colSums(mot8_count_upstream)
names(nMot8_perSeq) = names(dm3_upstream)

Plot the number of upstream sequences as a function of the number of motifs:

nMot8_perSeq = nMot8_perSeq[nMot8_perSeq >= 1]
plot(as.integer(names(table(nMot8_perSeq))),
     as.integer(table(nMot8_perSeq)),
         pch = 15, type = "b", log = "y", col = "blue",
         xlab = "Number of 8bp-long motifs",
         ylab = "Number of upstream sequences")
Motifs per upstream sequence.

(#fig:fig_Mot8PerUpSeq)Motifs per upstream sequence.

4 Exporting as fasta

Often, we need to export some of the sequences as fasta files, for example to analyze them in an external software.
Let’s say that we want to extract the promoter sequences that have at least forty 8bp-long motifs. Get the name of these promoters:

myprom40 = names(nMot8_perSeq[nMot8_perSeq >= 40])

We have 12 such promoter sequences.

Now let’s export them as a fasta file:

writeXStringSet(dm3_upstream[myprom40], "myfastafile.fa")

So the writeXStringSet function is the only function you need to remember to export your DNAStringSet or any XStringSet object

Enjoy your sequence analyses in R !

References

Mathelier, Anthony, and Wyeth W Wasserman. 2013. “The Next Generation of Transcription Factor Binding Site Prediction.” PLoS Comput. Biol. 9 (9): e1003214.
Tan, G., and B. Lenhard. 2016. TFBSTools: an R/bioconductor package for transcription factor binding site analysis.” Bioinformatics 32 (10): 1555–56.