1 Introduction

It is important to recognize that a large fraction of the data that we manipulate in genomics consists of genomic intervals.
A genomic interval is defined by:

  • a chromosome or contig
  • a strand (+, - or * for undefined)
  • a start position
  • an end position

In addition, the genomic interval can have metadata attached to it, such as a score, a GC% or any number of quantitative or qualitative data.

Thus:

  1. Ranges allow to represent a wide array of genomic data and annotations
  2. Several biological questions reflect range-based queries

ChIP-seq peaks, gene/exons/enhancer or any other annotations, localization of a motif or a particular sequence are obvious examples of genomic intervals.
But a read aligned to the genome, or a coverage along the genome can also be represented by genomic intervals.

Indeed, many file formats directly contain genomic intervals:

  • gtf/gff files for annotations
  • bed/bigBed or narrowPeak and other ENCODE files for intervals
  • bam/sam files or paf files for aligned sequences
  • vcf files for variants

Additionally, file formats containing coverages such as bedgraph, wig or bigwig files can also be represented as genomic intervals associated with a specific score.

Here are some questions that typically correspond to range-based queries:

  • Which peaks are unique to one condition and not present in the other (comparison of 2 ChIP-seq datasets)?
    => setdiff(peaks_condition1, peaks_condition2) (Set operations)

  • Which genes overlap with my ChIP-seq peaks?
    => findOverlaps(peaks, genes) (Overlap query)

  • What are the distances between my SNPs and the nearest transcription start site (TSS)?
    => nearest(snps, tss) (Nearest-neighbor query)

  • Which exons are fully contained with a given genomic region?
    => subsetByOverlaps(exons, region, type = "within") (Containment query)

  • Are their any enhancers with 5kb upstream of gene TSS?
    => findOverlaps(enhancer, promoters(gene, upstream=5000, downstream=0)) (Proximity-based overlap)

  • What are the regions not covered by my ChIP-seq peaks?
    => gaps(peaks) (Gap calculation)

  • Which genetic variants intersect with coding regions?
    => intersect(variants, coding_regions) (Set operations)

  • What are the sequences located within 250bp of my ChIP-seq peak summits?
    => getSeq(myBSgenome, resize(peaks, width=500, fix = "center")) (Window normalization)

When manipulating genomic intervals, one of the most confusing aspect (together with dealing with poorly formatted gtf/gff files 😉) is linked to the existence of two coordinate systems: 0-based or 1-based.
The figure below, extracted from this post summarizes the two systems:

0-based vs 1-based coordinate systems in genomics To summarize:

  • 0-based: start counting from 0, end is exclusive
  • 1-based: start counting from 1, end is inclusive

In Bioconductor, the 1-based system is always used.
However:

  • bed files use the 0-based system
  • gtf/gff, sam/bam or vcf files use the 1-based system

Thus it is important, when you import/export your data to/from R/Bioconductor, that you use functions that are aware of these different standards.
We will see in paragraph 9 that the rtracklayer package provides an interface to import/export most genomic file formats.

Bioconductor implements a number of tools to manipulate and analyze ranges and specifically genomic ranges (Lawrence et al. 2013)

2 IRanges and accessors

Let’s start with the simpler objects defined in the IRanges package.
An IRanges object is defined as follow:

  • defined by 2 vectors out of start, end, width (SEW ; \(end = start + width - 1\))
  • closed intervals (i.e. include end points)
  • zero-width convention: \(width \geq 0\) ; \(end=start-1 \Leftrightarrow width=0\)
  • can be named

A simple IRanges:

eg = IRanges(start = c(1, 10, 20),
              end = c(4, 10, 19),
              names = c("A", "B", "C"))
eg
## IRanges object with 3 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   A         1         4         4
##   B        10        10         1
##   C        20        19         0

A bigger IRanges:

set.seed(123) #For reproducibility
start = floor(runif(10000, 1, 1000))
end = start + floor(runif(10000, 0, 100))
ir = IRanges(start, end)
ir
## IRanges object with 10000 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]       288       319        32
##       [2]       788       820        33
##       [3]       409       496        88
##       [4]       883       915        33
##       [5]       940       952        13
##       ...       ...       ...       ...
##    [9996]       466       493        28
##    [9997]       899       984        86
##    [9998]       114       125        12
##    [9999]       571       596        26
##   [10000]       900       966        67

IRanges accessors:

length(ir)
## [1] 10000
ir[1:4]
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       319        32
##   [2]       788       820        33
##   [3]       409       496        88
##   [4]       883       915        33
start(ir[1:4])
## [1] 288 788 409 883
width(ir[1:4])
## [1] 32 33 88 33
names(eg)
## [1] "A" "B" "C"

Other useful methods for IRanges:

c(ir[1:2],ir[5:6]) #combining
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       319        32
##   [2]       788       820        33
##   [3]       940       952        13
##   [4]        46        81        36
sort(ir[1:4])
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       319        32
##   [2]       409       496        88
##   [3]       788       820        33
##   [4]       883       915        33
rank(ir[1:4],ties="first")
## [1] 1 3 2 4
mid(ir[1:4]) # midpoints
## [1] 303 804 452 899
tile(ir[1:2],n=2) #returns an IRangesList (see below)
## IRangesList object of length 2:
## [[1]]
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       303        16
##   [2]       304       319        16
## 
## [[2]]
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       788       803        16
##   [2]       804       820        17
rep(ir[1:2],each=2)
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       319        32
##   [2]       288       319        32
##   [3]       788       820        33
##   [4]       788       820        33
isNormal(ir[1:4])
## [1] FALSE
isNormal(sort(ir[1:4])) #see ?'Ranges-class' for Normality definition
## [1] TRUE
isDisjoint(ir[1:4])
## [1] TRUE
match(ir[1:4],ir[4:1]) #see ?'Ranges-comparison' for Ranges comparison methods
## [1] 4 3 2 1
ir[1:4]>ir[4:1] # compares ranges a) first by start and b) then by width 
## [1] FALSE  TRUE FALSE  TRUE

Other methods creating IRanges:

as(c(2:10,8,90:100),"IRanges") #from a vector of integers
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         2        10         9
##   [2]         8         8         1
##   [3]        90       100        11
successiveIRanges(width=rep(10,5),gap=10)
## IRanges object with 5 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        10        10
##   [2]        21        30        10
##   [3]        41        50        10
##   [4]        61        70        10
##   [5]        81        90        10
whichAsIRanges(c(19, 5, 0, 8, 5)>=5) #transforms a logical vector in IRanges
## NormalIRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         2         2
##   [2]         4         5         2

It can be convenient to group ranges in a list (e.g. exons grouped by genes). IRangesList objects serve this purpose.
Accessors and functions for IRanges generally work on IRangesList objects.

irl=split(ir,width(ir)) # an IRangesList
irl[[1]]
## IRanges object with 96 ranges and 0 metadata columns:
##            start       end     width
##        <integer> <integer> <integer>
##    [1]       321       321         1
##    [2]       600       600         1
##    [3]       184       184         1
##    [4]       297       297         1
##    [5]       276       276         1
##    ...       ...       ...       ...
##   [92]       188       188         1
##   [93]       308       308         1
##   [94]       289       289         1
##   [95]       936       936         1
##   [96]       669       669         1
start(irl)
## IntegerList of length 100
## [["1"]] 321 600 184 297 276 816 87 729 407 ... 72 858 52 85 188 308 289 936 669
## [["2"]] 915 576 706 235 678 647 451 138 744 ... 638 66 979 740 300 869 433 645
## [["3"]] 457 415 336 774 487 787 587 352 400 ... 264 60 607 292 709 418 552 102
## [["4"]] 253 75 429 4 785 24 464 869 433 ... 224 827 240 175 459 130 146 681 267
## [["5"]] 498 812 977 120 991 583 959 931 532 ... 311 157 871 538 209 8 37 39 443
## [["6"]] 372 298 241 61 363 351 847 37 916 ... 571 910 929 656 71 839 223 34 602
## [["7"]] 753 198 674 584 850 250 962 14 543 ... 457 175 731 758 953 212 551 153
## [["8"]] 895 803 784 358 366 536 530 323 811 ... 350 959 592 651 487 406 895 53
## [["9"]] 318 478 991 747 769 626 332 436 548 ... 114 546 928 69 415 219 568 546
## [["10"]] 467 674 225 194 464 87 494 364 957 ... 706 267 938 259 685 700 435 937
## ...
## <90 more elements>
head(elementNROWS(irl)) # head(lengths(irl)) also works (with an "S" at the end of lengthS)
##   1   2   3   4   5   6 
##  96  83 108  95  84 110

3 intra- and inter-range operations

3.1 Intra-range operations

These operations apply on each range of an IRanges object:

ir[1:2]
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       319        32
##   [2]       788       820        33
shift(ir[1:2],shift=10)
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       298       329        32
##   [2]       798       830        33
resize(ir[1:2],width=100,fix="start")
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       387       100
##   [2]       788       887       100
flank(ir[1:2],width=100,start=T)
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       188       287       100
##   [2]       688       787       100
narrow(ir[1:2],start=1,width=30) #here 'start' is relative
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       317        30
##   [2]       788       817        30
ir[1:2]+10
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       278       329        52
##   [2]       778       830        53
ir[1:2]-10
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       298       309        12
##   [2]       798       810        13
ir[1:2]+c(0,10)
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       319        32
##   [2]       778       830        53
ir[1:4]*-10 ; ir[1:4]*10 # acts like a centered zoom
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       144       463       320
##   [2]       639       968       330
##   [3]        13       892       880
##   [4]       734      1063       330
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       302       304         3
##   [2]       803       805         3
##   [3]       449       456         8
##   [4]       898       900         3
ir[1:2]*c(1,2) #zoom second range by 2X
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       288       319        32
##   [2]       796       811        16

See help('intra-range-methods',package="IRanges") for more.

3.2 Inter-range operations

To illustrate these operations, we first build a function to plot ranges (from the IRanges vignette):

plotRanges <- function(x,
                       xlim = x,
                       main = deparse(substitute(x)),
                       col = "black", sep = 0.5,
                       cextitle=1, cexaxis=1, xaxis=T,...)
{
  height <- 1
  if (is(xlim, "Ranges"))
    xlim <- c(min(start(xlim)), max(end(xlim)))
    bins <- disjointBins(IRanges(start(x), end(x) + 1))
    plot.new()
    plot.window(xlim, c(0, max(bins)*(height + sep)))
    ybottom <- bins * (sep + height) - height
    rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col = col, ...)
  if (xaxis)
    (axis(1,cex.axis=cexaxis,padj=1))
  title(main,cex.main=cextitle)
}

These inter-range operations, called endomorphisms, apply on a set of ranges and return a set of ranges:

#select some ranges in [1:100]:
irs=ir[which(start(ir)<=100 & 
               end(ir)<=100)[c(3:4,8,14,17:18)]] 
irs
## IRanges object with 6 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        25        44        20
##   [2]        46        60        15
##   [3]        53        65        13
##   [4]         9        33        25
##   [5]         1        50        50
##   [6]        87        96        10
reduce(irs)
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        65        65
##   [2]        87        96        10
disjoin(irs)
## IRanges object with 10 ranges and 0 metadata columns:
##            start       end     width
##        <integer> <integer> <integer>
##    [1]         1         8         8
##    [2]         9        24        16
##    [3]        25        33         9
##    [4]        34        44        11
##    [5]        45        45         1
##    [6]        46        50         5
##    [7]        51        52         2
##    [8]        53        60         8
##    [9]        61        65         5
##   [10]        87        96        10
gaps(irs)
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        66        86        21
coverage(irs)
## integer-Rle of length 96 with 11 runs
##   Lengths:  8 16  9 11  1  5  2  8  5 21 10
##   Values :  1  2  3  2  1  2  1  2  1  0  1

See help('inter-range-methods',package="IRanges") for other methods.

We can illustrate these methods with:

par(mfrow=c(5,1), mar=rep(3,4))
plotRanges(irs,xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(reduce(irs),xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(disjoin(irs),xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(gaps(irs),xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plot(1:100,c(coverage(irs),rep(0,4)),type="l",axes=F,xlab="",ylab="",lwd=3)
title(main="coverage",cex.main=2)
axis(side=2,lwd=2,cex.axis=2,at=0:3,labels=0:3)
axis(1,lwd=2,cex.axis=2,padj=1)
Inter-range operations.

(#fig:fig_inter-range_ops)Inter-range operations.

3.3 Set operations

See help(’setops-methods’,package="IRanges") for details.
The union and intersect functions for IRanges:

union(irs[1:3],irs[4:6])
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        65        65
##   [2]        87        96        10
intersect(irs[1:3],irs[4:6])
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        25        44        20
##   [2]        46        50         5
par(mfrow=c(4,1))
 plotRanges(irs[1:3], xlim=c(0,100), xaxis=T, 
            cextitle=2, cexaxis=1.5)
 plotRanges(irs[4:6], xlim=c(0,100), xaxis=T, 
            cextitle=2, cexaxis=1.5)
 plotRanges(union(irs[1:3], irs[4:6]), xlim=c(0,100), xaxis=T, 
            cextitle=2, cexaxis=1.5)
 plotRanges(intersect(irs[1:3], irs[4:6]), xlim=c(0,100), xaxis=T, 
            cextitle=2, cexaxis=1.5)
Inter-range operations.

(#fig:fig_IRanges_setops_unionintersect)Inter-range operations.

The setdiff function for asymmetric differences:

setdiff(irs[1:3],irs[4:6])
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        51        65        15
setdiff(irs[4:6],irs[1:3])
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1        24        24
##   [2]        45        45         1
##   [3]        87        96        10
par(mfrow=c(4,1))
 plotRanges(irs[1:3],xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
 plotRanges(irs[4:6],xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
 plotRanges(setdiff(irs[1:3],irs[4:6]),xlim=c(0,100),xaxis=T, 
            cextitle=2, cexaxis=1.5)
 plotRanges(setdiff(irs[4:6],irs[1:3]),xlim=c(0,100),xaxis=T, 
            cextitle=2, cexaxis=1.5)
Asymetric differences with setdiff on IRanges.

(#fig:fig_IRanges_setops_setdiff)Asymetric differences with setdiff on IRanges.

The same functions can be applied in a parallel fashion (i.e. on the first elements of the provided IRanges, then on the second elements, etc.)

punion(irs[1:2],irs[4:5]) #element-wise (aka "parallel") union
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         9        44        36
##   [2]         1        60        60
pintersect(irs[1:2],irs[4:5])
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        25        33         9
##   [2]        46        50         5
psetdiff(irs[1:3],irs[4:6]) # asymmetric! difference
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        34        44        11
##   [2]        51        60        10
##   [3]        53        65        13
pgap(irs[1:3],irs[4:6])
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        34        33         0
##   [2]        51        50         0
##   [3]        66        86        21

3.4 Nearest methods

See help(’nearest-methods’,package="IRanges") for details and examples:

nearest(irs[4:6],irs[1:3])
## [1] 1 1 3
distance(irs[4:6],irs[1:3])
## [1]  0  0 21

We will se other examples on GRanges objects in paragraph 6.4

3.5 Between ranges operations

These are mainly methods to find overlaps between ranges.
Examples of using these functions on GRanges objects are provided in paragraph 6.5
See also help(’findOverlaps-methods’,package="IRanges") for details.

4 Run-length encoding (Rle)

Run-length encoding is a data compression method highly adapted to long vectors containing repeated values (e.g. coverage on a whole chromosome).
For example, the sequence \(\{1,1,1,1,1,1,2,3,3\}\) can be represented as:

  • \(values=\{1,2,3\}\)
    and the paired
  • \(run lengths=\{6,1,2\}\).

In IRanges, the Rle class is used to represent run-length encoded (compressed) atomic vectors.

Rle objects:

set.seed(123) #for reproducibility
lambda = c(rep(0.001, 3500), #From IRanges vignette
           seq(0.001, 10, length = 500), 
            seq(10, 0.001, length = 500))
xRle=Rle(rpois(1e4, lambda))
yRle=Rle(rpois(1e4, lambda[c(251:length(lambda), 1:250)]))

xRle
## integer-Rle of length 10000 with 1630 runs
##   Lengths:  326    1 1150    1  772    1  395 ...    1  170    1  466    1  262
##   Values :    0    1    0    1    0    1    0 ...    1    0    1    0    1    0
yRle
## integer-Rle of length 10000 with 1616 runs
##   Lengths: 1984    1 1268    1   15    1    1 ...    1    1    1    8    1 1264
##   Values :    0    1    0    1    0    1    0 ...    1    0    1    0    1    0

as.vector(object.size(xRle)/object.size(as.vector(xRle))) #Gain of memory
## [1] 0.3545745

head(runValue(xRle))
## [1] 0 1 0 1 0 1
head(runLength(xRle))
## [1]  326    1 1150    1  772    1

head(start(xRle)) #starts of the runs
## [1]    1  327  328 1478 1479 2251
head(end(xRle)) #ends of the runs
## [1]  326  327 1477 1478 2250 2251

nrun(xRle) #number of runs
## [1] 1630
findRun(as.integer(c(100,200,300,1200)),xRle)
## [1] 1 1 1 3
coverage(irs)
## integer-Rle of length 96 with 11 runs
##   Lengths:  8 16  9 11  1  5  2  8  5 21 10
##   Values :  1  2  3  2  1  2  1  2  1  0  1

These objects support the basic methods associated with R atomic vectors:

xRle + yRle
## integer-Rle of length 10000 with 2111 runs
##   Lengths:  326    1 1150    1  506    1  265 ...    1  170    1  466    1  262
##   Values :    0    1    0    1    0    1    0 ...    1    0    1    0    1    0
xRle > 0
## logical-Rle of length 10000 with 211 runs
##   Lengths:   326     1  1150     1   772 ...   170     1   466     1   262
##   Values : FALSE  TRUE FALSE  TRUE FALSE ... FALSE  TRUE FALSE  TRUE FALSE
xRle > yRle
## logical-Rle of length 10000 with 367 runs
##   Lengths:   326     1  1150     1   772 ...   170     1   466     1   262
##   Values : FALSE  TRUE FALSE  TRUE FALSE ... FALSE  TRUE FALSE  TRUE FALSE
max(xRle)
## [1] 18
summary(xRle)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.9932  0.0000 18.0000
sqrt(xRle)
## numeric-Rle of length 10000 with 1630 runs
##   Lengths:     326       1    1150       1 ...       1     466       1     262
##   Values : 0.00000 1.00000 0.00000 1.00000 ... 1.00000 0.00000 1.00000 0.00000
rev(xRle)
## integer-Rle of length 10000 with 1630 runs
##   Lengths:  262    1  466    1  170    1  105 ...    1  772    1 1150    1  326
##   Values :    0    1    0    1    0    1    0 ...    1    0    1    0    1    0
table(xRle)
## xRle
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 8200  201  198  202  212  184  167  146  140  101   89   49   51   25   12    6 
##   16   17   18 
##    7    8    2
union(xRle, yRle)
## integer-Rle of length 20 with 20 runs
##   Lengths:  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   Values :  0  1  3  2  4  5  6  7  8  9 11 12 10 13 17 14 16 15 18 22
cor(xRle, yRle)
## [1] 0.5690394

See ?’Rle-class’ and ?’Rle-utils’ for other methods.

There are useful functions to perform fixed-width running window summaries:

runmean(xRle, k=100) # See ?'Rle-runstat' for other examples
## numeric-Rle of length 9901 with 1852 runs
##   Lengths:  227  100 1051  100  673  100  296 ...  100   71  100  367  100  163
##   Values : 0.00 0.01 0.00 0.01 0.00 0.01 0.00 ... 0.01 0.00 0.01 0.00 0.01 0.00
#same result, more flexible but much slower:
Rle(aggregate(xRle, start = 1:(length(xRle)-99), width = 100, FUN = mean))
## numeric-Rle of length 9901 with 1852 runs
##   Lengths:  227  100 1051  100  673  100  296 ...  100   71  100  367  100  163
##   Values : 0.00 0.01 0.00 0.01 0.00 0.01 0.00 ... 0.01 0.00 0.01 0.00 0.01 0.00
runq(xRle, k=100, i=10) #10th smallest value in windows of 100
## integer-Rle of length 9901 with 41 runs
##   Lengths: 3558    7    2    6    4  102    3 ...    2    2   80   59   53 1090
##   Values :    0    1    0    1    0    1    2 ...    3    4    3    2    1    0

One typical application of Rle objects is to store the coverage of NGS reads along a chromosome.
The coverage for all chromosomes can be stored in an RleList object (see ?AtomicList for details) in which each chromosome will be an Rle element of the list.
Most functions defined for Rle objects also work on RleList objects.

Practically, any variable defined along a genome can be represented as

  • an RleList with one Rle for each chromosome or
  • the mcols (metadata columns) of a GRanges object (see below)

Sometimes, it is desirable to manipulate several of these variables in the same object (e.g. for plotting with Gviz).
The ?genomicvars help page provides useful functions such as bindAsGRanges and mcolAsRleList to switch from one representation to the other.

5 Views

As we mentioned when looking at the Biostrings package, Views objects are used to store a sequence (a Vector object called the “subject”) together with a set of ranges which define views onto the sequence.
Specific subclasses exist for different classes of “subject” Vectors, such as RleViews from the IRanges package and XStringViews from the Biostrings package.

An RleViews object stores an Rle subject and its views:

coverage(irs)
## integer-Rle of length 96 with 11 runs
##   Lengths:  8 16  9 11  1  5  2  8  5 21 10
##   Values :  1  2  3  2  1  2  1  2  1  0  1

irs_views=Views(coverage(irs),start=c(-5,10,20,90),end=c(10,30,50,100))
irs_views #Views can be out of bound
## Views on a 96-length Rle subject
## 
## views:
##     start end width
## [1]    -5  10    16 [1 1 1 1 1 1 1 1 2 2 ...]
## [2]    10  30    21 [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3]
## [3]    20  50    31 [2 2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 ...]
## [4]    90 100    11 [1 1 1 1 1 1 1 ...]
try(irs_views[[1]]) #but can't be extracted
## Error in getListElement(x, i, ...) : view is out of limits

irs_views[[2]]
## integer-Rle of length 21 with 2 runs
##   Lengths: 15  6
##   Values :  2  3

start(irs_views)
## [1] -5 10 20 90

See ?’RleViews-class’ for details.

Other ways to create Views:

Views(coverage(irs),irs[c(1,2,6)]) #use an IRanges to extract Views
## Views on a 96-length Rle subject
## 
## views:
##     start end width
## [1]    25  44    20 [3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2]
## [2]    46  60    15 [2 2 2 2 2 1 1 2 2 2 2 2 2 2 2]
## [3]    87  96    10 [1 1 1 1 1 1 1 1 1 1]
Views(coverage(irs),coverage(irs)>=1) #or a logical Rle
## Views on a 96-length Rle subject
## 
## views:
##     start end width
## [1]     1  65    65 [1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 ...]
## [2]    87  96    10 [1 1 1 1 1 1 1 1 1 1]
slice(coverage(irs), 3) #use slice
## Views on a 96-length Rle subject
## 
## views:
##     start end width
## [1]    25  33     9 [3 3 3 3 3 3 3 3 3]
successiveViews(coverage(irs),width=rep(20,4)) #get successive Views
## Views on a 96-length Rle subject
## 
## views:
##     start end width
## [1]     1  20    20 [1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2]
## [2]    21  40    20 [2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2]
## [3]    41  60    20 [2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2]
## [4]    61  80    20 [1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Furthermore there is a ViewsList virtual class. Its specialized subclass RleViewsList is useful to store coverage vectors along with their specific views over a set of chromosomes.

xyRleList=RleList(xRle,yRle)
xyRleList_views=Views(xyRleList,
                      IRangesList(IRanges(start=c(3700,4000),width=20),
                                  IRanges(yRle>17)+2))
xyRleList_views[[1]]
## Views on a 10000-length Rle subject
## 
## views:
##     start  end width
## [1]  3700 3719    20 [1 0 2 7 2 9 5 3 3 6 4 4 3 7 5 5 3 5 8 1]
## [2]  4000 4019    20 [ 9 13 13  8  7 14  3 10 16  8 17  5  8  8 15 17 12 ...]
xyRleList_views[[2]]
## Views on a 10000-length Rle subject
## 
## views:
##     start  end width
## [1]  3750 3754     5 [10 13 18 12 14]
## [2]  8250 8254     5 [12  8 22 10  9]
width(xyRleList_views)
## IntegerList of length 2
## [[1]] 20 20
## [[2]] 5 5

Specific functions are provided for fast looping over Views and ViewsList objects:

viewMins(irs_views) #same as min(irs_views)
## [1] 1 2 1 1
viewSums(irs_views) #same as sum(irs_views)
## [1] 12 48 70  7
viewWhichMaxs(irs_views) #get the (first) coordinate of viewMaxs 
## [1]  9 25 25 90
# (which.max also works)
viewRangeMins(irs_views) #get the (first) range of viewMins
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         8         8
##   [2]         9        24        16
##   [3]        45        45         1
##   [4]        87        96        10
viewApply(irs_views,sd)
## [1] 0.4216370 0.4629100 0.5143113 0.0000000
viewMeans(xyRleList_views)
## NumericList of length 2
## [[1]] 4.15 10.35
## [[2]] 13.4 12.2

Note that the min , max , sum , mean , which.min and which.max functions also work on Views (but not on RleViewsList yet).
The corresponding view* functions might be deprecated in the future.
See ?'view-summarization-methods' for details.

6 GenomicRanges

6.1 GRanges objects

Here, we will present some of the classes and functions defined in the GenomicRanges package.
This package is central to Bioconductor users who analyze NGS data, so you should consider reading thoroughly the excellent vignettes associated with this package.

The main class defined in GenomicRanges is the GRanges class which acts as a container for genomic locations and their associated annotations (see ?GRanges).

GRanges (and GRangesList) build on IRanges (and IRangesList respectively) with the following specificities:

  • The informations on seqnames (typically chromosomes) and strand is stored along with the information on ranges (SEW)
  • An optional (but recommended) seqinfo slot contains information on the sequences: names, length (seqlengths), circularity and genome
  • Optional metadata columns (mcols) containing additional information on each range (e.g. score, GC content, etc.) which are stored as a DataFrame

By convention, in Bioconductor genomic coordinates:

  • are 1-based
  • are left-most, i.e. ‘start’ of ranges on the minus strand are the left-most coordinate, rather than the 5’ coordinate
  • represent closed intervals, i.e. the intervals contain start and end coordinates

Here is the general structure of a GRanges object:

Structure of a GRanges object It contains:

  • seqnames are the chromosome/contigs on which the intervals are defined. Stored as an Rle
  • ranges are the intervals themselves. Stored as an IRanges object
  • strand the strand on which the intervals are located (“+”, “-” or “*” for undefined strand). Stored as an Rle
  • Optional metadata columns. Stored as a DataFrame object (defined in the S4Vectors package)
  • Optional (recommended) seqinfo which contains the following information: name, length, circularity and genome on the seqnames. Stored as a Seqinfo obect (defined in the GenomeInfoDb package)

It is important to understand that, despite looking like a table, GRanges object are vector-like, or rather list-like objects.
They have a length (the number of intervals) but do not have a dim (or nrow / ncol).

The GRanges function can be used to create a GRanges object:

genes = GRanges(seqnames   = c("chr2L", "chrX"),
                ranges     = IRanges(start = c(7529, 18962306),
                                     end = c(9484, 18962925),
                                     names = c("FBgn0031208", "FBgn0085359")),
                strand     = c("+", "-"),
                seqlengths = c(chr2L=23011544L, chrX=22422827L))

slotNames(genes)
## [1] "seqnames"        "ranges"          "strand"          "seqinfo"        
## [5] "elementMetadata" "elementType"     "metadata"

mcols(genes) = DataFrame(EntrezId = c("33155", "2768869"),
                         Symbol = c("CG11023", "CG34330"))
genome(genes)="dm3" #see ?seqinfo for details
genes
## GRanges object with 2 ranges and 2 metadata columns:
##               seqnames            ranges strand |    EntrezId      Symbol
##                  <Rle>         <IRanges>  <Rle> | <character> <character>
##   FBgn0031208    chr2L         7529-9484      + |       33155     CG11023
##   FBgn0085359     chrX 18962306-18962925      - |     2768869     CG34330
##   -------
##   seqinfo: 2 sequences from dm3 genome

Note that there are different ways to define a GRanges object.
For example, you can also use a genome browser-style syntax to provide the genomic intervals:

GRanges(c("Chr1:1-100:+", 
          "Chr2:150..200:-"))
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     Chr1     1-100      +
##   [2]     Chr2   150-200      -
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

The GRanges accessors include IRanges accessors and others:

width(genes)
## [1] 1956  620
names(genes)
## [1] "FBgn0031208" "FBgn0085359"
seqnames(genes)
## factor-Rle of length 2 with 2 runs
##   Lengths:     1     1
##   Values : chr2L chrX 
## Levels(2): chr2L chrX
strand(genes)
## factor-Rle of length 2 with 2 runs
##   Lengths: 1 1
##   Values : + -
## Levels(3): + - *
ranges(genes)
## IRanges object with 2 ranges and 0 metadata columns:
##                   start       end     width
##               <integer> <integer> <integer>
##   FBgn0031208      7529      9484      1956
##   FBgn0085359  18962306  18962925       620
genes$Symbol
## [1] "CG11023" "CG34330"
mcols(genes)
## DataFrame with 2 rows and 2 columns
##                EntrezId      Symbol
##             <character> <character>
## FBgn0031208       33155     CG11023
## FBgn0085359     2768869     CG34330
seqinfo(genes)
## Seqinfo object with 2 sequences from dm3 genome:
##   seqnames seqlengths isCircular genome
##   chr2L      23011544         NA    dm3
##   chrX       22422827         NA    dm3
seqlevels(genes)
## [1] "chr2L" "chrX"

6.2 GRangesList

As for IRangesList, there is a GRangesList class which allows to store GRanges in a list-type object.
This is typically used to store e.g. GRanges defining exons arranged by transcripts or genes or GRanges defining transcripts arranged by genes.
The vignette of the GenomicRanges package provides a good example of 2 transcripts, one of which has 2 exons:

gr1 = GRanges(seqnames = "chr2", ranges = IRanges(3, 6),
              strand = "+", score = 5L, GC = 0.45)
gr2 = GRanges(seqnames = c("chr1", "chr1"),
              ranges = IRanges(c(7, 13), width = 3),
              strand = c("+", "-"), score = 3:4, GC = c(0.3, 0.5))
grl = GRangesList("txA" = gr1, "txB" = gr2)
grl
## GRangesList object of length 2:
## $txA
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score        GC
##          <Rle> <IRanges>  <Rle> | <integer> <numeric>
##   [1]     chr2       3-6      + |         5      0.45
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
## 
## $txB
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     score        GC
##          <Rle> <IRanges>  <Rle> | <integer> <numeric>
##   [1]     chr1       7-9      + |         3       0.3
##   [2]     chr1     13-15      - |         4       0.5
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
length(grl)
## [1] 2
elementNROWS(grl) #or lengths(grl)
## txA txB 
##   1   2
grl["txB","GC"]
## GRangesList object of length 1:
## $txB
## GRanges object with 2 ranges and 1 metadata column:
##       seqnames    ranges strand |        GC
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr1       7-9      + |       0.3
##   [2]     chr1     13-15      - |       0.5
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
unlist(grl)
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand |     score        GC
##          <Rle> <IRanges>  <Rle> | <integer> <numeric>
##   txA     chr2       3-6      + |         5      0.45
##   txB     chr1       7-9      + |         3      0.30
##   txB     chr1     13-15      - |         4      0.50
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Please refer to the introductory vignette for further examples.
Most accessors and functions defined for GRanges also work on GRangesList.
However, note that mcols(grl) now refers to metadata at the list level rather than at level of the individual GRanges objects.

6.3 Annotations as GenomicRanges: TxDb* packages

Here, we will briefly present the Bioconductor annotation packages which can provide genome-wide annotations directly as GRanges and GRangesList objects.
These packages, called TxDb* are databases that contain the annotation from a given organism (i.e. intervals representing genes, transcripts, exons, CDS, etc.).
They can be built using the txdbmaker package, but TxDb packages are already available for many organisms and genome versions (see the corresponding BioCViews)
The functions to query these objects are defined in the GenomicFeatures package.
Here, we illustrate how to extract information from the TxDb.Dmelanogaster.UCSC.dm3.ensGene package. Let’s load the package

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

Loading the package creates/loads the object/database TxDb.Dmelanogaster.UCSC.dm3.ensGene

A GRanges object with the genomic coordinates for the genes can be obtained using:

Dmg=genes(TxDb.Dmelanogaster.UCSC.dm3.ensGene, single.strand.genes.only=T)
Dmg
## GRanges object with 15682 ranges and 1 metadata column:
##               seqnames            ranges strand |     gene_id
##                  <Rle>         <IRanges>  <Rle> | <character>
##   FBgn0000003    chr3R   2648220-2648518      + | FBgn0000003
##   FBgn0000008    chr2R 18024494-18060346      + | FBgn0000008
##   FBgn0000014    chr3R 12632936-12655767      - | FBgn0000014
##   FBgn0000015    chr3R 12752932-12797958      - | FBgn0000015
##   FBgn0000017    chr3L 16615470-16640982      - | FBgn0000017
##           ...      ...               ...    ... .         ...
##   FBgn0264723    chr3L 12238610-12239148      - | FBgn0264723
##   FBgn0264724    chr3L 15327882-15329271      + | FBgn0264724
##   FBgn0264725    chr3L 12025657-12026099      + | FBgn0264725
##   FBgn0264726    chr3L 12020901-12021253      + | FBgn0264726
##   FBgn0264727    chr3L 22065469-22065720      + | FBgn0264727
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Other functions, such as transcripts, exons, cds, promoters, terminators, microRNAs and tRNAs defined in the GenomiFeatures package allow to extract the corresponding genomic features.

A GRangesList object with the coordinates of the transcripts arranged by genes can be obtained using:

Dmt=transcriptsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="gene")
Dmt
## GRangesList object of length 15682:
## $FBgn0000003
## GRanges object with 1 range and 2 metadata columns:
##       seqnames          ranges strand |     tx_id     tx_name
##          <Rle>       <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R 2648220-2648518      + |     17345 FBtr0081624
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome
## 
## $FBgn0000008
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2R 18024494-18060339      + |      7681 FBtr0100521
##   [2]    chr2R 18024496-18060346      + |      7682 FBtr0071763
##   [3]    chr2R 18024938-18060346      + |      7683 FBtr0071764
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome
## 
## $FBgn0000014
## GRanges object with 4 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R 12632936-12655767      - |     21863 FBtr0306337
##   [2]    chr3R 12633349-12653845      - |     21864 FBtr0083388
##   [3]    chr3R 12633349-12655300      - |     21865 FBtr0083387
##   [4]    chr3R 12633349-12655474      - |     21866 FBtr0300485
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome
## 
## ...
## <15679 more elements>

The functions exonsBy, cdsBy, intronsByTranscript, fiveUTRsByTranscript, threeUTRsByTranscript allow to extract the corresponding genomic features arranged in a GRangesList object.

Other functions such as transcriptsByOverlaps and exonsByOverlaps allow to extract genomic features for genomic locations specified by a GRanges object:

exonsByOverlaps(TxDb.Dmelanogaster.UCSC.dm3.ensGene,genes)
## GRanges object with 6 ranges and 1 metadata column:
##       seqnames            ranges strand |   exon_id
##          <Rle>         <IRanges>  <Rle> | <integer>
##   [1]    chr2L         7529-8116      + |         1
##   [2]    chr2L         8193-8589      + |         2
##   [3]    chr2L         8193-9484      + |         3
##   [4]    chr2L         8229-9484      + |         4
##   [5]    chr2L         8668-9484      + |         5
##   [6]     chrX 18962306-18962925      - |     75268
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

When a TxDb package is paired with an appropriate BSgenome object, it is relatively straightforward to extract DNA sequences providing a GRanges or a GRangesList

getSeq(BSgenome.Dmelanogaster.UCSC.dm3, Dmg[1:2])
## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]   299 TCGACTGGAAGGTTGGCAGCTTC...CCGATATGGTTGGACCACAATCT FBgn0000003
## [2] 35853 CGCGGCGGTCGCATCGGAGTCGA...TATTTACCTGAAAAGCAATATAC FBgn0000008
Dmc = cdsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="tx")
cds_seq = extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, Dmc[1:2])
cds_seq
## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]   855 ATGGGCGAGCGGGATCAGCCACA...CGGTATGGCAACGAATATATTGA 1
## [2]  1443 ATGGGCGAGCGGGATCAGCCACA...AAATCGTCGACGGAGAGTTGTGA 2
translate(cds_seq)
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]   285 MGERDQPQSSERISIFNPPVYTQ...EHHDRFNEITQDDKSTVWQRIY* 1
## [2]   481 MGERDQPQSSERISIFNPPVYTQ...FTQSEMLYFRKKMALEIVDGEL* 2

There are several other useful functions defined in the GenomiFeatures package, such as:

  • transcriptLengths to extract different metrics on the transcripts (length, number of exons, length of 5’ and 3’ UTRs)
  • exonicParts and intronicParts to extract non-overlapping exonic/intronic parts
  • extendExonsIntoIntrons to extend exons into their adjacent introns
  • coverageByTranscript to get the coverage by transcript (or CDS) of a set of ranges

The select function also works to extract specific info from a TxDb objects.
See help("select-methods", package = "GenomicFeatures") and the paragraph 8 on Annotation packages.

6.4 GRanges methods

Most if not all functions defined for IRanges objects are also defined for GRanges and GRangesList objects and they are generally seqnames- and strand-aware.
These include:

  • intra-range operations (shift, etc. ; see paragraph 3.1 and help('intra-range-methods',"GenomicRanges")
  • inter-range operations (reduce, etc. ; see paragraph 3.2 and help('inter-range-methods',"GenomicRanges")
  • set operations (see paragraph 3.3 and help('setops-methods',"GenomicRanges")
  • nearest methods (nearest, etc. ; see paragraph 3.4 and help('nearest-methods',"GenomicRanges")
  • between ranges (“overlaps”) operations presented below in paragraph 6.5

Here, we briefly illustrate some of these functions:

genes
## GRanges object with 2 ranges and 2 metadata columns:
##               seqnames            ranges strand |    EntrezId      Symbol
##                  <Rle>         <IRanges>  <Rle> | <character> <character>
##   FBgn0031208    chr2L         7529-9484      + |       33155     CG11023
##   FBgn0085359     chrX 18962306-18962925      - |     2768869     CG34330
##   -------
##   seqinfo: 2 sequences from dm3 genome
genes2=Dmg[c(1:2,21:22,36:37)]
genes2
## GRanges object with 6 ranges and 1 metadata column:
##               seqnames            ranges strand |     gene_id
##                  <Rle>         <IRanges>  <Rle> | <character>
##   FBgn0000003    chr3R   2648220-2648518      + | FBgn0000003
##   FBgn0000008    chr2R 18024494-18060346      + | FBgn0000008
##   FBgn0000052    chr2L   6041178-6045970      - | FBgn0000052
##   FBgn0000053    chr2L   7014861-7023940      - | FBgn0000053
##   FBgn0000084     chrX 20065478-20067552      - | FBgn0000084
##   FBgn0000092     chrX   2586765-2587919      - | FBgn0000092
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome
sort(genes2)
## GRanges object with 6 ranges and 1 metadata column:
##               seqnames            ranges strand |     gene_id
##                  <Rle>         <IRanges>  <Rle> | <character>
##   FBgn0000052    chr2L   6041178-6045970      - | FBgn0000052
##   FBgn0000053    chr2L   7014861-7023940      - | FBgn0000053
##   FBgn0000008    chr2R 18024494-18060346      + | FBgn0000008
##   FBgn0000003    chr3R   2648220-2648518      + | FBgn0000003
##   FBgn0000092     chrX   2586765-2587919      - | FBgn0000092
##   FBgn0000084     chrX 20065478-20067552      - | FBgn0000084
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome
c(genes,genes2,ignore.mcols=T) #combine
## GRanges object with 8 ranges and 0 metadata columns:
##               seqnames            ranges strand
##                  <Rle>         <IRanges>  <Rle>
##   FBgn0031208    chr2L         7529-9484      +
##   FBgn0085359     chrX 18962306-18962925      -
##   FBgn0000003    chr3R   2648220-2648518      +
##   FBgn0000008    chr2R 18024494-18060346      +
##   FBgn0000052    chr2L   6041178-6045970      -
##   FBgn0000053    chr2L   7014861-7023940      -
##   FBgn0000084     chrX 20065478-20067552      -
##   FBgn0000092     chrX   2586765-2587919      -
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome
intersect(genes,c(genes2,Dmg['FBgn0031208'])) #set operations
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]    chr2L 7529-9484      +
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome
nearest(genes,genes2) #nearest-methods
## [1] NA  5
nearest(genes,genes2,ignore.strand=T) #strand-aware by default
## [1] 3 5
precede(genes,genes2,ignore.strand=T)
## [1] 3 5
promoters(genes,upstream=200,downstream=1) #intra-range operations
## GRanges object with 2 ranges and 2 metadata columns:
##               seqnames            ranges strand |    EntrezId      Symbol
##                  <Rle>         <IRanges>  <Rle> | <character> <character>
##   FBgn0031208    chr2L         7329-7529      + |       33155     CG11023
##   FBgn0085359     chrX 18962925-18963125      - |     2768869     CG34330
##   -------
##   seqinfo: 2 sequences from dm3 genome
reduce(Dmt[[2]]) #inter-range operations
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr2R 18024494-18060346      +
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

See also help('GenomicRanges-comparison',"GenomicRanges") for other functions for comparing GenomicRanges.

6.5 Overlaps between ranges

A very common task on genomic ranges is to search for overlaps between sets of genomic ranges which corresponds to an operation between ranges.
These functions are defined for several classes of objects, including IRanges, GRanges and their list-type counterparts IRangesList and GRangesList but also Views and ViewsList among others.
Details on function definitions can be found using ?'findOverlaps-methods'.

As first example, let’s count how many of the transcription start sites (TSS) in the Drosophila melanogaster genome are located at more than 500bp from another gene:

Dm_tss = unlist(reduce(promoters(Dmt, upstream = 0, downstream = 1))) #get all TSS
cov_tss_g500 = countOverlaps(Dm_tss, Dmg+500) #strand-aware!
table(cov_tss_g500)
## cov_tss_g500
##     1     2     3     4     5     6     7     8     9 
## 15586  4488   490   116    65    47    18     6     3
sum(cov_tss_g500>1)
## [1] 5233
cov_tss_g500_bs = countOverlaps(Dm_tss, Dmg+500, ignore.strand=TRUE) #both strands
sum(cov_tss_g500_bs>1)
## [1] 10768

Getting the corresponding overlaps with findOverlaps:

fov_tss_g500_bs=findOverlaps(Dm_tss,Dmg+500,ignore.strand=T)
Dmg[c(1,1383)]
## GRanges object with 2 ranges and 1 metadata column:
##               seqnames          ranges strand |     gene_id
##                  <Rle>       <IRanges>  <Rle> | <character>
##   FBgn0000003    chr3R 2648220-2648518      + | FBgn0000003
##   FBgn0011904    chr3R 2648685-2648757      - | FBgn0011904
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Now, imagine we have a set of 10K NGS reads:

set.seed(0)
randomreads2L=GRanges(seqnames="chr2L",
                    ranges=IRanges(start=floor(runif(10000,5000,50000)),
                                   width=100),
                    strand="*")

And we want to select only the reads overlapping with these two genes:

sort(Dmg)[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##               seqnames      ranges strand |     gene_id
##                  <Rle>   <IRanges>  <Rle> | <character>
##   FBgn0031208    chr2L   7529-9484      + | FBgn0031208
##   FBgn0263584    chr2L 21952-24237      + | FBgn0263584
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

We could use:

subsetByOverlaps(randomreads2L, sort(Dmg)[1:2])
## GRanges object with 1017 ranges and 0 metadata columns:
##          seqnames      ranges strand
##             <Rle>   <IRanges>  <Rle>
##      [1]    chr2L   7780-7879      *
##      [2]    chr2L 22284-22383      *
##      [3]    chr2L 22101-22200      *
##      [4]    chr2L 22375-22474      *
##      [5]    chr2L 22207-22306      *
##      ...      ...         ...    ...
##   [1013]    chr2L 22674-22773      *
##   [1014]    chr2L 23550-23649      *
##   [1015]    chr2L 22940-23039      *
##   [1016]    chr2L 22316-22415      *
##   [1017]    chr2L   8623-8722      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

or:

randomreads2L[overlapsAny(randomreads2L, sort(Dmg)[1:2])]
## GRanges object with 1017 ranges and 0 metadata columns:
##          seqnames      ranges strand
##             <Rle>   <IRanges>  <Rle>
##      [1]    chr2L   7780-7879      *
##      [2]    chr2L 22284-22383      *
##      [3]    chr2L 22101-22200      *
##      [4]    chr2L 22375-22474      *
##      [5]    chr2L 22207-22306      *
##      ...      ...         ...    ...
##   [1013]    chr2L 22674-22773      *
##   [1014]    chr2L 23550-23649      *
##   [1015]    chr2L 22940-23039      *
##   [1016]    chr2L 22316-22415      *
##   [1017]    chr2L   8623-8722      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

To count the number of reads overlapping with these 2 genes we use the GenomicAlignments and the SummarizedExperiment packages that are further detailed in the paragraph 7.3 below:

assays(summarizeOverlaps(sort(Dmg)[1:2], randomreads2L, mode="Union"))$counts
##             reads
## FBgn0031208   465
## FBgn0263584   552

7 GenomicAlignments

Once NGS reads have been aligned to a reference genome, we obtain a sam/bam file containing the results of the alignment.
A SAM file contains the same information as the corresponding BAM file but BAM is a binary format and thus its size is reduced and its content is not human-readable.
Outside of R, these files are often manipulated with samtools or Picard tools for example.
A nice explanation of what SAM/BAM files contain can be found here.

Here, we will mainly illustrate the following libraries:

  • Rsamtools which provides an interface to samtools, bcftools and tabix (see htslib)
  • GenomicAlignments which provides efficient tools to manipulate short genomic alignments

Examples of BAM files from single- and paired-read sequencing are provided in the pasillaBamSubset.

We load the libraries if necessary:

require(SummarizedExperiment, quietly = TRUE)
require(GenomicAlignments, quietly = TRUE)
require(Rsamtools, quietly = TRUE)
require(pasillaBamSubset, quietly = TRUE)

7.1 Importing alignments from a BAM file

We will see a first example of import using the Rsamtools package for info but most of the time you will want to use the GenomicAlignment package instead because it generates objects very similar to GRanges.

7.1.1 Single-end reads

The path for the BAM file is obtained with:

sr_bamFile=untreated1_chr4() # from passilaBamSubset package

If the BAM file was not indexed yet (i.e. a .bai file present in the same directory and named as the .bam file), we could build such an index using:

indexBam(sr_bamFile) #do not run, the file is actually already indexed

Note that a number of other functions are available to manipulate SAM/BAM files, such as asSam/asBam, sortBam or mergeBam (See ?scanBam for details) but these manipulations are often done directly with samtools, outside of R.

Now, we could define:

  • which regions of the genome we would like to import (‘which’} (note that here only chr4 is available in the BAM file but for the example we try to extract data from chr2L also)
  • which columns of the BAM file we would like to import (‘what’)
  • and possibly filters for unwanted reads (‘flag’). These informations are stored in a ScanBamParam object:
#which is an IRangesList so there is not strand information:
which = IRangesList("chr2L"=IRanges(7000,10000),
                    "chr4"=IRanges(c(75000,1190000),c(85000,1203000)))

scanBamWhat() #available fields
##  [1] "qname"       "flag"        "rname"       "strand"      "pos"        
##  [6] "qwidth"      "mapq"        "cigar"       "mrnm"        "mpos"       
## [11] "isize"       "seq"         "qual"        "groupid"     "mate_status"
what = c("rname","strand","pos","qwidth","seq")
flag=scanBamFlag(isDuplicate=FALSE)
param=ScanBamParam(which=which,what=what,flag=flag)

See ?ScanBamParam for other options and examples.
As briefly illustrated, it is easy to define filters, e.g. based on SAM flags using scanBamFlag or on tags present in the SAM/BAM files, when importing your BAM files.

Now, we use scanBam function from Rsamtools to import the data in R:

mysrbam=scanBam(sr_bamFile,param=param)
class(mysrbam)
## [1] "list"
names(mysrbam)
## [1] "chr2L:7000-10000"     "chr4:75000-85000"     "chr4:1190000-1203000"
sapply(mysrbam, lengths)["rname",] #number of imported reads
##     chr2L:7000-10000     chr4:75000-85000 chr4:1190000-1203000 
##                    0                  304                  736

The resulting object is a list which is not always easy to manipulate for downstream applications so below we will rather illustrate the use of the GenomicAlignments package.
Nevertheless, Rsamtools may be useful if you want e.g. to work on a specific genomic region across several BAM files (see help(BamViews, package = Rsamtools)) or work on a very large BAM file in chunks (see help(BamFile, package = "Rsamtools")

So let’s see how to import our BAM file using the readGAlignments from the GenomicAlignments package:

mysrbam2=readGAlignments(sr_bamFile,
                         param=ScanBamParam(which=which,
                                            what="seq",
                                            flag=flag))
mysrbam2[1:2]
## GAlignments object with 2 alignments and 1 metadata column:
##       seqnames strand             cigar    qwidth     start       end     width
##          <Rle>  <Rle>       <character> <integer> <integer> <integer> <integer>
##   [1]     chr4      + 21M13615N50M55N4M        75     72990     86734     13745
##   [2]     chr4      - 4M13615N50M55N21M        75     73007     86751     13745
##           njunc |                     seq
##       <integer> |          <DNAStringSet>
##   [1]         2 | AAAAACTGCA...CGTAGCCACA
##   [2]         2 | ATACCTGTGA...TGGACGGCTG
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Note that we have redefined the ScanBamParam. This is because readGAlignments comes with predefined fields to import and we just need to add those extra fields we want to import in the ScanBamParam parameter object.
Here we imported the sequences to show that they are imported as a DNAStringSet but it is generally not necessary to keep these sequences once the reads have been mapped.

So we don’t keep them for the next steps:

mysrbam2=readGAlignments(sr_bamFile,
                         param=ScanBamParam(which=which))
mysrbam2[1:2]
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand             cigar    qwidth     start       end     width
##          <Rle>  <Rle>       <character> <integer> <integer> <integer> <integer>
##   [1]     chr4      + 21M13615N50M55N4M        75     72990     86734     13745
##   [2]     chr4      - 4M13615N50M55N21M        75     73007     86751     13745
##           njunc
##       <integer>
##   [1]         2
##   [2]         2
##   -------
##   seqinfo: 8 sequences from an unspecified genome

The object returned is a GAlignments (see ?'GAlignments-class' for details and accessors).

These objects are highly similar to GRanges. They can be accessed with similar functions:

head(start(mysrbam2))
## [1] 72990 73007 73007 73007 73007 73007
head(width(mysrbam2))
## [1] 13745 13745 13745 13745 13745 13745
seqnames(mysrbam2)
## factor-Rle of length 1040 with 1 run
##   Lengths: 1040
##   Values : chr4
## Levels(8): chr2L chr2R chr3L chr3R chr4 chrM chrX chrYHet
cigar(mysrbam2)[1:3]
## [1] "21M13615N50M55N4M" "4M13615N50M55N21M" "4M13615N50M55N21M"
head(njunc(mysrbam2))
## [1] 2 2 2 2 2 2

They can be converted to GRanges:

granges(mysrbam2)[1:2]
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]     chr4 72990-86734      +
##   [2]     chr4 73007-86751      -
##   -------
##   seqinfo: 8 sequences from an unspecified genome

And one can easily access the details of each read alignment as GRanges organized in a GRangesList object using:

grglist(mysrbam2)[[1]] #only first read shown element here
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]     chr4 72990-73010      +
##   [2]     chr4 86626-86675      +
##   [3]     chr4 86731-86734      +
##   -------
##   seqinfo: 8 sequences from an unspecified genome

junctions(mysrbam2)[[1]] #and the corresponding junctions
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]     chr4 73011-86625      +
##   [2]     chr4 86676-86730      +
##   -------
##   seqinfo: 8 sequences from an unspecified genome

See below and in help('junctions-methods',"GenomicAlignments"), help('findOverlaps-methods',"GenomicAlignments") and help('coverage-methods',"GenomicAlignments") for other methods defined on GAlignments objects.

7.1.2 Paired-end reads

An example of paired-end data is available in the pasillaBamSubset package:

pr_bamFile=untreated3_chr4() # from passilaBamSubset package

We can import this data in R using:

myprbam=readGAlignmentPairs(pr_bamFile,
                              param=ScanBamParam(which=which))
myprbam[1:2]
## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
##       seqnames strand :      ranges --        ranges
##          <Rle>  <Rle> :   <IRanges> --     <IRanges>
##   [1]     chr4      - : 13711-13747 --   74403-77053
##   [2]     chr4      * : 74403-77053 -- 955236-964043
##   -------
##   seqinfo: 8 sequences from an unspecified genome

The GAlignmentPairs class holds only read pairs (reads with no mate or with ambiguous pairing are discarded).
The readGalignmentPairs function has a strandMode argument to specify how to report the strand of a pair.
For stranded protocols, depending how the libraries were generated, strandMode should be set to 1 (the default for e.g. directional Illumina protocol by ligation) or 2 (e.g. for dUTP or Illumina stranded TruSeq PE protocol).

The individual reads can be accessed as GAlignments objects using:

myprbam[1] #first record
## GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
##       seqnames strand :      ranges --      ranges
##          <Rle>  <Rle> :   <IRanges> --   <IRanges>
##   [1]     chr4      - : 13711-13747 -- 74403-77053
##   -------
##   seqinfo: 8 sequences from an unspecified genome
first(myprbam[1]) #first sequenced fragment
## GAlignments object with 1 alignment and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]     chr4      -         37M        37     13711     13747        37
##           njunc
##       <integer>
##   [1]         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome
last(myprbam[1]) #last sequenced fragment
## GAlignments object with 1 alignment and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]     chr4      +  33M2614N4M        37     74403     77053      2651
##           njunc
##       <integer>
##   [1]         1
##   -------
##   seqinfo: 8 sequences from an unspecified genome

We could also use the readGAlignmentsList function which returns both mate-pairs and non-mates in a more classical “list-like” structure (see help('readGAlignmentsList', package="GenomicAlignments") for examples).

7.2 Typical operations on GAlignments objects

7.2.1 Coverage

Computing a coverage (number of reads aligning at a given position on the genome) on a GAlignments is straightforward:

cvg_sr = coverage(mysrbam2)
cvg_sr$chr4
## integer-Rle of length 1351857 with 1266 runs
##   Lengths:  72989     17      4   2092     75 ...      5   6175     49 143448
##   Values :      0      1      9      0      1 ...      1      0      1      0

The result is an RleList organized by chromosomes.

We can extract specific regions from this object using a GRanges object.
For example, let’s get the 1kb promoters of genes located within the region that we have defined earlier on Chr4:

#Regions of interest:
which_chr4_gr = GRanges(c("chr4:75000-85000:*",
                          "chr4:1190000-1203000:*"))
#promoters in these regions
prom_chr4 = subsetByOverlaps(
              promoters(genes(TxDb.Dmelanogaster.UCSC.dm3.ensGene), upstream =1000, downstream = 0),
              which_chr4_gr)

Now extract the coverage only on these promoters:

prom_chr4_coverage = cvg_sr[prom_chr4]
names(prom_chr4_coverage) = names(prom_chr4)
prom_chr4_coverage
## RleList of length 3
## $FBgn0002521
## integer-Rle of length 1000 with 1 run
##   Lengths: 1000
##   Values :    0
## 
## $FBgn0004859
## integer-Rle of length 1000 with 16 runs
##   Lengths:  72   1  74  51  17  29   1  13  17  15  29  14  17  59  75 516
##   Values :   1   2   1   0   1   2   3   2   3   4   3   2   1   0   1   0
## 
## $FBgn0264617
## integer-Rle of length 1000 with 14 runs
##   Lengths: 778   5   2  20  48   5   2  20   5  45  14  16  37   3
##   Values :   0   1   2   3   4   3   2   1   0   1   2   3   2   3

We could also use Views to do the same operation but that will only work for a single chromosome:

Views(cvg_sr$chr4, ranges(prom_chr4))
## Views on a 1351857-length Rle subject
## 
## views:
##       start     end width
## [1] 1202272 1203271  1000 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]
## [2]   77668   78667  1000 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [3]   76123   77122  1000 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]

Note that the coverage function is not strand specific.
If we want the coverage for the minus strand only, we could use:

coverage(mysrbam2[strand(mysrbam2)=="-"])$chr4
## integer-Rle of length 1351857 with 724 runs
##   Lengths:  73006      4   3895     75     86 ...      5   6175     49 143448
##   Values :      0      8      0      1      0 ...      1      0      1      0

Often, we need to resize or shift the reads when computing the coverage.
This can be done on the GAlignment object using e.g. resize or shift.
For example with resize (which by default is strand-aware):

coverage(resize(granges(mysrbam2), width = 300, fix = "start"))$chr4 
## integer-Rle of length 1351857 with 1330 runs
##   Lengths:  72989    300   1813    300    157 ...      5   5924    300 143448
##   Values :      0      1      0      1      0 ...      1      0      1      0

The shift option also exist in the coverage function for GAlignments:

#shift directly in coverage !not strand-aware!:
coverage(mysrbam2, shift=150)$chr4 
## integer-Rle of length 1351857 with 1266 runs
##   Lengths:  73139     17      4   2092     75 ...      5   6175     49 143298
##   Values :      0      1      9      0      1 ...      1      0      1      0
#shift is not strand-aware but we can make it strand-aware like this:
coverage(mysrbam2, shift=150*as.numeric(paste0(strand(mysrbam2), 1)))$chr4 
## Warning in S4Vectors:::V_recycle(arg, x, argname, "x"): 'length(x)' is not a
## multiple of 'NROW(shift)'
## integer-Rle of length 1351857 with 1376 runs
##   Lengths:  72856      4    279     17      4 ...      5   6175     49 143298
##   Values :      0      4      0      1      5 ...      1      0      1      0

7.2.2 Find regions of high/low signal

In ChIP-seq experiments, one often look for “peaks” (or broad regions in e.g. ChIP of some chromatin marks or PolII) in the read coverage. This task, often called “peak calling” or “peak finding” can be performed outside of R with one of the numerous “peak callers” available such as MACS, SPP or SICER (for broad regions).

A number of specific and advanced tools also exist in R to perform peak calling such as: chipseq, bumphunter, CSAR, or PICS.

Here, we only illustrate a naive approach to get a list of peaks by applying a single threshold on the coverage object using the slice function:

slice(cvg_sr, lower=60)$chr4
## Views on a 1351857-length Rle subject
## 
## views:
##     start   end width
## [1] 80870 80903    34 [60 61 62 63 64 65 65 65 65 64 61 62 63 63 65 65 64 ...]
## [2] 80929 80932     4 [60 60 60 60]
## [3] 80934 80936     3 [60 61 61]
## [4] 80972 81017    46 [61 61 64 64 65 65 65 71 71 71 72 72 72 71 70 70 70 ...]

7.3 Counting reads / read summarization

Another common task is to count how many reads align to a set of genomic features.
This counting operation is sometimes called read summarization or data reduction.
It is typically done in RNA-seq experiments to produce a count matrix for subsequent analysis.
The counts can be obtained on e.g. genes, transcripts or exons depending on the aim of the study.
Some packages such as Rsubread, QuasR or easyRNAseq provide specific functions to perform the counting step.
Here we only present the general function summarizeOverlaps from GenomicAlignments.
We want to count the reads aligning on the exons. So we first get the exons, organized by genes, which are located in our regions of interest on chr4:

exbygn_chr4 = subsetByOverlaps(exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene,
                                     by="gene"), 
                               which_chr4_gr)

Then we use the summarizeOverlaps function to obtain the counts (Note that the counting is strand-aware by default!):

count_res = summarizeOverlaps(exbygn_chr4, mysrbam2, mode="Union")
count_res
## class: RangedSummarizedExperiment 
## dim: 3 1 
## metadata(0):
## assays(1): counts
## rownames(3): FBgn0002521 FBgn0004859 FBgn0264617
## rowData names(0):
## colnames(1): reads
## colData names(2): object records
assays(count_res)$counts
##             reads
## FBgn0002521   346
## FBgn0004859    12
## FBgn0264617   117

To count reads from multiple BAM files we just need to enter the path to the BAM files as a character vector:

count_res2=summarizeOverlaps(exbygn_chr4,
                             c(sr_bamFile,pr_bamFile),
                             mode="Union")

assays(count_res2)$counts
##             untreated1_chr4.bam untreated3_chr4.bam
## FBgn0002521                 346                 210
## FBgn0004859                 200                  96
## FBgn0264617                 117                  53

When counting reads over regions of interest there are different choices to make. Different count modes are available, as illustrated in the documentation of HTSeq

Different modes of read counting (from HTSeq doc)
Different modes of read counting (from HTSeq doc)

The result of summarizeOverlaps (here count_res) is a SummarizedExperiment object defined in the SummarizedExperiment package and introduced in (Huber et al. 2015).
In aSummarizedExperiment object, the features (i.e. rows) are ranges of interest and the columns are samples. One or more assays represented by matrix-like objects can be stored for these ranges x samples, along with metadata for the rows (rowData), columns (colData) and the experiments (exptData), as illustrated below.

Structure of the SummarizedExperiment objects
Structure of the SummarizedExperiment objects

There are several options available to perform read counting with summarizeOverlaps, such as to preprocess the reads before counting.
See ?summarizeOverlaps}` for details and examples.
The count matrix could then be used for normalization and differential analysis with the DESeq2, edgeR, limma, DEXseq packages for example.

8 Annotation packages in Bioconductor

8.1 Types of annotation packages

There are a number of annotation packages in Bioconductor which can be browsed in the corresponding BioC View.
Annotations are provided as AnnotationDb objects and the AnnotationDbi package provides a common interface to these objects.
The vignette of the AnnotationDbi provides a good introduction to the different types of package available, illustrated below.

AnnotationDb packages
AnnotationDb packages

There are gene-centric packages:

And genome-centric packages:

In addition, OrganismDb packages were added based on the OrganismDbi package. These packages basically combine all the above packages in a single package. These are available for Homo.sapiens, Mus.musculus and Rattus.norvegicus.

Packages are available in Bioconductor to build your own annotation packages. These include AnnotationForge and GenomicFeatures and txdbmaker

The AnnotationHub package also provides an interface to access a large resource of genomic files, which can be used as extra “annotations”.

To get more information about annotations in Bioconductor, take a look at the annotation workflow.

8.2 Accessing annotations

There are 4 basic functions to access nearly all annotations:

  • The column function shows the fields from which annotations can be extracted
  • The keytypes function shows the field from which annotations can be extracted AND which can be used as keys to access these annotations
  • The keys function retrieves the keys themselves (i.e. values from a ‘keytype’ field used to access the annotation of the corresponding feature)
  • The select function extracts the data from the AnnotationDb object from the columns specified and providing a set of keys of a given keytype.

We illustrate these functions on some annotation packages.

First, we explore the OrgDb package which contains gene-level annotations:

## library(org.Dm.eg.db)
columns(org.Dm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "FLYBASE"     
## [11] "FLYBASECG"    "FLYBASEPROT"  "GENENAME"     "GENETYPE"     "GO"          
## [16] "GOALL"        "MAP"          "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [21] "PMID"         "REFSEQ"       "SYMBOL"       "UNIPROT"
## help("PATH")
## keytypes(org.Dm.eg.db) #same as columns(org.Dm.eg.db) in this case
uniKeys = keys(org.Dm.eg.db, keytype="UNIPROT")[c(13286, 13288)]

cols = c("SYMBOL","GO")
select(org.Dm.eg.db,
       keys=uniKeys,
       columns=cols,
       keytype="UNIPROT")
##    UNIPROT  SYMBOL         GO EVIDENCE ONTOLOGY
## 1   Q94513 BEAF-32 GO:0000785      IDA       CC
## 2   Q94513 BEAF-32 GO:0003677      IDA       MF
## 3   Q94513 BEAF-32 GO:0003677      IEA       MF
## 4   Q94513 BEAF-32 GO:0005515      IPI       MF
## 5   Q94513 BEAF-32 GO:0005634      IBA       CC
## 6   Q94513 BEAF-32 GO:0005634      IDA       CC
## 7   Q94513 BEAF-32 GO:0005705      IDA       CC
## 8   Q94513 BEAF-32 GO:0006325      IDA       BP
## 9   Q94513 BEAF-32 GO:0006325      IMP       BP
## 10  Q94513 BEAF-32 GO:0006357      IBA       BP
## 11  Q94513 BEAF-32 GO:0006357      IDA       BP
## 12  Q94513 BEAF-32 GO:0043035      IDA       MF
## 13  Q94513 BEAF-32 GO:0048477      IMP       BP
## 14  Q7JN06 BEAF-32 GO:0000785      IDA       CC
## 15  Q7JN06 BEAF-32 GO:0003677      IDA       MF
## 16  Q7JN06 BEAF-32 GO:0003677      IEA       MF
## 17  Q7JN06 BEAF-32 GO:0005515      IPI       MF
## 18  Q7JN06 BEAF-32 GO:0005634      IBA       CC
## 19  Q7JN06 BEAF-32 GO:0005634      IDA       CC
## 20  Q7JN06 BEAF-32 GO:0005705      IDA       CC
## 21  Q7JN06 BEAF-32 GO:0006325      IDA       BP
## 22  Q7JN06 BEAF-32 GO:0006325      IMP       BP
## 23  Q7JN06 BEAF-32 GO:0006357      IBA       BP
## 24  Q7JN06 BEAF-32 GO:0006357      IDA       BP
## 25  Q7JN06 BEAF-32 GO:0043035      IDA       MF
## 26  Q7JN06 BEAF-32 GO:0048477      IMP       BP

The GO.db package can be used to retrieve information on the identified Gene Ontology categories (chosen also based on Evidence codes):

## library(GO.db)
mygos=c("GO:0006325","GO:0006357", "GO:0048477")
select(GO.db,
       columns=columns(GO.db)[2:4],
       keys=mygos,
       keytype="GOID")
## 'select()' returned 1:1 mapping between keys and columns
##         GOID ONTOLOGY                                             TERM
## 1 GO:0006325       BP                           chromatin organization
## 2 GO:0006357       BP regulation of transcription by RNA polymerase II
## 3 GO:0048477       BP                                        oogenesis

We can also extract a whole table as a data.frame:

## ls("package:GO.db")
toTable(GOTERM)[1:3,2:4]
##        go_id                                                  Term Ontology
## 1 GO:0000001                             mitochondrion inheritance       BP
## 2 GO:0000002                      mitochondrial genome maintenance       BP
## 3 GO:0000006 high-affinity zinc transmembrane transporter activity       MF

We can also search for a specific pattern in the keys:

keys(org.Dm.eg.db, 
     keytype="SYMBOL", 
     pattern="EcR")
## [1] "EcR"    "DopEcR"
select(org.Dm.eg.db,
       keys=c("EcR","DopEcR"),
       columns=c("ENTREZID","FLYBASE","GENENAME"),
       keytype="SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
##   SYMBOL ENTREZID     FLYBASE                      GENENAME
## 1    EcR    35540 FBgn0000546             Ecdysone receptor
## 2 DopEcR    38539 FBgn0035538 Dopamine/Ecdysteroid receptor

Finally, TxDb packages can also be interrogated using the same methods:

select(TxDb.Dmelanogaster.UCSC.dm3.ensGene,
       columns = c("TXID", "TXCHROM", "TXSTRAND", "TXSTART", "TXEND"),
       keys = c("FBgn0039183", "FBgn0264342", "FBgn0030583"),
       keytype = 'GENEID')
## 'select()' returned 1:1 mapping between keys and columns
##        GENEID  TXID TXCHROM TXSTRAND  TXSTART    TXEND
## 1 FBgn0039183 19327   chr3R        + 20159145 20162971
## 2 FBgn0264342   559   chr2L        +  4647051  4647920
## 3 FBgn0030583 25509    chrX        + 14730758 14731363

9 Import/export of genomic data

9.1 The rtracklayer package

The main functionality of the rtracklayer package (Lawrence, Gentleman, and Carey 2009) is to import/export genomic files in a number of different standard formats.
These include GFF, BED, wig, bigWig and bedGraph.
Additionally, rtracklayer also provides functions to interact with genome browsers and in particular the UCSC genome browser (not illustrated here).

rtracklayer has an import and an export function which will work on most files with their standard file extension.
In addition, depending on the file type, other options will be available, for example to import data only from specific genomic regions or to import the object as an RleList or a GRanges object.

9.2 The AnnotationHub package

The AnnotationHub allows to easily retrieve a number of public datasets and annotation tracks as standard Bioconductor objects such as GRanges and GRangesList.
Note however that many datasets have been pre-processed so the user relies on others for these steps. The datasets available come from e.g. ENSEMBL, NCBI, UCSC, ENCODE or Inparanoid.
We will not illustrate the usage of this package but if you’re looking for a specific dataset or instead to collect many datasets on your favorite model/tissue/etc. then it is worth exploring the vignettes of the AnnotationHub package.

References

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. Orchestrating high-throughput genomic analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.
Lawrence, M., R. Gentleman, and V. Carey. 2009. rtracklayer: an R package for interfacing with genome browsers.” Bioinformatics 25 (14): 1841–42.
Lawrence, M., W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan, and V. J. Carey. 2013. Software for computing and annotating genomic ranges.” PLoS Comput. Biol. 9 (8): e1003118.