It is important to recognize that a large fraction of the data that we manipulate in genomics consists of genomic intervals.
A genomic interval is defined by:
+
, -
or *
for undefined)In addition, the genomic interval can have metadata attached to it, such as a score, a GC% or any number of quantitative or qualitative data.
Thus:
ChIP-seq peaks, gene/exons/enhancer or any other annotations, localization of a motif or a particular sequence are obvious examples of genomic intervals.
But a read aligned to the genome, or a coverage along the genome can also be represented by genomic intervals.
Indeed, many file formats directly contain genomic intervals:
gtf/gff
files for annotationsbed/bigBed
or narrowPeak
and other ENCODE files for intervalsbam/sam
files or paf
files for aligned sequencesvcf
files for variantsAdditionally, file formats containing coverages such as bedgraph
, wig
or bigwig
files can also be represented as genomic intervals associated with a specific score.
Here are some questions that typically correspond to range-based queries:
Which peaks are unique to one condition and not present in the other (comparison of 2 ChIP-seq datasets)?
=> setdiff(peaks_condition1, peaks_condition2)
(Set operations)
Which genes overlap with my ChIP-seq peaks?
=> findOverlaps(peaks, genes)
(Overlap query)
What are the distances between my SNPs and the nearest transcription start site (TSS)?
=> nearest(snps, tss)
(Nearest-neighbor query)
Which exons are fully contained with a given genomic region?
=> subsetByOverlaps(exons, region, type = "within")
(Containment query)
Are their any enhancers with 5kb upstream of gene TSS?
=> findOverlaps(enhancer, promoters(gene, upstream=5000, downstream=0))
(Proximity-based overlap)
What are the regions not covered by my ChIP-seq peaks?
=> gaps(peaks)
(Gap calculation)
Which genetic variants intersect with coding regions?
=> intersect(variants, coding_regions)
(Set operations)
What are the sequences located within 250bp of my ChIP-seq peak summits?
=> getSeq(myBSgenome, resize(peaks, width=500, fix = "center"))
(Window normalization)
When manipulating genomic intervals, one of the most confusing aspect
(together with dealing with poorly formatted gtf/gff files 😉)
is linked to the existence of two coordinate systems: 0-based or 1-based.
The figure below, extracted from this post summarizes the two systems:
To summarize:
In Bioconductor, the 1-based system is always used.
However:
bed
files use the 0-based systemgtf/gff
, sam/bam
or vcf
files use the 1-based systemThus it is important, when you import/export your data to/from R/Bioconductor, that you use functions that are aware of these different standards.
We will see in paragraph 9 that the rtracklayer package provides an interface to import/export most genomic file formats.
Bioconductor implements a number of tools to manipulate and analyze ranges and specifically genomic ranges (Lawrence et al. 2013)
Let’s start with the simpler objects defined in the IRanges package.
An IRanges object is defined as follow:
A simple IRanges:
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## A 1 4 4
## B 10 10 1
## C 20 19 0
A bigger IRanges:
set.seed(123) #For reproducibility
start = floor(runif(10000, 1, 1000))
end = start + floor(runif(10000, 0, 100))
ir = IRanges(start, end)
ir
## IRanges object with 10000 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 788 820 33
## [3] 409 496 88
## [4] 883 915 33
## [5] 940 952 13
## ... ... ... ...
## [9996] 466 493 28
## [9997] 899 984 86
## [9998] 114 125 12
## [9999] 571 596 26
## [10000] 900 966 67
IRanges accessors:
length(ir)
## [1] 10000
ir[1:4]
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 788 820 33
## [3] 409 496 88
## [4] 883 915 33
start(ir[1:4])
## [1] 288 788 409 883
width(ir[1:4])
## [1] 32 33 88 33
names(eg)
## [1] "A" "B" "C"
Other useful methods for IRanges:
c(ir[1:2],ir[5:6]) #combining
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 788 820 33
## [3] 940 952 13
## [4] 46 81 36
sort(ir[1:4])
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 409 496 88
## [3] 788 820 33
## [4] 883 915 33
rank(ir[1:4],ties="first")
## [1] 1 3 2 4
mid(ir[1:4]) # midpoints
## [1] 303 804 452 899
tile(ir[1:2],n=2) #returns an IRangesList (see below)
## IRangesList object of length 2:
## [[1]]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 303 16
## [2] 304 319 16
##
## [[2]]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 788 803 16
## [2] 804 820 17
rep(ir[1:2],each=2)
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 288 319 32
## [3] 788 820 33
## [4] 788 820 33
isNormal(ir[1:4])
## [1] FALSE
isNormal(sort(ir[1:4])) #see ?'Ranges-class' for Normality definition
## [1] TRUE
isDisjoint(ir[1:4])
## [1] TRUE
match(ir[1:4],ir[4:1]) #see ?'Ranges-comparison' for Ranges comparison methods
## [1] 4 3 2 1
ir[1:4]>ir[4:1] # compares ranges a) first by start and b) then by width
## [1] FALSE TRUE FALSE TRUE
Other methods creating IRanges:
as(c(2:10,8,90:100),"IRanges") #from a vector of integers
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 2 10 9
## [2] 8 8 1
## [3] 90 100 11
successiveIRanges(width=rep(10,5),gap=10)
## IRanges object with 5 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 10 10
## [2] 21 30 10
## [3] 41 50 10
## [4] 61 70 10
## [5] 81 90 10
whichAsIRanges(c(19, 5, 0, 8, 5)>=5) #transforms a logical vector in IRanges
## NormalIRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 2 2
## [2] 4 5 2
It can be convenient to group ranges in a list (e.g. exons grouped by genes). IRangesList
objects serve this purpose.
Accessors and functions for IRanges generally work on IRangesList objects.
## IRanges object with 96 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 321 321 1
## [2] 600 600 1
## [3] 184 184 1
## [4] 297 297 1
## [5] 276 276 1
## ... ... ... ...
## [92] 188 188 1
## [93] 308 308 1
## [94] 289 289 1
## [95] 936 936 1
## [96] 669 669 1
## IntegerList of length 100
## [["1"]] 321 600 184 297 276 816 87 729 407 ... 72 858 52 85 188 308 289 936 669
## [["2"]] 915 576 706 235 678 647 451 138 744 ... 638 66 979 740 300 869 433 645
## [["3"]] 457 415 336 774 487 787 587 352 400 ... 264 60 607 292 709 418 552 102
## [["4"]] 253 75 429 4 785 24 464 869 433 ... 224 827 240 175 459 130 146 681 267
## [["5"]] 498 812 977 120 991 583 959 931 532 ... 311 157 871 538 209 8 37 39 443
## [["6"]] 372 298 241 61 363 351 847 37 916 ... 571 910 929 656 71 839 223 34 602
## [["7"]] 753 198 674 584 850 250 962 14 543 ... 457 175 731 758 953 212 551 153
## [["8"]] 895 803 784 358 366 536 530 323 811 ... 350 959 592 651 487 406 895 53
## [["9"]] 318 478 991 747 769 626 332 436 548 ... 114 546 928 69 415 219 568 546
## [["10"]] 467 674 225 194 464 87 494 364 957 ... 706 267 938 259 685 700 435 937
## ...
## <90 more elements>
## 1 2 3 4 5 6
## 96 83 108 95 84 110
These operations apply on each range of an IRanges object:
ir[1:2]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 788 820 33
shift(ir[1:2],shift=10)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 298 329 32
## [2] 798 830 33
resize(ir[1:2],width=100,fix="start")
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 387 100
## [2] 788 887 100
flank(ir[1:2],width=100,start=T)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 188 287 100
## [2] 688 787 100
narrow(ir[1:2],start=1,width=30) #here 'start' is relative
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 317 30
## [2] 788 817 30
ir[1:2]+10
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 278 329 52
## [2] 778 830 53
ir[1:2]-10
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 298 309 12
## [2] 798 810 13
ir[1:2]+c(0,10)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 778 830 53
ir[1:4]*-10 ; ir[1:4]*10 # acts like a centered zoom
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 144 463 320
## [2] 639 968 330
## [3] 13 892 880
## [4] 734 1063 330
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 302 304 3
## [2] 803 805 3
## [3] 449 456 8
## [4] 898 900 3
ir[1:2]*c(1,2) #zoom second range by 2X
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 319 32
## [2] 796 811 16
See help('intra-range-methods',package="IRanges")
for more.
To illustrate these operations, we first build a function to plot ranges (from the IRanges vignette):
plotRanges <- function(x,
xlim = x,
main = deparse(substitute(x)),
col = "black", sep = 0.5,
cextitle=1, cexaxis=1, xaxis=T,...)
{
height <- 1
if (is(xlim, "Ranges"))
xlim <- c(min(start(xlim)), max(end(xlim)))
bins <- disjointBins(IRanges(start(x), end(x) + 1))
plot.new()
plot.window(xlim, c(0, max(bins)*(height + sep)))
ybottom <- bins * (sep + height) - height
rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col = col, ...)
if (xaxis)
(axis(1,cex.axis=cexaxis,padj=1))
title(main,cex.main=cextitle)
}
These inter-range operations, called endomorphisms, apply on a set of ranges and return a set of ranges:
#select some ranges in [1:100]:
irs=ir[which(start(ir)<=100 &
end(ir)<=100)[c(3:4,8,14,17:18)]]
irs
## IRanges object with 6 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 25 44 20
## [2] 46 60 15
## [3] 53 65 13
## [4] 9 33 25
## [5] 1 50 50
## [6] 87 96 10
reduce(irs)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 65 65
## [2] 87 96 10
disjoin(irs)
## IRanges object with 10 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 8 8
## [2] 9 24 16
## [3] 25 33 9
## [4] 34 44 11
## [5] 45 45 1
## [6] 46 50 5
## [7] 51 52 2
## [8] 53 60 8
## [9] 61 65 5
## [10] 87 96 10
gaps(irs)
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 66 86 21
coverage(irs)
## integer-Rle of length 96 with 11 runs
## Lengths: 8 16 9 11 1 5 2 8 5 21 10
## Values : 1 2 3 2 1 2 1 2 1 0 1
See help('inter-range-methods',package="IRanges")
for other methods.
We can illustrate these methods with:
par(mfrow=c(5,1), mar=rep(3,4))
plotRanges(irs,xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(reduce(irs),xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(disjoin(irs),xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(gaps(irs),xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plot(1:100,c(coverage(irs),rep(0,4)),type="l",axes=F,xlab="",ylab="",lwd=3)
title(main="coverage",cex.main=2)
axis(side=2,lwd=2,cex.axis=2,at=0:3,labels=0:3)
axis(1,lwd=2,cex.axis=2,padj=1)
(#fig:fig_inter-range_ops)Inter-range operations.
See help(’setops-methods’,package="IRanges")
for details.
The union
and intersect
functions for IRanges:
union(irs[1:3],irs[4:6])
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 65 65
## [2] 87 96 10
intersect(irs[1:3],irs[4:6])
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 25 44 20
## [2] 46 50 5
par(mfrow=c(4,1))
plotRanges(irs[1:3], xlim=c(0,100), xaxis=T,
cextitle=2, cexaxis=1.5)
plotRanges(irs[4:6], xlim=c(0,100), xaxis=T,
cextitle=2, cexaxis=1.5)
plotRanges(union(irs[1:3], irs[4:6]), xlim=c(0,100), xaxis=T,
cextitle=2, cexaxis=1.5)
plotRanges(intersect(irs[1:3], irs[4:6]), xlim=c(0,100), xaxis=T,
cextitle=2, cexaxis=1.5)
(#fig:fig_IRanges_setops_unionintersect)Inter-range operations.
The setdiff
function for asymmetric differences:
setdiff(irs[1:3],irs[4:6])
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 51 65 15
setdiff(irs[4:6],irs[1:3])
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 24 24
## [2] 45 45 1
## [3] 87 96 10
par(mfrow=c(4,1))
plotRanges(irs[1:3],xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(irs[4:6],xlim=c(0,100),xaxis=T, cextitle=2, cexaxis=1.5)
plotRanges(setdiff(irs[1:3],irs[4:6]),xlim=c(0,100),xaxis=T,
cextitle=2, cexaxis=1.5)
plotRanges(setdiff(irs[4:6],irs[1:3]),xlim=c(0,100),xaxis=T,
cextitle=2, cexaxis=1.5)
(#fig:fig_IRanges_setops_setdiff)Asymetric differences with setdiff on IRanges.
The same functions can be applied in a parallel fashion (i.e. on the first elements of the provided IRanges, then on the second elements, etc.)
punion(irs[1:2],irs[4:5]) #element-wise (aka "parallel") union
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 9 44 36
## [2] 1 60 60
pintersect(irs[1:2],irs[4:5])
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 25 33 9
## [2] 46 50 5
psetdiff(irs[1:3],irs[4:6]) # asymmetric! difference
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 34 44 11
## [2] 51 60 10
## [3] 53 65 13
pgap(irs[1:3],irs[4:6])
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 34 33 0
## [2] 51 50 0
## [3] 66 86 21
See help(’nearest-methods’,package="IRanges")
for details and examples:
We will se other examples on GRanges objects in paragraph 6.4
These are mainly methods to find overlaps between ranges.
Examples of using these functions on GRanges objects are provided in paragraph 6.5
See also help(’findOverlaps-methods’,package="IRanges")
for details.
Run-length encoding is a data compression method
highly adapted to long vectors containing repeated values (e.g. coverage on a whole chromosome).
For example, the sequence \(\{1,1,1,1,1,1,2,3,3\}\) can be represented as:
In IRanges, the Rle
class is used to represent run-length encoded (compressed) atomic vectors.
Rle objects:
set.seed(123) #for reproducibility
lambda = c(rep(0.001, 3500), #From IRanges vignette
seq(0.001, 10, length = 500),
seq(10, 0.001, length = 500))
xRle=Rle(rpois(1e4, lambda))
yRle=Rle(rpois(1e4, lambda[c(251:length(lambda), 1:250)]))
xRle
## integer-Rle of length 10000 with 1630 runs
## Lengths: 326 1 1150 1 772 1 395 ... 1 170 1 466 1 262
## Values : 0 1 0 1 0 1 0 ... 1 0 1 0 1 0
yRle
## integer-Rle of length 10000 with 1616 runs
## Lengths: 1984 1 1268 1 15 1 1 ... 1 1 1 8 1 1264
## Values : 0 1 0 1 0 1 0 ... 1 0 1 0 1 0
as.vector(object.size(xRle)/object.size(as.vector(xRle))) #Gain of memory
## [1] 0.3545745
head(runValue(xRle))
## [1] 0 1 0 1 0 1
head(runLength(xRle))
## [1] 326 1 1150 1 772 1
head(start(xRle)) #starts of the runs
## [1] 1 327 328 1478 1479 2251
head(end(xRle)) #ends of the runs
## [1] 326 327 1477 1478 2250 2251
nrun(xRle) #number of runs
## [1] 1630
findRun(as.integer(c(100,200,300,1200)),xRle)
## [1] 1 1 1 3
coverage(irs)
## integer-Rle of length 96 with 11 runs
## Lengths: 8 16 9 11 1 5 2 8 5 21 10
## Values : 1 2 3 2 1 2 1 2 1 0 1
These objects support the basic methods associated with R atomic vectors:
xRle + yRle
## integer-Rle of length 10000 with 2111 runs
## Lengths: 326 1 1150 1 506 1 265 ... 1 170 1 466 1 262
## Values : 0 1 0 1 0 1 0 ... 1 0 1 0 1 0
xRle > 0
## logical-Rle of length 10000 with 211 runs
## Lengths: 326 1 1150 1 772 ... 170 1 466 1 262
## Values : FALSE TRUE FALSE TRUE FALSE ... FALSE TRUE FALSE TRUE FALSE
xRle > yRle
## logical-Rle of length 10000 with 367 runs
## Lengths: 326 1 1150 1 772 ... 170 1 466 1 262
## Values : FALSE TRUE FALSE TRUE FALSE ... FALSE TRUE FALSE TRUE FALSE
max(xRle)
## [1] 18
summary(xRle)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.9932 0.0000 18.0000
sqrt(xRle)
## numeric-Rle of length 10000 with 1630 runs
## Lengths: 326 1 1150 1 ... 1 466 1 262
## Values : 0.00000 1.00000 0.00000 1.00000 ... 1.00000 0.00000 1.00000 0.00000
rev(xRle)
## integer-Rle of length 10000 with 1630 runs
## Lengths: 262 1 466 1 170 1 105 ... 1 772 1 1150 1 326
## Values : 0 1 0 1 0 1 0 ... 1 0 1 0 1 0
table(xRle)
## xRle
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 8200 201 198 202 212 184 167 146 140 101 89 49 51 25 12 6
## 16 17 18
## 7 8 2
union(xRle, yRle)
## integer-Rle of length 20 with 20 runs
## Lengths: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## Values : 0 1 3 2 4 5 6 7 8 9 11 12 10 13 17 14 16 15 18 22
cor(xRle, yRle)
## [1] 0.5690394
See ?’Rle-class’
and ?’Rle-utils’
for other methods.
There are useful functions to perform fixed-width running window summaries:
runmean(xRle, k=100) # See ?'Rle-runstat' for other examples
## numeric-Rle of length 9901 with 1852 runs
## Lengths: 227 100 1051 100 673 100 296 ... 100 71 100 367 100 163
## Values : 0.00 0.01 0.00 0.01 0.00 0.01 0.00 ... 0.01 0.00 0.01 0.00 0.01 0.00
#same result, more flexible but much slower:
Rle(aggregate(xRle, start = 1:(length(xRle)-99), width = 100, FUN = mean))
## numeric-Rle of length 9901 with 1852 runs
## Lengths: 227 100 1051 100 673 100 296 ... 100 71 100 367 100 163
## Values : 0.00 0.01 0.00 0.01 0.00 0.01 0.00 ... 0.01 0.00 0.01 0.00 0.01 0.00
runq(xRle, k=100, i=10) #10th smallest value in windows of 100
## integer-Rle of length 9901 with 41 runs
## Lengths: 3558 7 2 6 4 102 3 ... 2 2 80 59 53 1090
## Values : 0 1 0 1 0 1 2 ... 3 4 3 2 1 0
One typical application of Rle objects is to store the coverage of NGS reads along a chromosome.
The coverage for all chromosomes can be stored in an RleList
object (see ?AtomicList
for details)
in which each chromosome will be an Rle element of the list.
Most functions defined for Rle objects also work on RleList objects.
Practically, any variable defined along a genome can be represented as
RleList
with one Rle
for each chromosome
ormcols
(metadata columns) of a GRanges
object (see below)Sometimes, it is desirable to manipulate several of these variables in the same object (e.g.
for plotting with Gviz).
The ?genomicvars
help page provides useful functions such as
bindAsGRanges
and mcolAsRleList
to switch from one representation to the other.
As we mentioned when looking at the Biostrings package, Views
objects are used
to store a sequence (a Vector object called the “subject”) together with a set
of ranges which define views onto the sequence.
Specific subclasses exist for different classes of “subject” Vectors, such as RleViews
from the IRanges
package and XStringViews
from the Biostrings package.
An RleViews
object stores an Rle
subject and its views:
coverage(irs)
## integer-Rle of length 96 with 11 runs
## Lengths: 8 16 9 11 1 5 2 8 5 21 10
## Values : 1 2 3 2 1 2 1 2 1 0 1
irs_views=Views(coverage(irs),start=c(-5,10,20,90),end=c(10,30,50,100))
irs_views #Views can be out of bound
## Views on a 96-length Rle subject
##
## views:
## start end width
## [1] -5 10 16 [1 1 1 1 1 1 1 1 2 2 ...]
## [2] 10 30 21 [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3]
## [3] 20 50 31 [2 2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 ...]
## [4] 90 100 11 [1 1 1 1 1 1 1 ...]
try(irs_views[[1]]) #but can't be extracted
## Error in getListElement(x, i, ...) : view is out of limits
irs_views[[2]]
## integer-Rle of length 21 with 2 runs
## Lengths: 15 6
## Values : 2 3
start(irs_views)
## [1] -5 10 20 90
See ?’RleViews-class’
for details.
Other ways to create Views:
Views(coverage(irs),irs[c(1,2,6)]) #use an IRanges to extract Views
## Views on a 96-length Rle subject
##
## views:
## start end width
## [1] 25 44 20 [3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2]
## [2] 46 60 15 [2 2 2 2 2 1 1 2 2 2 2 2 2 2 2]
## [3] 87 96 10 [1 1 1 1 1 1 1 1 1 1]
Views(coverage(irs),coverage(irs)>=1) #or a logical Rle
## Views on a 96-length Rle subject
##
## views:
## start end width
## [1] 1 65 65 [1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 ...]
## [2] 87 96 10 [1 1 1 1 1 1 1 1 1 1]
slice(coverage(irs), 3) #use slice
## Views on a 96-length Rle subject
##
## views:
## start end width
## [1] 25 33 9 [3 3 3 3 3 3 3 3 3]
successiveViews(coverage(irs),width=rep(20,4)) #get successive Views
## Views on a 96-length Rle subject
##
## views:
## start end width
## [1] 1 20 20 [1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2]
## [2] 21 40 20 [2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2]
## [3] 41 60 20 [2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2]
## [4] 61 80 20 [1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Furthermore there is a ViewsList
virtual class. Its specialized subclass RleViewsList
is useful
to store coverage vectors along with their specific views over a set of chromosomes.
xyRleList=RleList(xRle,yRle)
xyRleList_views=Views(xyRleList,
IRangesList(IRanges(start=c(3700,4000),width=20),
IRanges(yRle>17)+2))
xyRleList_views[[1]]
## Views on a 10000-length Rle subject
##
## views:
## start end width
## [1] 3700 3719 20 [1 0 2 7 2 9 5 3 3 6 4 4 3 7 5 5 3 5 8 1]
## [2] 4000 4019 20 [ 9 13 13 8 7 14 3 10 16 8 17 5 8 8 15 17 12 ...]
xyRleList_views[[2]]
## Views on a 10000-length Rle subject
##
## views:
## start end width
## [1] 3750 3754 5 [10 13 18 12 14]
## [2] 8250 8254 5 [12 8 22 10 9]
width(xyRleList_views)
## IntegerList of length 2
## [[1]] 20 20
## [[2]] 5 5
Specific functions are provided for fast looping over Views
and ViewsList
objects:
viewMins(irs_views) #same as min(irs_views)
## [1] 1 2 1 1
viewSums(irs_views) #same as sum(irs_views)
## [1] 12 48 70 7
viewWhichMaxs(irs_views) #get the (first) coordinate of viewMaxs
## [1] 9 25 25 90
# (which.max also works)
viewRangeMins(irs_views) #get the (first) range of viewMins
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 8 8
## [2] 9 24 16
## [3] 45 45 1
## [4] 87 96 10
viewApply(irs_views,sd)
## [1] 0.4216370 0.4629100 0.5143113 0.0000000
viewMeans(xyRleList_views)
## NumericList of length 2
## [[1]] 4.15 10.35
## [[2]] 13.4 12.2
Note that the min
, max
, sum
, mean
, which.min
and which.max
functions
also work on Views
(but not on RleViewsList
yet).
The corresponding view* functions might be deprecated in the future.
See ?'view-summarization-methods'
for details.
Here, we will present some of the classes and functions defined in the GenomicRanges package.
This package is central to Bioconductor users who analyze NGS data, so you should consider reading thoroughly the excellent vignettes associated with this package.
The main class defined in GenomicRanges is the GRanges
class which acts as a container for genomic locations and their associated annotations (see ?GRanges
).
GRanges (and GRangesList) build on IRanges (and IRangesList respectively) with the following specificities:
seqnames
(typically chromosomes) and strand
is stored along with the information on ranges (SEW)seqinfo
slot contains information on the sequences: names, length (seqlengths
), circularity and genomemcols
) containing additional information on each range (e.g. score, GC content, etc.) which are stored as a DataFrame
By convention, in Bioconductor genomic coordinates:
Here is the general structure of a GRanges
object:
It contains:
It is important to understand that, despite looking like a table, GRanges object are vector-like, or rather list-like objects.
They have a length
(the number of intervals) but do not have a dim
(or nrow
/ ncol
).
The GRanges
function can be used to create a GRanges object:
genes = GRanges(seqnames = c("chr2L", "chrX"),
ranges = IRanges(start = c(7529, 18962306),
end = c(9484, 18962925),
names = c("FBgn0031208", "FBgn0085359")),
strand = c("+", "-"),
seqlengths = c(chr2L=23011544L, chrX=22422827L))
slotNames(genes)
## [1] "seqnames" "ranges" "strand" "seqinfo"
## [5] "elementMetadata" "elementType" "metadata"
mcols(genes) = DataFrame(EntrezId = c("33155", "2768869"),
Symbol = c("CG11023", "CG34330"))
genome(genes)="dm3" #see ?seqinfo for details
genes
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | EntrezId Symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## FBgn0031208 chr2L 7529-9484 + | 33155 CG11023
## FBgn0085359 chrX 18962306-18962925 - | 2768869 CG34330
## -------
## seqinfo: 2 sequences from dm3 genome
Note that there are different ways to define a GRanges object.
For example, you can also use a genome browser-style syntax to provide the genomic intervals:
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] Chr1 1-100 +
## [2] Chr2 150-200 -
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
The GRanges accessors include IRanges accessors and others:
width(genes)
## [1] 1956 620
names(genes)
## [1] "FBgn0031208" "FBgn0085359"
seqnames(genes)
## factor-Rle of length 2 with 2 runs
## Lengths: 1 1
## Values : chr2L chrX
## Levels(2): chr2L chrX
strand(genes)
## factor-Rle of length 2 with 2 runs
## Lengths: 1 1
## Values : + -
## Levels(3): + - *
ranges(genes)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## FBgn0031208 7529 9484 1956
## FBgn0085359 18962306 18962925 620
genes$Symbol
## [1] "CG11023" "CG34330"
mcols(genes)
## DataFrame with 2 rows and 2 columns
## EntrezId Symbol
## <character> <character>
## FBgn0031208 33155 CG11023
## FBgn0085359 2768869 CG34330
seqinfo(genes)
## Seqinfo object with 2 sequences from dm3 genome:
## seqnames seqlengths isCircular genome
## chr2L 23011544 NA dm3
## chrX 22422827 NA dm3
seqlevels(genes)
## [1] "chr2L" "chrX"
As for IRangesList
, there is a GRangesList
class which allows to store GRanges
in a list-type object.
This is typically used to store e.g. GRanges defining exons arranged by transcripts or genes or GRanges defining transcripts arranged by genes.
The vignette of the GenomicRanges package provides a good example of 2 transcripts, one of which has 2 exons:
gr1 = GRanges(seqnames = "chr2", ranges = IRanges(3, 6),
strand = "+", score = 5L, GC = 0.45)
gr2 = GRanges(seqnames = c("chr1", "chr1"),
ranges = IRanges(c(7, 13), width = 3),
strand = c("+", "-"), score = 3:4, GC = c(0.3, 0.5))
grl = GRangesList("txA" = gr1, "txB" = gr2)
grl
## GRangesList object of length 2:
## $txA
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr2 3-6 + | 5 0.45
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## $txB
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr1 7-9 + | 3 0.3
## [2] chr1 13-15 - | 4 0.5
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
length(grl)
## [1] 2
elementNROWS(grl) #or lengths(grl)
## txA txB
## 1 2
grl["txB","GC"]
## GRangesList object of length 1:
## $txB
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | GC
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 7-9 + | 0.3
## [2] chr1 13-15 - | 0.5
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
unlist(grl)
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## txA chr2 3-6 + | 5 0.45
## txB chr1 7-9 + | 3 0.30
## txB chr1 13-15 - | 4 0.50
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
Please refer to the introductory vignette for further examples.
Most accessors and functions defined for GRanges also work on GRangesList.
However, note that mcols(grl)
now refers to metadata at the list level rather than at level of the individual GRanges
objects.
Here, we will briefly present the Bioconductor annotation packages which can provide genome-wide annotations directly as GRanges and GRangesList objects.
These packages, called TxDb* are databases that contain the annotation from a given organism (i.e. intervals representing genes, transcripts, exons, CDS, etc.).
They can be built using the txdbmaker package,
but TxDb packages are already available for many organisms and genome versions (see the corresponding BioCViews)
The functions to query these objects are defined in the GenomicFeatures package.
Here, we illustrate how to extract information from the TxDb.Dmelanogaster.UCSC.dm3.ensGene package.
Let’s load the package
Loading the package creates/loads the object/database TxDb.Dmelanogaster.UCSC.dm3.ensGene
A GRanges object with the genomic coordinates for the genes can be obtained using:
## GRanges object with 15682 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## FBgn0000003 chr3R 2648220-2648518 + | FBgn0000003
## FBgn0000008 chr2R 18024494-18060346 + | FBgn0000008
## FBgn0000014 chr3R 12632936-12655767 - | FBgn0000014
## FBgn0000015 chr3R 12752932-12797958 - | FBgn0000015
## FBgn0000017 chr3L 16615470-16640982 - | FBgn0000017
## ... ... ... ... . ...
## FBgn0264723 chr3L 12238610-12239148 - | FBgn0264723
## FBgn0264724 chr3L 15327882-15329271 + | FBgn0264724
## FBgn0264725 chr3L 12025657-12026099 + | FBgn0264725
## FBgn0264726 chr3L 12020901-12021253 + | FBgn0264726
## FBgn0264727 chr3L 22065469-22065720 + | FBgn0264727
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
Other functions, such as transcripts
, exons
, cds
, promoters
, terminators
, microRNAs
and tRNAs
defined in the GenomiFeatures package allow to extract the corresponding genomic features.
A GRangesList object with the coordinates of the transcripts arranged by genes can be obtained using:
## GRangesList object of length 15682:
## $FBgn0000003
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr3R 2648220-2648518 + | 17345 FBtr0081624
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
##
## $FBgn0000008
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr2R 18024494-18060339 + | 7681 FBtr0100521
## [2] chr2R 18024496-18060346 + | 7682 FBtr0071763
## [3] chr2R 18024938-18060346 + | 7683 FBtr0071764
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
##
## $FBgn0000014
## GRanges object with 4 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr3R 12632936-12655767 - | 21863 FBtr0306337
## [2] chr3R 12633349-12653845 - | 21864 FBtr0083388
## [3] chr3R 12633349-12655300 - | 21865 FBtr0083387
## [4] chr3R 12633349-12655474 - | 21866 FBtr0300485
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
##
## ...
## <15679 more elements>
The functions exonsBy
, cdsBy
, intronsByTranscript
, fiveUTRsByTranscript
, threeUTRsByTranscript
allow to extract the corresponding genomic features arranged in a GRangesList object.
Other functions such as transcriptsByOverlaps
and exonsByOverlaps
allow to extract genomic features for genomic locations specified by a GRanges object:
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr2L 7529-8116 + | 1
## [2] chr2L 8193-8589 + | 2
## [3] chr2L 8193-9484 + | 3
## [4] chr2L 8229-9484 + | 4
## [5] chr2L 8668-9484 + | 5
## [6] chrX 18962306-18962925 - | 75268
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
When a TxDb package is paired with an appropriate BSgenome object, it is relatively straightforward to extract DNA sequences providing a GRanges or a GRangesList
getSeq(BSgenome.Dmelanogaster.UCSC.dm3, Dmg[1:2])
## DNAStringSet object of length 2:
## width seq names
## [1] 299 TCGACTGGAAGGTTGGCAGCTTC...CCGATATGGTTGGACCACAATCT FBgn0000003
## [2] 35853 CGCGGCGGTCGCATCGGAGTCGA...TATTTACCTGAAAAGCAATATAC FBgn0000008
Dmc = cdsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="tx")
cds_seq = extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, Dmc[1:2])
cds_seq
## DNAStringSet object of length 2:
## width seq names
## [1] 855 ATGGGCGAGCGGGATCAGCCACA...CGGTATGGCAACGAATATATTGA 1
## [2] 1443 ATGGGCGAGCGGGATCAGCCACA...AAATCGTCGACGGAGAGTTGTGA 2
translate(cds_seq)
## AAStringSet object of length 2:
## width seq names
## [1] 285 MGERDQPQSSERISIFNPPVYTQ...EHHDRFNEITQDDKSTVWQRIY* 1
## [2] 481 MGERDQPQSSERISIFNPPVYTQ...FTQSEMLYFRKKMALEIVDGEL* 2
There are several other useful functions defined in the GenomiFeatures package, such as:
transcriptLengths
to extract different metrics on the transcripts (length, number of exons, length of 5’ and 3’ UTRs)exonicParts
and intronicParts
to extract non-overlapping exonic/intronic partsextendExonsIntoIntrons
to extend exons into their adjacent intronscoverageByTranscript
to get the coverage by transcript (or CDS) of a set of rangesThe select
function also works to extract specific info from a TxDb objects.
See help("select-methods", package = "GenomicFeatures")
and the paragraph 8 on Annotation packages.
Most if not all functions defined for IRanges objects are also defined for GRanges and GRangesList objects and they are generally seqnames- and strand-aware.
These include:
shift
, etc. ; see paragraph 3.1 and help('intra-range-methods',"GenomicRanges")
reduce
, etc. ; see paragraph 3.2 and help('inter-range-methods',"GenomicRanges")
help('setops-methods',"GenomicRanges")
nearest
, etc. ; see paragraph 3.4 and help('nearest-methods',"GenomicRanges")
Here, we briefly illustrate some of these functions:
genes
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | EntrezId Symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## FBgn0031208 chr2L 7529-9484 + | 33155 CG11023
## FBgn0085359 chrX 18962306-18962925 - | 2768869 CG34330
## -------
## seqinfo: 2 sequences from dm3 genome
genes2=Dmg[c(1:2,21:22,36:37)]
genes2
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## FBgn0000003 chr3R 2648220-2648518 + | FBgn0000003
## FBgn0000008 chr2R 18024494-18060346 + | FBgn0000008
## FBgn0000052 chr2L 6041178-6045970 - | FBgn0000052
## FBgn0000053 chr2L 7014861-7023940 - | FBgn0000053
## FBgn0000084 chrX 20065478-20067552 - | FBgn0000084
## FBgn0000092 chrX 2586765-2587919 - | FBgn0000092
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
sort(genes2)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## FBgn0000052 chr2L 6041178-6045970 - | FBgn0000052
## FBgn0000053 chr2L 7014861-7023940 - | FBgn0000053
## FBgn0000008 chr2R 18024494-18060346 + | FBgn0000008
## FBgn0000003 chr3R 2648220-2648518 + | FBgn0000003
## FBgn0000092 chrX 2586765-2587919 - | FBgn0000092
## FBgn0000084 chrX 20065478-20067552 - | FBgn0000084
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
c(genes,genes2,ignore.mcols=T) #combine
## GRanges object with 8 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## FBgn0031208 chr2L 7529-9484 +
## FBgn0085359 chrX 18962306-18962925 -
## FBgn0000003 chr3R 2648220-2648518 +
## FBgn0000008 chr2R 18024494-18060346 +
## FBgn0000052 chr2L 6041178-6045970 -
## FBgn0000053 chr2L 7014861-7023940 -
## FBgn0000084 chrX 20065478-20067552 -
## FBgn0000092 chrX 2586765-2587919 -
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
intersect(genes,c(genes2,Dmg['FBgn0031208'])) #set operations
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2L 7529-9484 +
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
nearest(genes,genes2) #nearest-methods
## [1] NA 5
nearest(genes,genes2,ignore.strand=T) #strand-aware by default
## [1] 3 5
precede(genes,genes2,ignore.strand=T)
## [1] 3 5
promoters(genes,upstream=200,downstream=1) #intra-range operations
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | EntrezId Symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## FBgn0031208 chr2L 7329-7529 + | 33155 CG11023
## FBgn0085359 chrX 18962925-18963125 - | 2768869 CG34330
## -------
## seqinfo: 2 sequences from dm3 genome
reduce(Dmt[[2]]) #inter-range operations
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2R 18024494-18060346 +
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
See also help('GenomicRanges-comparison',"GenomicRanges")
for other functions for comparing GenomicRanges.
A very common task on genomic ranges is to search for overlaps between sets of genomic ranges which corresponds to an operation between ranges.
These functions are defined for several classes of objects, including IRanges, GRanges and their list-type counterparts IRangesList and GRangesList but also Views and ViewsList among others.
Details on function definitions can be found using ?'findOverlaps-methods'
.
As first example, let’s count how many of the transcription start sites (TSS) in the Drosophila melanogaster genome are located at more than 500bp from another gene:
Dm_tss = unlist(reduce(promoters(Dmt, upstream = 0, downstream = 1))) #get all TSS
cov_tss_g500 = countOverlaps(Dm_tss, Dmg+500) #strand-aware!
table(cov_tss_g500)
## cov_tss_g500
## 1 2 3 4 5 6 7 8 9
## 15586 4488 490 116 65 47 18 6 3
sum(cov_tss_g500>1)
## [1] 5233
cov_tss_g500_bs = countOverlaps(Dm_tss, Dmg+500, ignore.strand=TRUE) #both strands
sum(cov_tss_g500_bs>1)
## [1] 10768
Getting the corresponding overlaps with findOverlaps
:
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## FBgn0000003 chr3R 2648220-2648518 + | FBgn0000003
## FBgn0011904 chr3R 2648685-2648757 - | FBgn0011904
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
Now, imagine we have a set of 10K NGS reads:
set.seed(0)
randomreads2L=GRanges(seqnames="chr2L",
ranges=IRanges(start=floor(runif(10000,5000,50000)),
width=100),
strand="*")
And we want to select only the reads overlapping with these two genes:
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## FBgn0031208 chr2L 7529-9484 + | FBgn0031208
## FBgn0263584 chr2L 21952-24237 + | FBgn0263584
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
We could use:
## GRanges object with 1017 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2L 7780-7879 *
## [2] chr2L 22284-22383 *
## [3] chr2L 22101-22200 *
## [4] chr2L 22375-22474 *
## [5] chr2L 22207-22306 *
## ... ... ... ...
## [1013] chr2L 22674-22773 *
## [1014] chr2L 23550-23649 *
## [1015] chr2L 22940-23039 *
## [1016] chr2L 22316-22415 *
## [1017] chr2L 8623-8722 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
or:
## GRanges object with 1017 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2L 7780-7879 *
## [2] chr2L 22284-22383 *
## [3] chr2L 22101-22200 *
## [4] chr2L 22375-22474 *
## [5] chr2L 22207-22306 *
## ... ... ... ...
## [1013] chr2L 22674-22773 *
## [1014] chr2L 23550-23649 *
## [1015] chr2L 22940-23039 *
## [1016] chr2L 22316-22415 *
## [1017] chr2L 8623-8722 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
To count the number of reads overlapping with these 2 genes we use the GenomicAlignments and the SummarizedExperiment packages that are further detailed in the paragraph 7.3 below:
## reads
## FBgn0031208 465
## FBgn0263584 552
Once NGS reads have been aligned to a reference genome, we obtain a sam/bam file containing the results of the alignment.
A SAM file contains the same information as the corresponding BAM file but BAM is a binary format and thus its size is reduced and its content is not human-readable.
Outside of R, these files are often manipulated with samtools or Picard tools for example.
A nice explanation of what SAM/BAM files contain can be found here.
Here, we will mainly illustrate the following libraries:
Examples of BAM files from single- and paired-read sequencing are provided in the pasillaBamSubset.
We load the libraries if necessary:
require(SummarizedExperiment, quietly = TRUE)
require(GenomicAlignments, quietly = TRUE)
require(Rsamtools, quietly = TRUE)
require(pasillaBamSubset, quietly = TRUE)
We will see a first example of import using the Rsamtools package for info but most of the time you will want to use the GenomicAlignment package instead because it generates objects very similar to GRanges.
The path for the BAM file is obtained with:
If the BAM file was not indexed yet (i.e. a .bai file present in the same directory and named as the .bam file), we could build such an index using:
Note that a number of other functions are available to manipulate SAM/BAM files, such as asSam
/asBam
, sortBam
or mergeBam
(See ?scanBam
for details) but these manipulations are often done directly with samtools, outside of R.
Now, we could define:
ScanBamParam
object:#which is an IRangesList so there is not strand information:
which = IRangesList("chr2L"=IRanges(7000,10000),
"chr4"=IRanges(c(75000,1190000),c(85000,1203000)))
scanBamWhat() #available fields
## [1] "qname" "flag" "rname" "strand" "pos"
## [6] "qwidth" "mapq" "cigar" "mrnm" "mpos"
## [11] "isize" "seq" "qual" "groupid" "mate_status"
what = c("rname","strand","pos","qwidth","seq")
flag=scanBamFlag(isDuplicate=FALSE)
param=ScanBamParam(which=which,what=what,flag=flag)
See ?ScanBamParam
for other options and examples.
As briefly illustrated, it is easy to define filters, e.g. based on SAM flags using scanBamFlag
or on tags present in the SAM/BAM files, when importing your BAM files.
Now, we use scanBam
function from Rsamtools to import the data in R:
mysrbam=scanBam(sr_bamFile,param=param)
class(mysrbam)
## [1] "list"
names(mysrbam)
## [1] "chr2L:7000-10000" "chr4:75000-85000" "chr4:1190000-1203000"
sapply(mysrbam, lengths)["rname",] #number of imported reads
## chr2L:7000-10000 chr4:75000-85000 chr4:1190000-1203000
## 0 304 736
The resulting object is a list which is not always easy to manipulate for downstream applications so below we will rather illustrate the use of the GenomicAlignments package.
Nevertheless, Rsamtools may be useful if you want e.g. to work on a specific genomic region across several BAM files (see help(BamViews, package = Rsamtools)
) or work on a very large BAM file in chunks (see help(BamFile, package = "Rsamtools"
)
So let’s see how to import our BAM file using the readGAlignments
from the GenomicAlignments package:
mysrbam2=readGAlignments(sr_bamFile,
param=ScanBamParam(which=which,
what="seq",
flag=flag))
mysrbam2[1:2]
## GAlignments object with 2 alignments and 1 metadata column:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr4 + 21M13615N50M55N4M 75 72990 86734 13745
## [2] chr4 - 4M13615N50M55N21M 75 73007 86751 13745
## njunc | seq
## <integer> | <DNAStringSet>
## [1] 2 | AAAAACTGCA...CGTAGCCACA
## [2] 2 | ATACCTGTGA...TGGACGGCTG
## -------
## seqinfo: 8 sequences from an unspecified genome
Note that we have redefined the ScanBamParam. This is because readGAlignments
comes with predefined fields to import and we just need to add those extra fields we want to import in the ScanBamParam
parameter object.
Here we imported the sequences to show that they are imported as a DNAStringSet
but it is generally not necessary to keep these sequences once the reads have been mapped.
So we don’t keep them for the next steps:
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr4 + 21M13615N50M55N4M 75 72990 86734 13745
## [2] chr4 - 4M13615N50M55N21M 75 73007 86751 13745
## njunc
## <integer>
## [1] 2
## [2] 2
## -------
## seqinfo: 8 sequences from an unspecified genome
The object returned is a GAlignments
(see ?'GAlignments-class'
for details and accessors).
These objects are highly similar to GRanges
. They can be accessed with similar functions:
head(start(mysrbam2))
## [1] 72990 73007 73007 73007 73007 73007
head(width(mysrbam2))
## [1] 13745 13745 13745 13745 13745 13745
seqnames(mysrbam2)
## factor-Rle of length 1040 with 1 run
## Lengths: 1040
## Values : chr4
## Levels(8): chr2L chr2R chr3L chr3R chr4 chrM chrX chrYHet
cigar(mysrbam2)[1:3]
## [1] "21M13615N50M55N4M" "4M13615N50M55N21M" "4M13615N50M55N21M"
head(njunc(mysrbam2))
## [1] 2 2 2 2 2 2
They can be converted to GRanges:
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr4 72990-86734 +
## [2] chr4 73007-86751 -
## -------
## seqinfo: 8 sequences from an unspecified genome
And one can easily access the details of each read alignment as GRanges
organized in a GRangesList
object using:
grglist(mysrbam2)[[1]] #only first read shown element here
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr4 72990-73010 +
## [2] chr4 86626-86675 +
## [3] chr4 86731-86734 +
## -------
## seqinfo: 8 sequences from an unspecified genome
junctions(mysrbam2)[[1]] #and the corresponding junctions
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr4 73011-86625 +
## [2] chr4 86676-86730 +
## -------
## seqinfo: 8 sequences from an unspecified genome
See below and in help('junctions-methods',"GenomicAlignments")
, help('findOverlaps-methods',"GenomicAlignments")
and help('coverage-methods',"GenomicAlignments")
for other methods defined on GAlignments
objects.
An example of paired-end data is available in the pasillaBamSubset package:
We can import this data in R using:
## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges -- ranges
## <Rle> <Rle> : <IRanges> -- <IRanges>
## [1] chr4 - : 13711-13747 -- 74403-77053
## [2] chr4 * : 74403-77053 -- 955236-964043
## -------
## seqinfo: 8 sequences from an unspecified genome
The GAlignmentPairs
class holds only read pairs (reads with no mate or with ambiguous pairing are discarded).
The readGalignmentPairs
function has a strandMode argument to specify how to report the strand of a pair.
For stranded protocols, depending how the libraries were generated, strandMode should be set to 1 (the default for e.g. directional Illumina protocol by ligation) or 2 (e.g. for dUTP or Illumina stranded TruSeq PE protocol).
The individual reads can be accessed as GAlignments
objects using:
myprbam[1] #first record
## GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges -- ranges
## <Rle> <Rle> : <IRanges> -- <IRanges>
## [1] chr4 - : 13711-13747 -- 74403-77053
## -------
## seqinfo: 8 sequences from an unspecified genome
first(myprbam[1]) #first sequenced fragment
## GAlignments object with 1 alignment and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr4 - 37M 37 13711 13747 37
## njunc
## <integer>
## [1] 0
## -------
## seqinfo: 8 sequences from an unspecified genome
last(myprbam[1]) #last sequenced fragment
## GAlignments object with 1 alignment and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr4 + 33M2614N4M 37 74403 77053 2651
## njunc
## <integer>
## [1] 1
## -------
## seqinfo: 8 sequences from an unspecified genome
We could also use the readGAlignmentsList
function which returns both mate-pairs and non-mates in a more classical “list-like” structure (see help('readGAlignmentsList', package="GenomicAlignments")
for examples).
Computing a coverage (number of reads aligning at a given position on the genome) on a GAlignments
is straightforward:
## integer-Rle of length 1351857 with 1266 runs
## Lengths: 72989 17 4 2092 75 ... 5 6175 49 143448
## Values : 0 1 9 0 1 ... 1 0 1 0
The result is an RleList
organized by chromosomes.
We can extract specific regions from this object using a GRanges object.
For example, let’s get the 1kb promoters of genes located within the region that we have defined earlier on Chr4:
#Regions of interest:
which_chr4_gr = GRanges(c("chr4:75000-85000:*",
"chr4:1190000-1203000:*"))
#promoters in these regions
prom_chr4 = subsetByOverlaps(
promoters(genes(TxDb.Dmelanogaster.UCSC.dm3.ensGene), upstream =1000, downstream = 0),
which_chr4_gr)
Now extract the coverage only on these promoters:
prom_chr4_coverage = cvg_sr[prom_chr4]
names(prom_chr4_coverage) = names(prom_chr4)
prom_chr4_coverage
## RleList of length 3
## $FBgn0002521
## integer-Rle of length 1000 with 1 run
## Lengths: 1000
## Values : 0
##
## $FBgn0004859
## integer-Rle of length 1000 with 16 runs
## Lengths: 72 1 74 51 17 29 1 13 17 15 29 14 17 59 75 516
## Values : 1 2 1 0 1 2 3 2 3 4 3 2 1 0 1 0
##
## $FBgn0264617
## integer-Rle of length 1000 with 14 runs
## Lengths: 778 5 2 20 48 5 2 20 5 45 14 16 37 3
## Values : 0 1 2 3 4 3 2 1 0 1 2 3 2 3
We could also use Views to do the same operation but that will only work for a single chromosome:
## Views on a 1351857-length Rle subject
##
## views:
## start end width
## [1] 1202272 1203271 1000 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]
## [2] 77668 78667 1000 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [3] 76123 77122 1000 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]
Note that the coverage
function is not strand specific.
If we want the coverage for the minus strand only, we could use:
## integer-Rle of length 1351857 with 724 runs
## Lengths: 73006 4 3895 75 86 ... 5 6175 49 143448
## Values : 0 8 0 1 0 ... 1 0 1 0
Often, we need to resize or shift the reads when computing the coverage.
This can be done on the GAlignment
object using e.g. resize
or shift
.
For example with resize
(which by default is strand-aware):
## integer-Rle of length 1351857 with 1330 runs
## Lengths: 72989 300 1813 300 157 ... 5 5924 300 143448
## Values : 0 1 0 1 0 ... 1 0 1 0
The shift
option also exist in the coverage
function for GAlignments
:
#shift directly in coverage !not strand-aware!:
coverage(mysrbam2, shift=150)$chr4
## integer-Rle of length 1351857 with 1266 runs
## Lengths: 73139 17 4 2092 75 ... 5 6175 49 143298
## Values : 0 1 9 0 1 ... 1 0 1 0
#shift is not strand-aware but we can make it strand-aware like this:
coverage(mysrbam2, shift=150*as.numeric(paste0(strand(mysrbam2), 1)))$chr4
## Warning in S4Vectors:::V_recycle(arg, x, argname, "x"): 'length(x)' is not a
## multiple of 'NROW(shift)'
## integer-Rle of length 1351857 with 1376 runs
## Lengths: 72856 4 279 17 4 ... 5 6175 49 143298
## Values : 0 4 0 1 5 ... 1 0 1 0
In ChIP-seq experiments, one often look for “peaks” (or broad regions in e.g. ChIP of some chromatin marks or PolII) in the read coverage. This task, often called “peak calling” or “peak finding” can be performed outside of R with one of the numerous “peak callers” available such as MACS, SPP or SICER (for broad regions).
A number of specific and advanced tools also exist in R to perform peak calling such as: chipseq, bumphunter, CSAR, or PICS.
Here, we only illustrate a naive approach to get a list of peaks by applying a single threshold on the coverage object using the slice
function:
## Views on a 1351857-length Rle subject
##
## views:
## start end width
## [1] 80870 80903 34 [60 61 62 63 64 65 65 65 65 64 61 62 63 63 65 65 64 ...]
## [2] 80929 80932 4 [60 60 60 60]
## [3] 80934 80936 3 [60 61 61]
## [4] 80972 81017 46 [61 61 64 64 65 65 65 71 71 71 72 72 72 71 70 70 70 ...]
Another common task is to count how many reads align to a set of genomic features.
This counting operation is sometimes called read summarization or data reduction.
It is typically done in RNA-seq experiments to produce a count matrix for subsequent analysis.
The counts can be obtained on e.g. genes, transcripts or exons depending on the aim of the study.
Some packages such as Rsubread, QuasR or easyRNAseq provide specific functions to perform the counting step.
Here we only present the general function summarizeOverlaps
from GenomicAlignments.
We want to count the reads aligning on the exons. So we first get the exons, organized by genes, which are located in our regions of interest on chr4:
exbygn_chr4 = subsetByOverlaps(exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene,
by="gene"),
which_chr4_gr)
Then we use the summarizeOverlaps
function to obtain the counts (Note that the counting is strand-aware by default!):
count_res = summarizeOverlaps(exbygn_chr4, mysrbam2, mode="Union")
count_res
## class: RangedSummarizedExperiment
## dim: 3 1
## metadata(0):
## assays(1): counts
## rownames(3): FBgn0002521 FBgn0004859 FBgn0264617
## rowData names(0):
## colnames(1): reads
## colData names(2): object records
assays(count_res)$counts
## reads
## FBgn0002521 346
## FBgn0004859 12
## FBgn0264617 117
To count reads from multiple BAM files we just need to enter the path to the BAM files as a character vector:
count_res2=summarizeOverlaps(exbygn_chr4,
c(sr_bamFile,pr_bamFile),
mode="Union")
assays(count_res2)$counts
## untreated1_chr4.bam untreated3_chr4.bam
## FBgn0002521 346 210
## FBgn0004859 200 96
## FBgn0264617 117 53
When counting reads over regions of interest there are different choices to make. Different count modes are available, as illustrated in the documentation of HTSeq
The result of summarizeOverlaps
(here count_res
) is a SummarizedExperiment
object defined in the SummarizedExperiment package and introduced in (Huber et al. 2015).
In aSummarizedExperiment
object, the features (i.e. rows) are ranges of interest and the columns are samples. One or more assays
represented by matrix-like objects can be stored for these ranges x samples, along with metadata for the rows (rowData
), columns (colData
) and the experiments (exptData
), as illustrated below.
There are several options available to perform read counting with summarizeOverlaps
, such as to preprocess the reads before counting.
See ?summarizeOverlaps
}` for details and examples.
The count matrix could then be used for normalization and differential analysis with the DESeq2, edgeR, limma, DEXseq packages for example.
There are a number of annotation packages in Bioconductor which can be browsed in the corresponding BioC View.
Annotations are provided as AnnotationDb
objects and the AnnotationDbi package provides a common interface to these objects.
The vignette of the AnnotationDbi provides a good introduction to the different types of package available, illustrated below.
There are gene-centric packages:
OrgDb
class ; e.g. org.Dm.eg.db)ChipDb
class: e.g. [drosophila2.db(https://bioconductor.org/packages/release/data/annotation/html/drosophila2.db.html) and the corresponding probe (drosophila2probe) and cdf packages (drosophila2cdf)OrthologyDb
class ; e.g. Orthology.eg.db) which replaced the InparanoidDb
packages.And genome-centric packages:
TxDb
class ; e.g. TxDb.Dmelanogaster.UCSC.dm3.ensGene)FeatureDb
class, e.g. FDb.UCSC.tRNAs) that can be created with the GenomicFeatures packageIn addition, OrganismDb
packages were added based on the OrganismDbi package. These packages basically combine all the above packages in a single package. These are available for Homo.sapiens, Mus.musculus and Rattus.norvegicus.
Packages are available in Bioconductor to build your own annotation packages. These include AnnotationForge and GenomicFeatures and txdbmaker
The AnnotationHub package also provides an interface to access a large resource of genomic files, which can be used as extra “annotations”.
To get more information about annotations in Bioconductor, take a look at the annotation workflow.
There are 4 basic functions to access nearly all annotations:
column
function shows the fields from which annotations can be extractedkeytypes
function shows the field from which annotations can be extracted AND which can be used as keys to access these annotationskeys
function retrieves the keys themselves (i.e. values from a ‘keytype’ field used to access the annotation of the corresponding feature)select
function extracts the data from the AnnotationDb
object from the columns specified and providing a set of keys of a given keytype.We illustrate these functions on some annotation packages.
First, we explore the OrgDb
package which contains gene-level annotations:
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "FLYBASE"
## [11] "FLYBASECG" "FLYBASEPROT" "GENENAME" "GENETYPE" "GO"
## [16] "GOALL" "MAP" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [21] "PMID" "REFSEQ" "SYMBOL" "UNIPROT"
## help("PATH")
## keytypes(org.Dm.eg.db) #same as columns(org.Dm.eg.db) in this case
uniKeys = keys(org.Dm.eg.db, keytype="UNIPROT")[c(13286, 13288)]
cols = c("SYMBOL","GO")
select(org.Dm.eg.db,
keys=uniKeys,
columns=cols,
keytype="UNIPROT")
## UNIPROT SYMBOL GO EVIDENCE ONTOLOGY
## 1 Q94513 BEAF-32 GO:0000785 IDA CC
## 2 Q94513 BEAF-32 GO:0003677 IDA MF
## 3 Q94513 BEAF-32 GO:0003677 IEA MF
## 4 Q94513 BEAF-32 GO:0005515 IPI MF
## 5 Q94513 BEAF-32 GO:0005634 IBA CC
## 6 Q94513 BEAF-32 GO:0005634 IDA CC
## 7 Q94513 BEAF-32 GO:0005705 IDA CC
## 8 Q94513 BEAF-32 GO:0006325 IDA BP
## 9 Q94513 BEAF-32 GO:0006325 IMP BP
## 10 Q94513 BEAF-32 GO:0006357 IBA BP
## 11 Q94513 BEAF-32 GO:0006357 IDA BP
## 12 Q94513 BEAF-32 GO:0043035 IDA MF
## 13 Q94513 BEAF-32 GO:0048477 IMP BP
## 14 Q7JN06 BEAF-32 GO:0000785 IDA CC
## 15 Q7JN06 BEAF-32 GO:0003677 IDA MF
## 16 Q7JN06 BEAF-32 GO:0003677 IEA MF
## 17 Q7JN06 BEAF-32 GO:0005515 IPI MF
## 18 Q7JN06 BEAF-32 GO:0005634 IBA CC
## 19 Q7JN06 BEAF-32 GO:0005634 IDA CC
## 20 Q7JN06 BEAF-32 GO:0005705 IDA CC
## 21 Q7JN06 BEAF-32 GO:0006325 IDA BP
## 22 Q7JN06 BEAF-32 GO:0006325 IMP BP
## 23 Q7JN06 BEAF-32 GO:0006357 IBA BP
## 24 Q7JN06 BEAF-32 GO:0006357 IDA BP
## 25 Q7JN06 BEAF-32 GO:0043035 IDA MF
## 26 Q7JN06 BEAF-32 GO:0048477 IMP BP
The GO.db package can be used to retrieve information on the identified Gene Ontology categories (chosen also based on Evidence codes):
## library(GO.db)
mygos=c("GO:0006325","GO:0006357", "GO:0048477")
select(GO.db,
columns=columns(GO.db)[2:4],
keys=mygos,
keytype="GOID")
## 'select()' returned 1:1 mapping between keys and columns
## GOID ONTOLOGY TERM
## 1 GO:0006325 BP chromatin organization
## 2 GO:0006357 BP regulation of transcription by RNA polymerase II
## 3 GO:0048477 BP oogenesis
We can also extract a whole table as a data.frame
:
## go_id Term Ontology
## 1 GO:0000001 mitochondrion inheritance BP
## 2 GO:0000002 mitochondrial genome maintenance BP
## 3 GO:0000006 high-affinity zinc transmembrane transporter activity MF
We can also search for a specific pattern in the keys:
## [1] "EcR" "DopEcR"
select(org.Dm.eg.db,
keys=c("EcR","DopEcR"),
columns=c("ENTREZID","FLYBASE","GENENAME"),
keytype="SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENTREZID FLYBASE GENENAME
## 1 EcR 35540 FBgn0000546 Ecdysone receptor
## 2 DopEcR 38539 FBgn0035538 Dopamine/Ecdysteroid receptor
Finally, TxDb
packages can also be interrogated using the same methods:
select(TxDb.Dmelanogaster.UCSC.dm3.ensGene,
columns = c("TXID", "TXCHROM", "TXSTRAND", "TXSTART", "TXEND"),
keys = c("FBgn0039183", "FBgn0264342", "FBgn0030583"),
keytype = 'GENEID')
## 'select()' returned 1:1 mapping between keys and columns
## GENEID TXID TXCHROM TXSTRAND TXSTART TXEND
## 1 FBgn0039183 19327 chr3R + 20159145 20162971
## 2 FBgn0264342 559 chr2L + 4647051 4647920
## 3 FBgn0030583 25509 chrX + 14730758 14731363
The main functionality of the rtracklayer package (Lawrence, Gentleman, and Carey 2009) is to import/export genomic files in a number of different standard formats.
These include GFF, BED, wig, bigWig and bedGraph.
Additionally, rtracklayer also provides functions to interact with genome browsers and in particular the UCSC genome browser (not illustrated here).
rtracklayer has an import
and an export
function which will work on most files with their standard file extension.
In addition, depending on the file type, other options will be available, for example to import data only from specific genomic regions or to import the object as an RleList
or a GRanges
object.
The AnnotationHub allows to easily retrieve a number of public datasets and annotation tracks as standard Bioconductor objects such as GRanges
and GRangesList
.
Note however that many datasets have been pre-processed so the user relies on others for these steps. The datasets available come from e.g. ENSEMBL, NCBI, UCSC, ENCODE or Inparanoid.
We will not illustrate the usage of this package but if you’re looking for a specific dataset or instead to collect many datasets on your favorite model/tissue/etc. then it is worth exploring the vignettes of the AnnotationHub package.