genomation: summary of genomic intervals

a toolkit to summarize, annotate and visualize genomic intervals

Altuna Akalın*
February 24, 2014

* Presenter. Package developed by Altuna Akalın and Vedran Franke

Outline:
Usage and ubiquity of genomic interval summaries
Using genomation
More information

Quick introduction

genomation is an R package that expedites genomic interval summary and annotation. It has the following features:

1. Annotation of genomic intervals, e.g. see what % of your intervals overlap with exons, introns or promoters.
2. Summary of genomic scores or read coverage over pre-defined regions, e.g. extract the conservation profile over ChIP-seq binding sites (equi-width regions) or CpG islands (non-equi-width regions).
3. Visualization of genomic interval summaries as meta-region plots or heatmaps.
4. Support for multiple file formats, e.g. BAM, BED, bigWig, GFF and generic tabular text files containing chromosome location information.
5. Do all of this in R :)

Genomic interval summaries are widely used

Summaries of genomic intervals are a useful way to communicate high-dimensional data.
Traditionally, regions of interest are picked and the distribution of genomic intervals is summarized over those regions.

Genomic interval summaries are widely used:
Examples from literature

Figure: Erkek, S., et al. (2013). Molecular determinants of nucleosome retention at CpG-rich sequences in mouse spermatozoa. Nature Structural & Molecular Biology


Figure: Stadler, M., Murr, R., Burger, L., et al. (2011). DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature

Utility and futility of average profiles

Figure: average profile around anchor (x-axis: bases, −100 to 100; y-axis: average score, 3.5 to 5.0).

Does this mean all of the windows (viewpoints) have a similar enrichment profile?

Utility and futility of average profiles

Only 1/3 of windows have such enrichment. Be careful when you are interpreting the average profiles.

Figure: heatmap of the individual windows (x-axis: −100 to 100; color scale: 0 to 21).
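
The warning about average profiles can be reproduced with simulated data. The sketch below is a toy illustration in base R, not code from the presentation: the matrix sizes, the Gaussian peak shape and all variable names are assumptions chosen for the example. Only one third of the windows carry a peak, yet the average over all windows still shows clear enrichment at the anchor.

```r
# Toy illustration (simulated data, not from the presentation):
# 300 windows of 201 bp; only the first 100 carry a peak at the anchor.
set.seed(1)
n.windows <- 300
positions <- -100:100
background <- matrix(rnorm(n.windows * length(positions), mean = 3.5, sd = 0.5),
                     nrow = n.windows)
peak <- 4.5 * exp(-(positions^2) / (2 * 20^2))  # Gaussian bump centred on 0
enriched <- 1:100
scores <- background
# add the peak to every column of the enriched windows only
scores[enriched, ] <- sweep(scores[enriched, ], 2, peak, "+")

# Average profiles: all windows, enriched windows, remaining windows
avg.all      <- colMeans(scores)
avg.enriched <- colMeans(scores[enriched, ])
avg.rest     <- colMeans(scores[-enriched, ])

center <- which(positions == 0)
round(c(all = avg.all[center],
        enriched = avg.enriched[center],
        rest = avg.rest[center]), 2)

# A heatmap of the individual windows reveals the two groups
image(positions, seq_len(n.windows), t(scores),
      xlab = "bases", ylab = "windows")
```

The average over all windows is pulled up by the enriched third, which is why a heatmap of the individual windows is a useful companion to any meta-region plot.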


Figure: Lister, R., et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature


Figure: Feng, S. et al. (2010). Conservation and divergence of methylation patterning in plants and animals. PNAS

Issues to keep in mind when developing summary
methods

Genomic data comes in many formats, so we need a method that works with multiple flat file formats.
The method should not be specialized for one type of data set such as read counts; it should also work easily with other scoring schemes (e.g. conservation scores).
Regions of interest are not always equi-width; it should be possible to normalize for length differences by binning.
Multiple visualization options and fast heatmap generation should be available.
Regions should be clustered on the heatmap based on multiple summaries (e.g. binding of different TFs on the same set of regions).
Ease of use: it should not take hours of coding to generate and visualize summaries.
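
The binning idea mentioned above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not genomation's implementation: bin_scores() is a hypothetical helper, and the regions are simulated score vectors. Each region, whatever its width, is reduced to the same number of bins so that regions can be stacked into one matrix.

```r
# Hypothetical helper: summarise a per-base score vector into a fixed
# number of bins (default: mean score per bin)
bin_scores <- function(x, bin.num = 10, bin.op = mean) {
  stopifnot(length(x) >= bin.num)
  # assign each position to one of bin.num roughly equal-sized bins
  bins <- cut(seq_along(x), breaks = bin.num, labels = FALSE)
  as.numeric(tapply(x, bins, bin.op))
}

# Three regions of different widths (simulated scores)
set.seed(42)
regions <- list(short = runif(50), medium = runif(230), long = runif(1000))

# Each region becomes one row with the same number of columns
binned <- t(sapply(regions, bin_scores, bin.num = 10))
dim(binned)
```

After binning, rows are directly comparable even though the underlying regions differ in length, at the cost of losing base-pair resolution.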

Overview of genomation features

Figure: overview of the genomation workflow.
Input (genomic intervals): BAM, BigWig, BED, GFF, tabular text files or GRanges objects.
Summarize: intervals are summarized over regions 1 to m into a ScoreMatrix/ScoreMatrixList object (rows: regions; columns: base-pairs or bins 1 to n).
Visualize: meta-region plots, meta-region heatmaps and heatmaps for genomic interval sets (example panels show TF1 to TF4, in read per million, over base-pairs around the anchor).
Annotate: intervals are annotated using annotation read from BED, GFF, tabular text files or GRanges objects, and the result is shown as pie charts for annotation (Intergenic, Intron, Exon, Promoter).
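
The slides do not show code for the annotation branch of this overview. The sketch below is one way it might look, assuming the functions readTranscriptFeatures(), annotateWithGeneParts(), getTargetAnnotationStats() and plotTargetAnnotation() and the example data sets behave as described in the genomation reference manual; the variable names are chosen for the example.

```r
library(genomation)

# Example data shipped with genomation
data(cage)  # CAGE tag clusters as a GRanges object
bed.file <- system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")

# Read promoters, exons, introns and TSSs from a BED12 transcript file
gene.parts <- readTranscriptFeatures(bed.file)

# Annotate the CAGE clusters with gene parts
cage.annot <- annotateWithGeneParts(cage, gene.parts, intersect.chr = TRUE)

# Percentage of intervals per category, and the matching pie chart
annot.stats <- getTargetAnnotationStats(cage.annot, percentage = TRUE,
                                        precedence = TRUE)
annot.stats
plotTargetAnnotation(cage.annot, main = "CAGE cluster annotation")
```

With precedence = TRUE each interval is counted once, so the percentages add up to 100 and can be drawn as a pie chart.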

Installation of the package and the example data

We can install the package and the data using the install_github() function from the devtools package.
# install dependencies
install.packages( c("data.table","plyr","reshape2","ggplot2",
"gridBase","devtools"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges","rtracklayer","impute","Rsamtools"))
# install the packages
library(devtools)
install_github("genomation", username = "al2na")
# install the data package
# needed for examples
install_github("genomationData", username = "al2na")
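
The commands above reflect the tooling available in February 2014. As a hedged alternative for current R versions, the sketch below assumes that both genomation and genomationData are distributed through Bioconductor and installs them with BiocManager. install_genomation() is a hypothetical wrapper introduced here so that nothing is downloaded until it is called.

```r
# Hypothetical helper (not part of the presentation): installs genomation
# and genomationData from Bioconductor, assuming both are available there.
install_genomation <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(c("genomation", "genomationData"))
}

# install_genomation()  # uncomment to run; needs an internet connection
```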

Data import

Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.

library(genomation)
tab.file1 <- system.file("extdata/tab1.bed", package = "genomation")
readGeneric(tab.file1)
## GRanges with 6 ranges and 0 metadata columns:
##       seqnames             ranges strand
##          <Rle>          <IRanges>  <Rle>
##   [1]    chr21 [9437272, 9439473]      *
##   [2]    chr21 [9483485, 9484663]      *
##   [3]    chr21 [9647866, 9648116]      *
##   [4]    chr21 [9708935, 9709231]      *
##   [5]    chr21 [9825442, 9826296]      *
##   [6]    chr21 [9909011, 9909218]      *
##   ---
##   seqlengths:
##    chr21
##       NA
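
For generic tabular files the column layout usually has to be spelled out. The sketch below assumes the readGeneric() arguments chr, start, end, strand and meta.cols as documented in the genomation manual; the file contents are made-up coordinates written to a temporary file purely for illustration.

```r
library(genomation)

# Write a small tab-separated file (made-up coordinates) with
# chromosome, start, end, strand and a score column, without a header
toy <- data.frame(chr    = c("chr21", "chr21", "chr21"),
                  start  = c(100, 500, 900),
                  end    = c(200, 650, 1000),
                  strand = c("+", "-", "+"),
                  score  = c(1.5, 3.2, 0.7))
toy.file <- tempfile(fileext = ".txt")
write.table(toy, toy.file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Tell readGeneric() which columns hold the location, strand and metadata
gr <- readGeneric(toy.file, chr = 1, start = 2, end = 3, strand = 4,
                  meta.cols = list(score = 5))
gr
```

The result is a GRanges object, so it can be used directly as the target or windows argument of the summary functions.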

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extract data over predefined windows.
ScoreMatrix is used when all of the windows have the same width (e.g. region around TSS).
ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).

data(cage)
data(promoters)
sm <- ScoreMatrix(target = cage, windows = promoters)
sm
## scoreMatrix with dims: 1055 2001
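
The slide shows ScoreMatrix() only. A corresponding ScoreMatrixBin() call might look like the sketch below, which assumes the cpgi example data set and the bin.num, bin.op and weight.col arguments described in the genomation manual. CpG islands have unequal widths, so each one is divided into the same number of bins.

```r
library(genomation)

data(cage)  # CAGE tag clusters with a "tpm" score column
data(cpgi)  # CpG islands of varying width

# Each CpG island is split into 10 bins and the mean score per bin is kept
smb <- ScoreMatrixBin(target = cage, windows = cpgi,
                      bin.num = 10, bin.op = "mean", weight.col = "tpm")
dim(smb)

# The binned matrix can be plotted like any other score matrix
plotMeta(smb, xlab = "bins", main = "CAGE signal over CpG islands")
```

The resulting matrix has one column per bin rather than one per base, which is what makes windows of different length comparable.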

Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions

plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.
# remember the current outer margins so they can be restored afterwards
oldpar <- par(oma = c(0, 0, 0, 0))
heatMatrix(sm, xcoords = c(-1000, 1000))
plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
par(oldpar)

Figure: heatmap of sm and its meta-region plot (x-axis: bases, −1000 to 1000; y-axis of the meta-region plot: average score).

Working with BAM files

BAM files can also be passed to the ScoreMatrix() and ScoreMatrixBin() functions.
bam.file = system.file('tests/test.bam', package='genomation')
windows = GRanges(rep(c(1,2),each=2),
                  IRanges(rep(c(1,2), times=2), width=5))
scores3 = ScoreMatrix(target=bam.file,windows=windows, type='bam')

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores, downloaded from http://goo.gl/fEVu0g
my.bed12.file=system.file("extdata/chr21.refseq.hg19.bed",
                          package = "genomation")
feats=readTranscriptFeatures(my.bed12.file,up.flank=500,down.flank=500)
sm=ScoreMatrix(target="wgEncodeUwDnaseA549RawRep1.bw",
               windows=feats$promoters,type='bigWig',strand.aware=TRUE)
plotMeta(sm,xcoords=c(-500,500),main="DHS around TSS",line.col="blue")

Figure: DHS around TSS (x-axis: bases; y-axis: average score).

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes a ScoreMatrixList object. Here we use the CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.
ctcf.peaks=readRDS("ctcf.peaks.rds")
dataPath = system.file("extdata", package = "genomationData")
bam.files = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4,6)]
sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num = 50, type = "bam")
names(sml)=c("CTCF","P300","Suz12","Rad21","Znf143")
multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis=0.35, common.scale = TRUE,
                col = c("lightgray", "blue"), winsorize=c(0,95))

Figure: heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 (x-axis: −500 to 500; color scale ticks: 0 to 8).

Multiple profiles

multiHeatMatrix() can also apply k-means clustering. Extreme values are trimmed using the "winsorize" argument.
multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans=TRUE, k=3, common.scale = TRUE,
                cex.axis=0.4, col = c("lightgray", "blue"), winsorize=c(0,95))
Figure: heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with rows grouped into clusters 1, 2 and 3 (x-axis: −500 to 500; color scale ticks: 0 to 8).
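
To make the two steps on this slide concrete, here is a minimal base R sketch of winsorizing a score matrix and then clustering its rows with k-means. It is an illustration on simulated data, not genomation's internal code: winsorize_matrix() is a hypothetical helper, and the matrix dimensions and Poisson rates are arbitrary choices.

```r
# Hypothetical helper mimicking what winsorize = c(0, 95) is described to do:
# values outside the given percentiles are set to the percentile values
winsorize_matrix <- function(mat, probs = c(0, 95)) {
  limits <- quantile(mat, probs = probs / 100, na.rm = TRUE)
  mat[mat < limits[1]] <- limits[1]
  mat[mat > limits[2]] <- limits[2]
  mat
}

# Simulated score matrix: 90 regions x 50 bins, three groups of rows
set.seed(7)
mat <- rbind(matrix(rpois(30 * 50, lambda = 1),  nrow = 30),
             matrix(rpois(30 * 50, lambda = 5),  nrow = 30),
             matrix(rpois(30 * 50, lambda = 12), nrow = 30))
mat[1, 1] <- 500  # one extreme outlier that would dominate the color scale

trimmed  <- winsorize_matrix(mat, probs = c(0, 95))
clusters <- kmeans(trimmed, centers = 3, nstart = 10)$cluster

# Reorder rows by cluster before drawing the heatmap
ordered <- trimmed[order(clusters), ]
image(t(ordered), col = colorRampPalette(c("lightgray", "blue"))(50),
      axes = FALSE, xlab = "bins", ylab = "regions (grouped by cluster)")
table(clusters)
```

Trimming first keeps a handful of extreme values from compressing the color scale and from distorting the distances used by k-means.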

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here, we also apply a scaling function to all the matrices.
# take log2 of all matrices
sml2=scaleScoreMatrixList(sml,scalefun=function(x) log2(x+1))
heatMeta(sml2,legend.name="average profiles",xcoords=c(-500, 500),
         xlab="bp around peaks")

Figure: heatmap of average profiles for CTCF, P300, Suz12, Rad21 and Znf143 (x-axis: bp around peaks, −400 to 400; color legend: average profiles).

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().
plotMeta(sml2,profile.names=names(sml2),
         xcoords=c(-500, 500),
         main="mult. profiles")

Figure: mult. profiles, showing average profiles for CTCF, P300, Suz12, Rad21 and Znf143 (x-axis: bases, −400 to 400; y-axis: average score).

Future work...

Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected?
This was previously explored in the GenometriCorr package. This functionality can be included in the form of a dependency.
Performance improvements on certain functions; faster is always better...
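
The question "more than expected?" is commonly answered with a permutation test. The sketch below shows the idea in base R on a single simulated chromosome. It is an assumption-laden toy, not GenometriCorr's method or a planned genomation feature: count_overlaps() is a hypothetical helper, and the site positions, widths and the uniform shuffling scheme are all invented for the example.

```r
# Hypothetical helper: number of intervals in A that overlap at least one
# interval in B (closed intervals on a single chromosome)
count_overlaps <- function(a.start, a.end, b.start, b.end) {
  sum(vapply(seq_along(a.start), function(i) {
    any(a.start[i] <= b.end & a.end[i] >= b.start)
  }, logical(1)))
}

set.seed(123)
chr.len <- 1e6
width   <- 200
tf1.start <- sample(chr.len - width, 200)
# TF2 sites: half placed near TF1 sites, half placed at random
tf2.start <- c(tf1.start[1:100] + sample(-100:100, 100, replace = TRUE),
               sample(chr.len - width, 100))
tf2.start <- pmin(pmax(tf2.start, 1), chr.len - width)

observed <- count_overlaps(tf1.start, tf1.start + width - 1,
                           tf2.start, tf2.start + width - 1)

# Null distribution: place the TF2 sites uniformly at random
null <- replicate(500, {
  shuffled <- sample(chr.len - width, length(tf2.start))
  count_overlaps(tf1.start, tf1.start + width - 1,
                 shuffled, shuffled + width - 1)
})

# Empirical p-value with the usual +1 correction
p.value <- (sum(null >= observed) + 1) / (length(null) + 1)
c(observed = observed, expected = mean(null), p.value = p.value)
```

Uniform shuffling is the simplest possible null model; real genomes have gaps, mappability differences and clustered features, so dedicated packages use more careful schemes.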

Further information

The genomation package is available at http://al2na.github.io/genomation. You can find the link to the vignette on the webpage as well.
Code that generated this presentation is available at http://github.com/al2na/genomation_presentation
Questions and bug reports:
You can view or open issues on GitHub: https://github.com/al2na/genomation/issues?state=open
You can ask questions by sending an e-mail to genomation@googlegroups.com or by using the web interface to Google Groups.

Developed by Altuna Akalın and Vedran Franke

Session Info

sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] C
##
## attached base packages:
## [1] methods   grid      stats     graphics  grDevices utils     datas
## [8] base
##
## other attached packages:
## [1] genomation_0.99.0.2 knitr_1.5
##
## loaded via a namespace (and not attached):
##  [1] BSgenome_1.30.0      BiocGenerics_0.8.0   Biostrings_2.30.0
##  [4] GenomicRanges_1.14.3 IRanges_1.20.5       MASS_7.3-29
##  [7] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.14.1
## [10] XML_3.95-0.2         XVector_0.2.0        bitops_1.0-6
## [13] colorspace_1.2-4     data.table_1.8.10    dichromat_2.0-0
## [16] digest_0.6.3         evaluate_0.5.1       formatR_0.10
## [19] ggplot2_0.9.3.1      gridBase_0.4-6       gtable_0.1.2
