genomation: summary of genomic intervals

genomation is an R package that contains a collection of tools for visualizing and analyzing genome-wide data sets. The package works with a variety of genomic interval file types and enables easy summarization and annotation of high-throughput data sets with given genomic annotations. http://al2na.github.io/genomation/

Transcript

  • 1. genomation: a toolkit to summarize, annotate and visualize genomic intervals. Altuna Akalın, February 24, 2014. Presented by Altuna Akalın; package developed by Altuna Akalın and Vedran Franke. Outline: Usage and ubiquity of genomic interval summaries; Using genomation; More information.
  • 2-6. Quick introduction. genomation is an R package that expedites genomic interval summary and annotation. It has the following features:
    1. Annotation of genomic intervals, e.g. see what % of your intervals overlap with exons/introns/promoters.
    2. Summary of genomic scores or read coverages over pre-defined regions, e.g. extract the conservation profile over ChIP-seq binding sites (equi-width regions) or CpG islands (non-equi-width regions).
    3. Visualization of genomic interval summaries as meta-region plots or heatmaps.
    4. Support for multiple file formats, e.g. BAM, BED, bigWig, GFF and generic tabular text files containing chromosome location information.
    5. Doing all of this in R :)
  • 7. Genomic interval summaries are widely used. Summaries of genomic intervals are a useful way to communicate high-dimensional data. Traditionally, regions of interest are picked and the distribution of genomic intervals is summarized over those regions.
  • 8. Genomic interval summaries are widely used: Examples from literature. Figure: Erkek, S., et al. (2013). Molecular determinants of nucleosome retention at CpG-rich sequences in mouse spermatozoa. Nature Structural & Molecular Biology.
  • 9. Genomic interval summaries are widely used: Examples from literature. Figure: Stadler, M., Murr, R., Burger, L., et al. (2011). DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature.
  • 10. Utility and futility of average profiles. [Figure: average profile around anchor; x-axis: bases (−100 to 100), y-axis: average score.] Does this mean all of the windows (viewpoints) have a similar enrichment profile?
  • 11. Utility and futility of average profiles. [Figure: scores of the individual windows; x-axis: −100 to 100.] Only 1/3 of windows have such enrichment. Be careful when you are interpreting the average profiles.
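The caveat on slides 10 and 11 can be reproduced with simulated data. The following is a minimal base-R sketch, not the data behind the slides: the window count, noise level and bump shape are all assumptions chosen for illustration. It builds a windows × positions matrix in which only one third of the rows carry a signal, then compares the column-wise average with a per-window measure.

```r
set.seed(1)

n_windows <- 300          # assumed number of windows (viewpoints)
positions <- -100:100     # bases around the anchor

# Background: every window gets noisy scores around 4 (assumed values)
scores <- matrix(rnorm(n_windows * length(positions), mean = 4, sd = 0.5),
                 nrow = n_windows)

# Only the first third of the windows receive a bump centred on the anchor
bump <- 3 * exp(-(positions^2) / (2 * 20^2))
enriched <- seq_len(n_windows) <= n_windows / 3
scores[enriched, ] <- sweep(scores[enriched, ], 2, bump, "+")

# The average profile shows a clear peak at the anchor ...
avg_profile <- colMeans(scores)

# ... but the per-window gain (centre minus flanks) shows who contributes to it
centre <- abs(positions) <= 10
flank  <- abs(positions) > 80
window_gain   <- rowMeans(scores[, centre]) - rowMeans(scores[, flank])
frac_enriched <- mean(window_gain > 1.5)

if (interactive()) {
  plot(positions, avg_profile, type = "l",
       xlab = "bases", ylab = "average score")
  hist(window_gain, breaks = 40, main = "per-window gain")
}
```

The average profile alone cannot distinguish "every window is moderately enriched" from "a minority of windows is strongly enriched"; looking at the rows, for example as a heatmap, can.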
  • 12. Genomic interval summaries are widely used: Examples from literature. Figure: Lister, R., et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature.
  • 13. Genomic interval summaries are widely used: Examples from literature. Figure: Feng, S., et al. (2010). Conservation and divergence of methylation patterning in plants and animals. PNAS.
  • 14. Issues to keep in mind when developing summary methods.
    - Genomic data comes in many formats; we need a method that works with multiple flat file formats.
    - The method should not be specialized for one type of data such as read counts; it should work just as easily with other scoring schemes (e.g. conservation scores).
    - Regions of interest are not always equi-width; you should be able to normalize for length differences by binning.
    - Multiple visualization options and fast heatmap generation should be available.
    - It should be possible to cluster regions on the heatmap based on multiple summaries (e.g. binding of different TFs on the same set of regions).
    - Ease of use: it should not take hours of coding to generate and visualize summaries.
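The point about normalizing for length differences by binning can be sketched in a few lines of base R. This is an illustration of the idea only, not genomation's implementation; `bin_scores` is a hypothetical helper, and using the mean as the per-bin summary is an assumption.

```r
# Hypothetical helper: summarize a per-base score vector of any length
# into a fixed number of bins by averaging the scores inside each bin.
bin_scores <- function(scores, n_bins) {
  stopifnot(length(scores) >= n_bins)
  # assign each position to one of n_bins consecutive, near-equal-sized bins
  bin_id <- ceiling(seq_along(scores) * n_bins / length(scores))
  unname(vapply(split(scores, bin_id), mean, numeric(1)))
}

# Two regions of different width ...
short_region <- c(2, 4, 6, 8)
long_region  <- as.numeric(1:10)

# ... become rows of the same length and can be stacked into one matrix
binned <- rbind(bin_scores(short_region, 2),
                bin_scores(long_region, 2))
```

Once every region is reduced to the same number of bins, regions of unequal width can share one matrix and be averaged or drawn as a heatmap like equi-width windows.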
  • 15. Overview of genomation features. [Figure: workflow diagram. Inputs: genomic intervals (BAM, BigWig, BED, GFF, tabular text, GRanges) and annotation (BED, GFF, tabular text, GRanges). Summarize: a ScoreMatrix/ScoreMatrixList object with regions 1..m as rows and base-pairs/bins 1..n as columns. Visualize: meta-region plots, meta-region heatmaps and heatmaps for genomic interval sets (TF1 to TF4, base-pairs around anchor). Annotate: pie charts for annotation (Intergenic, Intron, Exon, Promoter).]
  • 16. Installation of the package and the example data. We can install the package and the data using the install_github() function from the devtools package.
      # install dependencies
      install.packages(c("data.table", "plyr", "reshape2", "ggplot2",
                         "gridBase", "devtools"))
      source("http://bioconductor.org/biocLite.R")
      biocLite(c("GenomicRanges", "rtracklayer", "impute", "Rsamtools"))
      # install the packages
      library(devtools)
      install_github("genomation", username = "al2na")
      # install the data package
      # needed for examples
      install_github("genomationData", username = "al2na")
  • 17. Data import. Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.
      library(genomation)
      tab.file1 <- system.file("extdata/tab1.bed", package = "genomation")
      readGeneric(tab.file1)
      ## GRanges with 6 ranges and 0 metadata columns:
      ##       seqnames             ranges strand
      ##          <Rle>          <IRanges>  <Rle>
      ##   [1]    chr21 [9437272, 9439473]      *
      ##   [2]    chr21 [9483485, 9484663]      *
      ##   [3]    chr21 [9647866, 9648116]      *
      ##   [4]    chr21 [9708935, 9709231]      *
      ##   [5]    chr21 [9825442, 9826296]      *
      ##   [6]    chr21 [9909011, 9909218]      *
      ##   ---
      ##   seqlengths:
      ##    chr21
      ##       NA
  • 18. Extraction of data over pre-defined genomic regions. ScoreMatrix() and ScoreMatrixBin() are functions used to extract data over predefined windows. ScoreMatrix is used when all of the windows have the same width (e.g. region around TSS). ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).
      data(cage)
      data(promoters)
      sm <- ScoreMatrix(target = cage, windows = promoters)
      sm
      ## scoreMatrix with dims: 1055 2001
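To make the shape of such a score matrix concrete, here is a minimal base-R sketch of extracting equal-width windows from a per-base score vector. It is not how genomation implements ScoreMatrix(); `score_matrix`, `centers` and `flank` are hypothetical names, and strand and chromosome handling are left out. Each row is one window and each column one position relative to the window centre, so a flank of 1000 would give 2001 columns, consistent with the dimensions printed on the slide.

```r
# Hypothetical helper: one row per window, one column per relative position
score_matrix <- function(scores, centers, flank) {
  offsets <- -flank:flank
  # drop windows that would run off either end of the score vector
  keep <- centers - flank >= 1 & centers + flank <= length(scores)
  idx <- outer(centers[keep], offsets, "+")   # absolute positions per window
  mat <- matrix(scores[idx], nrow = nrow(idx))
  colnames(mat) <- offsets
  mat
}

# Toy per-base scores along a 100 bp sequence
scores <- as.numeric(1:100)
m <- score_matrix(scores, centers = c(10, 50, 99), flank = 5)

# Column means give the meta-region (average) profile
meta_profile <- colMeans(m)
```

The window centred at 99 is dropped because it would extend past the end of the score vector, so `m` has two rows and eleven columns.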
  • 19. Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions. plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.
      oldpar <- par(oma = c(0, 0, 0, 0))  # set outer margins, keep the old values
      heatMatrix(sm, xcoords = c(-1000, 1000))
      plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
      par(oldpar)  # restore the previous outer margins
    [Figure: heatmap of the score matrix and meta-region plot; x-axis: bases (−1000 to 1000), y-axis: average score.]
  • 20. Working with BAM files. BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.
      bam.file = system.file('tests/test.bam', package='genomation')
      windows = GRanges(rep(c(1,2), each=2),
                        IRanges(rep(c(1,2), times=2), width=5))
      scores3 = ScoreMatrix(target=bam.file, windows=windows, type='bam')
  • 21. Working with bigWig files. ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores, downloaded from http://goo.gl/fEVu0g
      my.bed12.file=system.file("extdata/chr21.refseq.hg19.bed",
                                package = "genomation")
      feats=readTranscriptFeatures(my.bed12.file,up.flank=500,down.flank=500)
      sm=ScoreMatrix(target="wgEncodeUwDnaseA549RawRep1.bw",
                     windows=feats$promoters,type='bigWig',strand.aware=TRUE)
      plotMeta(sm,xcoords=c(-500,500),main="DHS around TSS",line.col="blue")
    [Figure: meta-region plot "DHS around TSS"; y-axis: average score.]
  • 22. Multiple profiles. Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes a ScoreMatrixList object. Here we used the CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.
      ctcf.peaks=readRDS("ctcf.peaks.rds")
      dataPath = system.file("extdata", package = "genomationData")
      bam.files = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4,6)]
      sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num = 50, type = "bam")
      names(sml)=c("CTCF","P300","Suz12","Rad21","Znf143")
      multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis=0.35, common.scale = TRUE,
                      col = c("lightgray", "blue"), winsorize=c(0,95))
    [Figure: heatmaps for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: −500 to 500.]
  • 23. Multiple profiles. multiHeatMatrix() can also apply K-means clustering. Extreme values are trimmed using the "winsorize" argument.
      multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans=TRUE, k=3, common.scale = T,
                      cex.axis=0.4, col = c("lightgray", "blue"), winsorize=c(0,95))
    [Figure: heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with rows grouped into clusters 1 to 3; x-axis: −500 to 500.]
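What trimming extreme values means can be shown with a small base-R sketch. This illustrates winsorization in general and is not taken from genomation's source; `winsorize_scores` is a hypothetical helper, and reading `c(0, 95)` as lower and upper percentiles is an assumption based on the argument values used on the slide.

```r
# Hypothetical helper: clamp values outside the given percentiles
winsorize_scores <- function(x, percentiles = c(0, 95)) {
  limits <- quantile(x, probs = percentiles / 100, na.rm = TRUE, names = FALSE)
  # values below the lower limit are raised to it, values above the upper
  # limit are lowered to it; everything in between is left unchanged
  pmin(pmax(x, limits[1]), limits[2])
}

# One extreme value would otherwise dominate a heatmap color scale
x <- c(1:99, 1000)
w <- winsorize_scores(x, percentiles = c(0, 95))
range(x)
range(w)
```

Without this step a handful of very high scores can stretch the color scale so that most of the heatmap appears uniformly pale.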
  • 24. Multiple profiles. Multiple average profiles can be visualized with heatMeta(). Here, we also apply a scaling function to all the matrices.
      # take log2 of all matrices
      sml2=scaleScoreMatrixList(sml,scalefun=function(x) log2(x+1))
      heatMeta(sml2,legend.name="average profiles",xcoords=c(-500, 500),
               xlab="bp around peaks")
    [Figure: heatmap of average profiles for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bp around peaks.]
  • 25. Multiple profiles. Multiple average profiles can also be visualized with plotMeta().
      plotMeta(sml2,profile.names=names(sml2),
               xcoords=c(-500, 500), main="mult. profiles")
    [Figure: "mult. profiles", line plot for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases, y-axis: average score.]
  • 26. Future work. Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected? This was previously explored in the GenometriCorr package; this functionality could be included in the form of a dependency. Performance improvements for certain functions: faster is always better.
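One simple way to ask whether two sets of sites overlap "more than expected" is a permutation test. The sketch below is base R on simulated positions and is not the GenometriCorr method or anything planned for genomation. The single 1 Mb sequence, the 100 bp distance cutoff, the uniform null model and all names are assumptions for illustration.

```r
set.seed(42)
genome_len <- 1e6                         # assumed single sequence of 1 Mb

# Simulated sites: half of the TF1 sites are placed within 50 bp of a TF2 site
tf2 <- sort(sample.int(genome_len, 200))
tf1 <- c(sample(tf2, 100) + sample(-50:50, 100, replace = TRUE),
         sample.int(genome_len, 100))
tf1 <- pmin(pmax(tf1, 1), genome_len)     # keep positions on the sequence

# Count positions in `a` that lie within max_dist of any position in sorted `b`
count_near <- function(a, b, max_dist = 100) {
  idx   <- findInterval(a, b)             # index of nearest site at or left of a
  left  <- ifelse(idx > 0, a - b[pmax(idx, 1)], Inf)
  right <- ifelse(idx < length(b), b[pmin(idx + 1, length(b))] - a, Inf)
  sum(pmin(left, right) <= max_dist)
}

observed <- count_near(tf1, tf2)

# Null model: TF1 sites placed uniformly at random on the sequence
null_counts <- replicate(500,
  count_near(sample.int(genome_len, length(tf1)), tf2))

# One-sided empirical p-value with the usual +1 correction
p_value <- (sum(null_counts >= observed) + 1) / (length(null_counts) + 1)
```

A uniform null ignores chromosome boundaries, mappability and the clustering of real binding sites, so it tends to overstate significance on real data; a dedicated package handles such null models more carefully.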
  • 27. Further information. The genomation package is available at http://al2na.github.io/genomation. You can find the link to the vignette on the webpage as well. Code that generated this presentation is available at http://github.com/al2na/genomation_presentation. Questions and bug reports: you can view or open issues on GitHub at https://github.com/al2na/genomation/issues?state=open. You can ask questions by sending an e-mail to genomation@googlegroups.com or by using the web interface to Google Groups. Developed by Altuna Akalın and Vedran Franke.
  • 28. Session Info genomation package Altuna Akalın Usage and ubiquity of genomic interval summaries Using genomation More information sessionInfo() ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] C attached base packages: [1] methods grid stats [8] base graphics grDevices utils other attached packages: [1] genomation_0.99.0.2 knitr_1.5 loaded via a namespace (and not attached): [1] BSgenome_1.30.0 BiocGenerics_0.8.0 [4] GenomicRanges_1.14.3 IRanges_1.20.5 [7] RColorBrewer_1.0-5 RCurl_1.95-4.1 [10] XML_3.95-0.2 XVector_0.2.0 [13] colorspace_1.2-4 data.table_1.8.10 [16] digest_0.6.3 evaluate_0.5.1 [19] ggplot2_0.9.3.1 gridBase_0.4-6 Biostrings_2.30.0 MASS_7.3-29 Rsamtools_1.14.1 bitops_1.0-6 dichromat_2.0-0 formatR_0.10 gtable_0.1.2 datas