F07-Cloud-Hadoop-BAM


  1. Hadoop-BAM: Directly manipulating BAM on Hadoop
     Aleksi Kallio, CSC - IT Center for Science, Finland
     BOSC 2011, July 16, Vienna
  2. Background
     • Chipster 2.0: seamless integration of analysis tools, computing clusters and visualizations through a user-friendly interface
     • With NGS data, the "seamless" part gets really hard...
     • Use Hadoop to improve the user experience
     • Hadoop-BAM: a small side product that might prove useful for many people
  3. Problem definition
     • Because of NGS instruments, we are in the middle of a data deluge
     • BAM (Binary Alignment/Map) files are a standardized and compact way of storing (aligned) reads [SAMtools]
     • So, what does "data deluge" mean?
       • "A data deluge is a situation where one desperately tries to find space for yet another huge set of BAM (and FASTQ) files."
  4. Problem definition (it gets worse...)
     • You don't only need to store the data, you also have to do something with it
     • Pipelines take a long time to run
     • And in real life you don't run your pipelines once, but often tweak and rerun and rerun...
  5. Enter: Hadoop
     • Map-reduce is a framework for processing terabytes of data in a distributed way
     • Hadoop is an open-source implementation of Google's map-reduce framework
     • NGS data has a lot in common with web logs, which were the original motivation for map-reduce
  6. Map-reduce framework
  7. Hadoop and map-reduce
     • The framework basically implements a distributed sorting algorithm
     • The user has to write "map" and "reduce" functions, nothing else
     • The framework does automatic parallelization and fault tolerance
     • But BAM is not Hadoop-friendly:
       • Binary record format
       • BGZF compression on top of that
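
To make the division of labor concrete, here is a minimal sketch of a plain Hadoop job in the style the slide describes: the user writes only a map and a reduce function (the classic word count over text input), and the framework handles splitting, sorting, shuffling and fault tolerance. This is generic Hadoop, not Hadoop-BAM code.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map: one input record in, any number of (key, value) pairs out;
  // the framework then sorts and groups all pairs by key
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String word : line.toString().split("\\s+"))
        if (!word.isEmpty())
          ctx.write(new Text(word), ONE);
    }
  }

  // reduce: all values for one key arrive together, already grouped
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts)
        sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }
}
```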
  8. Possible solutions
     • Implement your own map-reduce framework
       • Ouch...
     • Convert to a Hadoop-friendly text format
       • Storage size blows up
       • Network speed would become a bottleneck
     • Find a way to cope with BAM files in Hadoop
       • So we have Hadoop-BAM
  9. Hadoop-BAM
     • Small and simple Java library
     • Throw it into your Hadoop installation
     • BAM! Your BAM files are accessible by Hadoop map-reduce functions
  10. What does it do?
      • Gives you the Picard SAM API
      • Hadoop splits data into chunks, and special care is needed at chunk boundaries
      • Hadoop-BAM handles chunk boundaries behind the scenes
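
As a hedged sketch of what the two slides above describe, the job below counts aligned reads per reference sequence: Hadoop-BAM's input format takes care of splitting the BAM file at safe boundaries, and each mapper receives plain Picard SAMRecord objects. The package and class names (fi.tkk.ics.hadoop.bam.BAMInputFormat, SAMRecordWritable) are assumptions based on the SourceForge-era releases; check them against your version of the library.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

import net.sf.samtools.SAMRecord;               // Picard SAM API
import fi.tkk.ics.hadoop.bam.BAMInputFormat;    // assumed class/package name
import fi.tkk.ics.hadoop.bam.SAMRecordWritable; // assumed class/package name

public class ReadsPerChromosome {
  public static class BamMapper
      extends Mapper<LongWritable, SAMRecordWritable, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, SAMRecordWritable rec, Context ctx)
        throws IOException, InterruptedException {
      SAMRecord r = rec.get();                  // a plain Picard record
      if (!r.getReadUnmappedFlag())
        ctx.write(new Text(r.getReferenceName()), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "reads-per-chromosome");
    job.setJarByClass(ReadsPerChromosome.class);
    job.setInputFormatClass(BAMInputFormat.class); // handles BAM chunking
    job.setMapperClass(BamMapper.class);
    job.setReducerClass(IntSumReducer.class);      // sums the 1s per key
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```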
  11. Detecting BAM record boundaries
      • First: BGZF blocks
        • Easy: blocks begin with magic numbers (32 bits)
        • To make checking even more robust, multiple blocks are checked, with backtracking if needed
      • Second: BAM records
        • Harder: no identifiers
        • But various fields cross-reference each other
        • We can detect records with very good accuracy
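
The first, easier step can be illustrated in a few lines. The sketch below scans a byte buffer for a plausible BGZF block start by checking the fixed header bytes: the 32-bit magic the slide mentions, plus the 'B','C' extra subfield from the BGZF specification. It only illustrates the idea and is not Hadoop-BAM's actual code; as the slide notes, a robust detector also verifies following blocks and backtracks on false positives.

```java
public final class BgzfScanner {
  /** Returns true if buf[pos..] looks like the start of a BGZF block. */
  public static boolean looksLikeBgzfBlock(byte[] buf, int pos) {
    if (pos + 18 > buf.length)          // need the full fixed header
      return false;
    // 32-bit magic: gzip ID1, ID2, CM = 8 (deflate), FLG = 4 (FEXTRA)
    if (buf[pos]     != (byte) 0x1f) return false;
    if (buf[pos + 1] != (byte) 0x8b) return false;
    if (buf[pos + 2] != 8)           return false;
    if (buf[pos + 3] != 4)           return false;
    // BGZF extra subfield at fixed offsets: SI1 = 'B', SI2 = 'C', SLEN = 2
    return buf[pos + 12] == 'B' && buf[pos + 13] == 'C'
        && buf[pos + 14] == 2  && buf[pos + 15] == 0;
  }

  /** Scans forward from offset to the first plausible block start, or -1. */
  public static int findBlockStart(byte[] buf, int offset) {
    for (int i = offset; i + 18 <= buf.length; i++)
      if (looksLikeBgzfBlock(buf, i))
        return i;
    return -1;
  }
}
```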
  12. Example: preprocessing for the Chipster genome browser
      • How to allow interactive browsing, with zooming in and out, of large BAM files?
      • Sampling can be used, but it is either slow or inaccurate
      • Preprocess the data and produce summaries at different zoom levels (mipmapping)
      • Implemented on top of Hadoop-BAM
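
To illustrate the mipmapping idea: coverage is first counted in fine-grained bins, then pairs of bins are repeatedly merged into coarser levels, so the browser can pick the level matching the current zoom instead of re-reading the BAM file. The bin layout below is a hypothetical sketch, not Chipster's actual summary format.

```java
import java.util.ArrayList;
import java.util.List;

public final class CoverageMipmaps {
  /**
   * Builds zoom levels from per-bin read counts: level 0 is the finest
   * resolution, and each further level halves it by summing bin pairs.
   */
  public static List<long[]> build(long[] finestBins, int levels) {
    List<long[]> mipmaps = new ArrayList<long[]>();
    mipmaps.add(finestBins);
    long[] cur = finestBins;
    for (int l = 1; l < levels; l++) {
      long[] next = new long[(cur.length + 1) / 2];
      for (int i = 0; i < cur.length; i++)
        next[i / 2] += cur[i];   // two fine bins merge into one coarse bin
      mipmaps.add(next);
      cur = next;
    }
    return mipmaps;
  }
}
```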
  13. Result looks nice (genome browser screenshot)
  14. Benchmarking
      • Take 50 GB of data from the 1000 Genomes Project
      • Run on a cluster of 112 AMD Opteron 2.6 GHz nodes (1,344 cores) with an Infiniband interconnect
  15. Scalability results
  16. Scalability results (cont.)
      • Did sorting and summarizing
      • Fairly nice scaling for the processing step
      • No scaling for import and export
      • Lesson: avoid moving data in and out of Hadoop
      • So having to convert data from BAM to something else would be bad
  17. Future plans
      • Develop or port basic BAM tools to use Hadoop-BAM
      • Tools that work on BAM and BED files
      • Building on top of Hadoop-BAM:
        • Pig query engine
        • Variant detection pipelines
      • Some ideas about doing join operations
        • It's really hard...
  18. Conclusions
      • Cloud computing is not a free lunch: tools, algorithms and data formats need to be adapted
      • Hadoop-BAM library available under the MIT license: http://sourceforge.net/projects/hadoop-bam/
      • Contact: matti.niemenmaa@aalto.fi
  19. Acknowledgements
      • Matti Niemenmaa, André Schumacher, Keijo Heljanko (Aalto University, Department of Information and Computer Science)
      • Petri Klemelä, Eija Korpelainen (CSC - IT Center for Science)
      • TIVIT Cloud Software program for funding
