    F07-Cloud-Hadoop-BAM Presentation Transcript

    • Hadoop-BAM: Directly manipulating BAM on Hadoop
      Aleksi Kallio, CSC - IT Center for Science, Finland
      BOSC 2011, July 16, Vienna
    • Background
      • Chipster 2.0: seamless integration of analysis tools, computing clusters and visualizations through a user-friendly interface
      • With NGS data, the "seamless" part gets really hard...
      • Use Hadoop to improve the user experience
      • Hadoop-BAM: a small side product that may prove useful to many people
    • Problem definition
      • Because of NGS instruments, we are in the middle of a data deluge
      • BAM (Binary Alignment/Map) files are a standardized and compact way of storing (aligned) reads [Samtools]
      • So, what does "data deluge" mean?
        • "Data deluge is a situation where one desperately tries to find space for yet another huge set of BAM (and fastq) files."
    • Problem definition (it gets worse...)
      • You not only need to store the data, you also have to do something with it
      • Pipelines take a long time to run
      • And in real life you don't run your pipelines once, but often tweak and rerun and rerun...
    • Enter: Hadoop
      • Map-reduce is a framework for processing terabytes of data in a distributed way
      • Hadoop is an open-source implementation of Google's map-reduce framework
      • NGS data has a lot in common with web logs, which were the original motivation for map-reduce
    • Map-reduce framework
    • Hadoop and map-reduce
      • The framework basically implements a distributed sorting algorithm
      • The user has to write "map" and "reduce" functions, nothing else
      • The framework does automatic parallelization and fault tolerance
      • But BAM is not Hadoop-friendly:
        • Binary record format
        • BGZF compression on top of that
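The division of labor above (user-supplied map and reduce functions around a framework-provided shuffle/sort) can be shown in miniature. This is a single-process Python sketch of the map-reduce contract, not Hadoop code; the read-counting example and all names are illustrative:

```python
from collections import defaultdict

# map: one input record in, zero or more (key, value) pairs out.
# Here: count aligned reads per reference sequence (chromosome).
def map_fn(read):
    chrom, _pos = read
    yield (chrom, 1)

# reduce: one key plus all its values in, an aggregated result out.
def reduce_fn(chrom, counts):
    return (chrom, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # The framework's part: group map output by key (the "shuffle"),
    # then hand each key and its values to reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return sorted(reduce_fn(k, vs) for k, vs in groups.items())

reads = [("chr1", 100), ("chr2", 5), ("chr1", 230), ("chr1", 999)]
print(run_mapreduce(reads, map_fn, reduce_fn))
# [('chr1', 3), ('chr2', 1)]
```

In real Hadoop the shuffle is a distributed sort across machines, with parallelization and fault tolerance handled by the framework, exactly as the bullet points say; only `map_fn` and `reduce_fn` are user code.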
    • Possible solutions
      • Implement your own map-reduce framework
        • Ouch...
      • Convert to Hadoop-friendly text format
        • Storage size blows up
        • Network speed would become a bottleneck
      • Find a way to cope with BAM files in Hadoop
        • So we have Hadoop-BAM
    • Hadoop-BAM
      • Small and simple Java library
      • Throw it into your Hadoop installation
      • BAM! Your BAM files are accessible by Hadoop map-reduce functions
    • What does it do?
      • Gives you the Picard SAM API
      • Hadoop splits data into chunks, and special care is needed at chunk boundaries
      • Hadoop-BAM handles chunk boundaries behind the scenes
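The chunk-boundary problem is the same one Hadoop's own text input faces: a split usually starts in the middle of a record, so each reader skips forward to the first record boundary after its split start, and reads past the split end to finish the last record it owns. A toy Python sketch of that rule, using newline-delimited records as a stand-in (BAM records have no such separator, which is exactly why Hadoop-BAM needs the boundary detection described on the next slide):

```python
def records_in_split(data, start, end, sep=b"\n"):
    """Yield every record whose first byte lies in [start, end)."""
    # Skip forward: unless we are at the very beginning, the record
    # containing `start` belongs to the previous split's reader.
    if start > 0:
        nl = data.find(sep, start - 1)
        if nl == -1:
            return
        start = nl + 1
    # Read past `end` when needed, to finish the last record we own.
    pos = start
    while pos < end:
        nl = data.find(sep, pos)
        if nl == -1:
            yield data[pos:]
            return
        yield data[pos:nl]
        pos = nl + 1

data = b"aa\nbbbb\ncc\ndd\n"
print(list(records_in_split(data, 0, 7)))   # [b'aa', b'bbbb']
print(list(records_in_split(data, 7, 14)))  # [b'cc', b'dd']
```

Note how the split boundary at byte 7 falls inside `bbbb`, yet each record is produced by exactly one reader; this "a record belongs to the split where it begins" convention is what Hadoop-BAM implements behind the Picard SAM API.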
    • Detecting BAM record boundaries
      • First: BGZF blocks
        • Easy, blocks begin with magic numbers (32 bits)
        • To make checking even more robust, multiple blocks are checked and backtracked if needed
      • Second: BAM records
        • Harder, no identifiers
        • But various fields cross-reference each other
        • We can detect records with very good accuracy
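The first step, recognizing a BGZF block start, can be sketched from the BGZF header layout given in the SAM/BAM specification (gzip magic with FEXTRA set, plus a mandatory "BC" extra subfield carrying the block size). Hadoop-BAM's actual detector is Java and, as the slide notes, checks multiple consecutive blocks and backtracks for robustness; this standalone Python sketch shows only the single-block check, and the function name is made up:

```python
import struct

# BGZF block header layout (from the SAM/BAM specification):
#   ID1=0x1f  ID2=0x8b  CM=8  FLG=4 (FEXTRA set)
#   MTIME (4 bytes)  XFL  OS  XLEN (2 bytes, little-endian)
#   then XLEN bytes of extra subfields; BGZF requires one with
#   SI1='B', SI2='C', SLEN=2, carrying BSIZE = block length - 1.
def looks_like_bgzf_block(buf, offset=0):
    """Check whether buf[offset:] could start a BGZF block."""
    if len(buf) - offset < 12:
        return False
    if buf[offset:offset + 4] != b"\x1f\x8b\x08\x04":
        return False
    (xlen,) = struct.unpack_from("<H", buf, offset + 10)
    pos, end = offset + 12, offset + 12 + xlen
    if end > len(buf):
        return False
    # Scan the extra field for the mandatory 'BC' subfield.
    while pos + 4 <= end:
        si1, si2 = buf[pos], buf[pos + 1]
        (slen,) = struct.unpack_from("<H", buf, pos + 2)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False

# A synthetic BGZF header: magic, zeroed MTIME/XFL/OS, XLEN=6, 'BC' subfield.
header = (b"\x1f\x8b\x08\x04" + b"\x00" * 6
          + struct.pack("<H", 6) + b"BC"
          + struct.pack("<H", 2) + struct.pack("<H", 255))
print(looks_like_bgzf_block(header))                              # True
print(looks_like_bgzf_block(b"\x1f\x8b\x08\x00" + b"\x00" * 20))  # False: plain gzip, no FEXTRA
```

The second step, confirming a BAM record start inside the decompressed stream, has no magic number to lean on, which is why the real implementation cross-checks fields (block size, reference IDs, name length, and so on) against each other instead.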
    • Example: Preprocessing for the Chipster genome browser
      • How to allow interactive browsing, with zooming in and out, for large BAM files?
      • Can use sampling, but it is either slow or inaccurate
      • Preprocess the data and produce summaries at different levels (mipmapping)
      • Implemented on top of Hadoop-BAM
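The mipmapping idea is to precompute read-count summaries at successively coarser resolutions, so the browser can fetch whichever level matches the current zoom instead of rescanning the BAM file. A minimal Python sketch of that idea; the bin sizes, level count and names are illustrative, not Chipster's actual preprocessing code:

```python
from collections import Counter

def build_mipmaps(read_starts, base_bin=1_000, levels=3, factor=10):
    """Read-start counts binned at successively coarser resolutions."""
    mipmaps = []
    counts = Counter(pos // base_bin for pos in read_starts)
    mipmaps.append(dict(counts))
    for _ in range(levels - 1):
        # Each coarser level is built from the finer one by merging
        # `factor` adjacent bins, without touching the raw reads again.
        coarser = Counter()
        for bin_idx, n in counts.items():
            coarser[bin_idx // factor] += n
        mipmaps.append(dict(coarser))
        counts = coarser
    return mipmaps

starts = [1_500, 1_700, 12_000, 950_000]
print(build_mipmaps(starts))
# [{1: 2, 12: 1, 950: 1},   1 kb bins
#  {0: 2, 1: 1, 95: 1},     10 kb bins
#  {0: 3, 9: 1}]            100 kb bins
```

Because each level derives from the previous one, the whole pyramid costs little more than a single pass over the reads, and that pass is what was distributed with Hadoop-BAM.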
    • Result looks nice
    • Benchmarking
      • Take 50 GB of data from 1000 Genomes
      • Run on a cluster of 112 AMD Opteron 2.6 GHz processors (1344 cores) with Infiniband interconnect
    • Scalability results
    • Scalability results (cont.)
      • Did sorting and summarizing
      • Fairly nice scaling for the processing step
      • No scaling for import and export
      • Lesson: avoid moving data in and out of Hadoop
      • So having to convert data from BAM to something else would be bad
    • Future plans
      • Develop or port basic BAM tools to use Hadoop-BAM
      • Tools that work on BAM and BED files
      • Building on top of Hadoop-BAM
        • Pig query engine
        • Variant detection pipelines
      • Some ideas about doing join operations
        • It's really hard...
    • Conclusions
      • Cloud computing is not a free lunch: tools, algorithms and data formats need to be adapted
      • Hadoop-BAM library available under the MIT license: http://sourceforge.net/projects/hadoop-bam/
      • Contact: matti.niemenmaa@aalto.fi
    • Acknowledgements
      • Matti Niemenmaa, André Schumacher, Keijo Heljanko (Aalto University, Department of Information and Computer Science)
      • Petri Klemelä, Eija Korpelainen (CSC - IT Center for Science)
      • TIVIT Cloud Software program for funding