BioPig for scalable analysis of big sequencing data
This talk was adapted from my presentation at the Finishing in the Future 2011, Santa Fe, NM.

  • 1. BioPig: Hadoop-based Analytic Toolkit for Next-Generation Sequence Data. Zhong Wang, Ph.D., Computational Biology Staff Scientist
  • 2. Cellulase: the deep metagenome approach to discovering cellulases for biofuel research
  • 3. Large data, large reward: only 1% of discovered candidates shared >=95% identity with known genes; 50% showed validated activity (Science. 2011 Jan 28;331(6016):463-7)
  • 4. Sequence data: more data would be even better
  • 5. Rumen (2009): 17 Gb; Rumen (2010): 250 Gb; Rumen (2012): 1000 Gb. But can analysis keep up with data growth?
  • 6. Ideal solutions for the terabase problem: 1. Scalable to 1 Tb? 2. Performant (results within hours)?
  • 7. High-mem cluster: bottlenecked by input/output (IO) and memory
  • 8. MP/MPI solution for k-mer counting: raw data is partitioned into slices, and each node/core holds a slice of the data and a slice of the count table
  • 9. MP/MPI performance • MPI version: 412 Gb, 4.5B reads in 2.7 hours on 128x24 cores (NERSC Hopper II) • MP threaded version: 268 Gb, 3B reads in 5 days on 32 cores (high-mem cluster) • Fast and scalable, but problems remain: requires experienced software engineers; six months of development time; if one node fails, all fail
  • 10. Hadoop/MapReduce framework • Google MapReduce – a data-parallel programming model for processing petabyte-scale data – generally has a map and a reduce step • Apache Hadoop – distributed file system (HDFS) and job handling for scalability and robustness – data locality brings compute to the data, avoiding the network-transfer bottleneck
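The map-and-reduce model described on this slide can be sketched in a few lines of single-process Python. This is an illustrative model only, not Hadoop or BioPig code: real Hadoop distributes the map and reduce tasks across nodes and performs the grouping ("shuffle") over the network. The `map_reduce` helper and the word-count example are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    """Minimal single-process model of the MapReduce dataflow:
    apply map_fn to every record, group the intermediate (key, value)
    pairs by key (the 'shuffle'), then apply reduce_fn to each group."""
    intermediate = [pair for rec in records for pair in map_fn(rec)]
    intermediate.sort(key=itemgetter(0))  # shuffle/sort stand-in
    return {key: reduce_fn(key, [v for _, v in grp])
            for key, grp in groupby(intermediate, key=itemgetter(0))}

# Classic word count: map emits (word, 1); reduce sums the ones.
lines = ["hadoop brings compute to data", "pig compiles to hadoop jobs"]
counts = map_reduce(lines,
                    lambda line: [(w, 1) for w in line.split()],
                    lambda word, ones: sum(ones))
```

The same two-function shape (a map step and a reduce step) is what Pig scripts compile down to.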
  • 11. Programmability, Hadoop vs. Pig: finding the top 5 websites young people visit
  • 12. BioPig: design goals • Flexible – every dataset is unique, and data analysts have domain knowledge that is essential to optimizing the analysis; pluggable modules let analysts build custom analytic pipelines • High-level – a domain-specific language enables data analysts to create custom pipelines and hides the details of parallelism (too complex for most people) • Scalable – leverage data parallelism to speed up analytics; integrate external tools and applications where necessary; scale from 1 to hundreds of compute nodes with minimal effort and linear scalability • Robust – data and computation are replicated across nodes to tolerate node failures
  • 13. Runs on any hardware supporting Hadoop • JGI Titanium (commodity Hadoop cluster) – up to 20 nodes, each with 16 cores and 32 GB RAM at 1.799 GHz; 1 Gb Ethernet • NERSC Magellan Cloud Testbed – up to 200 nodes, each with 8 cores, 24 GB RAM, and 2.67 GHz Nehalem processors; 10 Gbit InfiniBand; GPFS • Amazon AWS – Elastic MapReduce with cluster compute nodes (23 GB of memory, 2 x Intel quad-core "Nehalem" processors, 1690 GB of instance storage, 10 Gb Ethernet)
  • 14. BioPig modules: Input/Output (Fasta/Fastq), Blast, K-mer Counter, Assembly
  • 15. How k-mer counting is implemented • Load: <id1, header, 'attagc'>, <id2, header, 'gttagg'> • Mapper: <id1, 'atta'>, <id1, 'ttag'>; <id2, 'gtta'>, <id2, 'ttag'> • Shuffle/sort: <'atta', id1>, <'ttag', id1, id2>, <'gtta', id2>, <'tagg', id2> • Reducer: <'atta', 1>, <'ttag', 2>, <'gtta', 1>, <'tagg', 1> • Merge: <'atta', 3>, <'ttag', 2>, <'gtta', 2>, <'tagg', 1>
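The load/map/shuffle/reduce stages on this slide can be made concrete with a small single-process Python simulation using the slide's two example reads. The function names are mine, not BioPig's, and the final merge stage (which combines partial counts arriving from multiple data partitions) is folded into the reduce step here:

```python
from collections import defaultdict

def mapper(read_id, seq, k):
    """Map step: emit one (k-mer, read_id) pair per k-length window."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], read_id

def shuffle(pairs):
    """Shuffle/sort step: group read ids by k-mer, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for kmer, read_id in pairs:
        groups[kmer].append(read_id)
    return groups

def reducer(groups):
    """Reduce step: count the occurrences of each k-mer."""
    return {kmer: len(ids) for kmer, ids in groups.items()}

reads = [("id1", "attagc"), ("id2", "gttagg")]
pairs = [p for read_id, seq in reads for p in mapper(read_id, seq, 4)]
counts = reducer(shuffle(pairs))
```

With k = 4, 'attagc' yields atta/ttag/tagc and 'gttagg' yields gtta/ttag/tagg, so 'ttag' is counted twice and every other k-mer once, matching the per-partition reducer output on the slide.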
  • 16. A 7-liner BioPig script for k-mer counting
  • 17. Rumen metagenome gene discovery pipeline: read preprocessing (remove artifacts), then pigBlast (blast reads against known cellulases), then pigAssembler (assemble reads into contigs), then pigExtender (extend contigs into full-length enzymes)
  • 18. Cloud solution to large data: BioPig-Blaster, BioPig-Assembler, BioPig-Extender. BioPig: 61 lines of code; MPI-extender: ~12,000 lines (vs. 31 in BioPig). Gains: flexibility, programmability, scalability
  • 19. Conclusions: Hadoop-based BioPig shows great potential for scalable analysis of very large sequence data; it is robust and easy to use
  • 20. Challenges in application • IO optimization, e.g., reducing data copying • Some problems do not easily fit the map/reduce framework, e.g., graph-based algorithms • Integration into existing frameworks, e.g., Galaxy
  • 21. Acknowledgements • Karan Bhatia • Henrik Nordberg • Kai Wang • Rob Egan • Alex Sczyrba • Jeremy Brand @JGI/NERSC • Shane Cannon @NERSC