Hadoop for Scientific Workloads__HadoopSummit2010


Hadoop Summit 2010 - Research Track
Hadoop for Scientific Workloads
Lavanya Ramakrishnan, Lawrence Berkeley National Lab

Published in: Technology

  • Range of application classes with different models. IMG: part of the IMG family of systems hosted at the DOE Joint Genome Institute (JGI); supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. Supernova Factory: data pipeline, task-parallel workflow, image-matching algorithms that should work; might be heavy on the I/O side, but the other advantages might outweigh the performance cost. MODIS: data integration challenges; ~35 science data products including atmospheric and land products; products are in different projections, resolutions (spatial and temporal), and times; data volume and processing requirements exceed desktop capacity.
  • There is a huge spectrum of scientific applications at LBL: high energy physics, eco-sciences, bioinformatics. These have a varied set of requirements and a need for unlimited compute cycles and data storage. NERSC and IT provide infrastructure and resources for these applications. Other groups in CRD work closely with the scientists to explore and develop the user interfaces, middleware, grid tools, data support, and infrastructure tools for monitoring that are required to facilitate scientific exploration. Cloud computing brings in a new resource model of delivering “on-demand cycles at a cost” and a new set of programming models and tools. Many groups at LBL are interested in seeing how the different features of cloud computing would help them in their scientific explorations. In general we need to explore the big question of how we work closely with scientists to deliver a more diverse set of services that do not just target the traditional HPC applications. Does it make it easier for us to do what we have traditionally been doing? Does it help us do things differently than before? Can it bring other users in?
  • Why Hadoop? What implications
  • This application is part of the IMG family of systems hosted at the DOE Joint Genome Institute (JGI). It supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. Content maintenance consists of two tasks. (1) Integrating new metagenome datasets with the reference genomes every 2-4 weeks, which involves running BLAST to identify pair-wise gene similarities between the new metagenome and the reference genomes. (2) Updating the reference genome baseline with new (~500) genomes every 4 months, which involves running BLAST to refresh pair-wise gene similarities between reference genomes, and between metagenome and reference genomes; this takes about 3 weeks on a Linux cluster with 256 cores. The take-away point is that the databases are growing and BLAST is used heavily in the pipeline.
  • Hard limits in the Hadoop config (3GB ulimit, but DB > 3GB). Thrashing due to the DB not fitting in available memory: in the first iteration the job took 3.5 to 4.5 hrs to finish, but with 80% of the DB, which fits into memory, it takes half the time. Hadoop does not guarantee simultaneous availability of resources, so time to solution is hard to predict.
  • Here are the different features of cloud; each has an attraction for a class of users. a. Who doesn’t want free cycles? The on-demand aspect is appealing: getting 10 CPUs for 1 hr now or getting 5 CPUs for 2 hrs has the same cost. This, combined with the idea that you don’t have to wait for CPUs, is also very attractive for batch queue users. b. The virtual environments that seem commonplace tend to impose some overheads, but for large parametric studies such as BLAST the overhead might be acceptable. c. Users bear the brunt of OS and software upgrades; e.g., Supernova Factory has a code base that works only on 32-bit systems, and as 64-bit systems become more commonplace they are restricted in where they can run. d. Science problems are exceeding current systems.
  • Science gateways S3 storage Phasing
  • Hadoop for Scientific Workloads__HadoopSummit2010
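The cost note above (10 CPUs for 1 hr costs the same as 5 CPUs for 2 hrs) can be checked with a small sketch. The function name and the rate of 10 cents per CPU-hour are hypothetical and chosen only for illustration; the talk does not quote a price.

```python
def cost_cents(cpus: int, hours: int, rate_cents_per_cpu_hour: int) -> int:
    """Cost under a pure pay-per-CPU-hour model (no queue wait, no minimum charge)."""
    return cpus * hours * rate_cents_per_cpu_hour


# Hypothetical rate, not taken from the talk.
RATE_CENTS = 10

wide_and_short = cost_cents(cpus=10, hours=1, rate_cents_per_cpu_hour=RATE_CENTS)
narrow_and_long = cost_cents(cpus=5, hours=2, rate_cents_per_cpu_hour=RATE_CENTS)

# Both shapes consume 10 CPU-hours, so the bill is the same.
print(wide_and_short, narrow_and_long)
```

Under this model only the product of CPUs and hours matters, which is why a user can trade run time for width without changing the bill.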

    1. 1. Hadoop for Scientific Workloads <ul><li>Lavanya Ramakrishnan </li></ul><ul><li>Shane Canon </li></ul><ul><li>Shreyas Cholia </li></ul><ul><li>Keith Jackson </li></ul><ul><li>John Shalf </li></ul>Lawrence Berkeley National Lab
    2. 2. Example Scientific Applications <ul><li>Integrated Microbial Genomes (IMG) </li></ul><ul><ul><li>analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes </li></ul></ul><ul><li>Supernova Factory </li></ul><ul><ul><li>tools to measure expansion of universe and energy </li></ul></ul><ul><ul><li>task parallel workflow, large data volume </li></ul></ul><ul><li>MODerate-resolution Imaging Spectroradiometer (MODIS) </li></ul><ul><ul><li>two MODIS satellites near polar orbits </li></ul></ul><ul><ul><li>~ 35 science data products including atmospheric and land products </li></ul></ul><ul><ul><li>products are in different projection, resolutions (spatial and temporal), different times </li></ul></ul>
    3. 3. Supporting Science at LBL <ul><li>Unlimited need for compute cycles and data storage </li></ul><ul><li>Tools and middleware to access resources </li></ul>Scientists HPC and IT resources User interfaces, grid middleware, workflow tools, data management, etc <ul><li>Does cloud computing </li></ul><ul><li>make it easier or better to do what we do? </li></ul><ul><li>help us do things differently than before? </li></ul><ul><li>help us include other users? </li></ul>
    4. 4. Magellan – Exploring Cloud Computing <ul><li>Test-bed to explore Cloud Computing for Science </li></ul><ul><li>National Energy Research Scientific Computing Center (NERSC) </li></ul><ul><li>Argonne Leadership Computing Facility (ALCF) </li></ul><ul><li>Funded by DOE under the American Recovery and Reinvestment Act (ARRA) </li></ul>
    5. 5. Magellan Cloud at NERSC: 720 nodes, 5760 cores in 9 Scalable Units (SUs), 61.9 Teraflops. SU = IBM iDataplex rack with 640 Intel Nehalem cores. Diagram labels: 9 SUs; QDR IB fabric; 14 I/O nodes (shared); 18 login/network nodes; load balancer; 8G FC; 10G Ethernet; 1 Petabyte with GPFS; NERSC Global Filesystem; HPSS (15PB); Internet; 100-G router; ANI
    6. 6. Magellan Research Agenda <ul><li>What are the unique needs and features of a science cloud? </li></ul><ul><li>What applications can efficiently run on a cloud? </li></ul><ul><li>Are cloud computing programming models such as Hadoop effective for scientific applications? </li></ul><ul><li>Can scientific applications use a data-as-a-service or software-as-a-service model? </li></ul><ul><li>Is it practical to deploy a single logical cloud across multiple DOE sites? </li></ul><ul><li>What are the security implications of user-controlled cloud images? </li></ul><ul><li>What is the cost and energy efficiency of clouds? </li></ul>
    7. 7. Hadoop for Science <ul><li>Classes of applications </li></ul><ul><ul><li>tightly coupled MPI applications, loosely coupled data intensive science </li></ul></ul><ul><ul><li>use batch queue systems in supercomputing centers, local clusters and desktop </li></ul></ul><ul><li>Advantages of Hadoop </li></ul><ul><ul><li>transparent data replication, data locality aware scheduling </li></ul></ul><ul><ul><li>fault tolerance capabilities </li></ul></ul><ul><li>Mode of operation </li></ul><ul><ul><li>use streaming to launch a script that calls executable </li></ul></ul><ul><ul><li>HDFS for input, need shared file system for binary and database </li></ul></ul><ul><ul><li>input format </li></ul></ul><ul><ul><ul><li>handle multi-line inputs (BLAST sequences), binary data (High Energy Physics) </li></ul></ul></ul>
    8. 8. Hadoop Benchmarking: Early Results <ul><li>Compare traditional parallel file systems to HDFS </li></ul><ul><ul><li>40 node Hadoop cluster where each node contains two Intel Nehalem quad-core processors </li></ul></ul><ul><ul><li>TeraGen and Terasort to compare file system performance </li></ul></ul><ul><ul><ul><li>32 maps for TeraGen and 64 reduces for Terasort over a terabyte of data </li></ul></ul></ul><ul><ul><li>TestDFSIO to understand concurrency </li></ul></ul>
    9. 9. IMG Systems: Genome & Metagenome Data Flow (diagram labels) + 287 Samples: ~105 Studies + 12.5 Mil genes 19 Mil genes <ul><li>~ 350 - 500 Genomes </li></ul><ul><li>~ .5 – 1 Mil Genes </li></ul>Every 4 months 65 Samples: 21 Studies IMG+2.6 Mil genes 9.1 Mil total Monthly On demand On demand <ul><li>+ 330 Genomes </li></ul><ul><ul><li>158 GEBA </li></ul></ul><ul><li>8.2 Mil genes </li></ul>Monthly 5,115 Genomes 6.5 Mil genes
    10. 10. BLAST on Hadoop <ul><li>NCBI BLAST (2.2.22) </li></ul><ul><ul><li>reference IMG genomes of 6.5 mil genes (~3 GB in size) </li></ul></ul><ul><ul><li>full input set 12.5 mil metagenome genes against reference </li></ul></ul><ul><li>BLAST Hadoop </li></ul><ul><ul><li>uses streaming to manage input data sequences </li></ul></ul><ul><ul><li>binary and databases on a shared file system </li></ul></ul><ul><li>BLAST Task Farming Implementation </li></ul><ul><ul><li>server reads inputs and manages the tasks </li></ul></ul><ul><ul><li>client runs blast, copies database to local disk or ramdisk once on startup, pushes back results </li></ul></ul><ul><ul><li>advantages: fault-resilient and allows incremental expansion as resources become available </li></ul></ul>
    11. 11. Hardware Platforms <ul><li>Franklin: Traditional HPC System </li></ul><ul><ul><li>40k core, 360TFLOP Cray XT4 system at NERSC, Lustre parallel filesystem </li></ul></ul><ul><li>Amazon EC2: Commercial “Infrastructure as a Service” Cloud </li></ul><ul><ul><li>Configure and boot customized virtual machines in Cloud </li></ul></ul><ul><li>Yahoo M45: Shared Research “Platform as a Service” Cloud </li></ul><ul><ul><li>400 nodes, 8 cores per node, Intel Xeon E5320, 6GB per compute node, 910.95TB </li></ul></ul><ul><ul><li>Hadoop/MapReduce service: HDFS and shared file system </li></ul></ul><ul><li>Windows Azure BLAST “Software as a Service” </li></ul>
    12. 12. BLAST Performance
    13. 13. BLAST on Yahoo! M45 Hadoop <ul><li>Initial config – Hadoop memory ulimit issues, </li></ul><ul><ul><li>Hadoop memory limits increased to accommodate high memory tasks </li></ul></ul><ul><ul><li>1 map per node for high memory tasks to reduce contention </li></ul></ul><ul><ul><li>thrashing when DB does not fit in memory </li></ul></ul><ul><li>NFS shared file system for common DB </li></ul><ul><ul><li>move DB to local nodes (copy to local /tmp). </li></ul></ul><ul><ul><li>initial copy takes 2 hours, but now BLAST job completes in < 10 minutes </li></ul></ul><ul><ul><li>performance is equivalent to other cloud environments. </li></ul></ul><ul><ul><li>future: Experiment with Distributed Cache </li></ul></ul><ul><li>Time to solution varies - no guarantee of simultaneous availability of resources </li></ul><ul><ul><li>Strong user group and sysadmin support was key in working through this. </li></ul></ul>
    14. 14. HBase for Metagenomics <ul><li>Output of “all vs. all” pairwise gene sequence comparisons </li></ul><ul><ul><li>currently data stored in compressed files </li></ul></ul><ul><ul><ul><li>modifying individual entries is challenging </li></ul></ul></ul><ul><ul><ul><li>queries are hard </li></ul></ul></ul><ul><ul><li>duplication of data to ease presentation by different UI components </li></ul></ul><ul><li>Evaluating changing to HBase </li></ul><ul><ul><li>easily update individual rows and simple queries </li></ul></ul><ul><ul><li>query and update performance exceeds requirements </li></ul></ul><ul><li>Challenge: Bulk loads of approximately 30 billion rows </li></ul><ul><ul><li>trying multiple techniques for bulk loading </li></ul></ul><ul><ul><li>best practices are not well documented </li></ul></ul>
    15. 15. Magellan Application: De-novo assembly <ul><li>Move data from disk to clustered memory </li></ul><ul><li>Move analysis pipeline from single-node to parallel map/reduce jobs </li></ul><ul><li>== </li></ul><ul><li>efficient horizontal scalability </li></ul><ul><li>(more data -> add more nodes) </li></ul>Private/public cloud Memory requirements: ~500 GB (de Bruijn graph) CPU hours (single assembly): velveth: ~23h,velvetg: ~21h Source: Karan Bhatia
    16. 16. Summary <ul><li>Deployment Challenges </li></ul><ul><ul><li>all jobs run as user “hadoop” affecting file permissions </li></ul></ul><ul><ul><li>less control on how many nodes are used - affects allocation policies </li></ul></ul><ul><ul><li>file system performance for large file sizes </li></ul></ul><ul><li>Programming Challenges: No turn-key solution </li></ul><ul><ul><li>using existing code bases, managing input formats and data </li></ul></ul><ul><li>Performance </li></ul><ul><ul><li>BLAST over Hadoop: performance is comparable to existing systems </li></ul></ul><ul><ul><li>existing parallel file systems can be used through Hadoop On Demand </li></ul></ul><ul><li>Additional benchmarking, tuning needed </li></ul><ul><li>Plug-ins for Science </li></ul>
    17. 17. Acknowledgements <ul><li>This work was funded in part by the Advanced Scientific Computing Research (ASCR) in the DOE Office of Science under contract number DE-AC02-05CH11231. </li></ul><ul><li>CITRIS/UC, Yahoo! M45, Greg Bell, Victor Markowitz, Rollin Thomas </li></ul>
    18. 18. Questions? <ul><li>[email_address] </li></ul>
    19. 19. Cloud Usage Model <ul><li>On-demand access to computing and cost associativity </li></ul><ul><li>Customized and controlled environments </li></ul><ul><ul><li>e.g., Supernova Factory codes have sensitivity to OS/compiler versions </li></ul></ul><ul><li>Overflow capacity to supplement existing systems </li></ul><ul><ul><li>e.g., Berkeley Water Center has analysis that far exceeds capacity of desktops </li></ul></ul><ul><li>Parallel programming models for data intensive science </li></ul><ul><ul><li>e.g., BLAST parametric runs </li></ul></ul>
    20. 20. NERSC Magellan Software Strategy <ul><li>Runtime provisioning of software images via Moab and xCat </li></ul><ul><li>Explore a variety of usage models </li></ul><ul><li>Choice of local or remote cloud </li></ul>ANI Magellan Cluster
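
Slides 7 and 10 note that the streaming setup has to handle multi-line inputs such as BLAST sequences. One possible workaround, sketched below, is to regroup FASTA input into one record per line so that line-oriented processing never sees half a sequence. This is an illustration only and not necessarily what the authors did; the function names are hypothetical, and the sketch assumes it is given whole records (it does not repair a record cut at an input split boundary).

```python
from typing import Iterable, Iterator, Tuple


def fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Group multi-line FASTA text into (header, sequence) pairs."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span many lines
    if header is not None:
        yield header, "".join(chunks)


def to_streaming_lines(lines: Iterable[str]) -> Iterator[str]:
    """Emit one tab-separated 'header<TAB>sequence' line per record."""
    for header, sequence in fasta_records(lines):
        yield f"{header}\t{sequence}"


sample = [">g1 sample\n", "MKV\n", "LLA\n", "\n", ">g2\n", "GGH\n"]
for out in to_streaming_lines(sample):
    print(out)
```

A script launched through streaming could apply `to_streaming_lines` to its standard input and then hand each record to the BLAST executable; that invocation is left out here because its arguments depend on the local installation.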