myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Hadoop-on-Demand on traditional HPC resources, presented at the UC Cloud Summit, 2011 (http://www.ucgrid.org/cloud2011/UCCloudSummit2011.html).

Speaker notes:
  • Mahidhar, Chaitan: architecture, prototyping. Jim: myHadoop roll. Shava: UC Grid support.
  • Motivations – why myHadoop. Technical challenges – what the problems are. Implementation details – how. Performance evaluation – findings.
  • Note: we didn’t make up these requirements – they came out of our own needs as end-users and computer scientists. Most of us have access to resources such as the TeraGrid, UC Grid, and SDSC Triton. I have no official affiliation with any of those resources – but I had access to them, and wanted to use them for performance studies.
  • Co-location of data and compute in shared-nothing – there is no centralized shared storage in the Hadoop model. HPC resources instead rely on high-performance parallel file systems.
  • 1) Scheduler access. 2, 3) Non-root, concurrent users. 4) Persistent and non-persistent modes.
  • Data loads are not very dominant here – the workload is more CPU-intensive. Performance is slightly better on local disk – there is more contention on Data Oasis, and Lustre is optimized for large files, not lots of smaller ones.
  • Data loads are the dominating factor for the non-persistent runs. It makes more sense to leave the data on the shared file system – and use that as the HDFS location. Output writes are also time-consuming in this case – so one might as well leave the data in HDFS for future runs.

    1. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
       myHadoop - Hadoop-on-Demand on Traditional HPC Resources
       Sriram Krishnan, Ph.D. – sriram@sdsc.edu
    2. Acknowledgements
       • Mahidhar Tatineni
       • Chaitanya Baru
       • Jim Hayes
       • Shava Smallen
    3. Outline
       • Motivations
       • Technical Challenges
       • Implementation Details
       • Performance Evaluation
    4. Motivations
       • An open source tool for running Hadoop jobs on HPC resources
         • Easy to configure and use for the end-user
         • Plays nice with existing batch systems on HPC resources
       • Why do we need such a tool?
         • End-users: I already have Hadoop code – and I only have access to regular HPC-style resources
         • Computer scientists: I want to study the implications of using Hadoop on HPC resources
         • And I don’t have root access to these resources
    5. Some Ground Rules
       • What this presentation is:
         • A “how-to” for running Hadoop jobs on HPC resources using myHadoop
         • A description of the performance implications of using myHadoop
       • What this presentation is not:
         • Propaganda for the use of Hadoop on HPC resources
    6. Main Challenges
       • Shared-nothing (Hadoop) versus HPC-style architectures
         • In terms of philosophies and implementation
       • Control and co-existence of Hadoop and HPC batch systems
         • Typically, both Hadoop and HPC batch systems (viz., SGE, PBS) need complete control over the resources for scheduling purposes
    7. Traditional HPC Architecture vs. Shared-nothing (MapReduce-style) Architectures
       [Diagram: a traditional HPC architecture – a parallel file system serving a compute cluster with minimal local storage – contrasted with a shared-nothing architecture – a compute/data cluster with local storage, connected via Ethernet]
    8. Hadoop and HPC Batch Systems
       • Access to HPC resources is typically via batch systems – viz., PBS, SGE, Condor, etc.
         • These systems have complete control over the compute resources
         • Users typically can’t log in directly to the compute nodes (via ssh) to start various daemons
       • Hadoop manages its resources using its own set of daemons
         • NameNode & DataNode for the Hadoop Distributed File System (HDFS)
         • JobTracker & TaskTracker for MapReduce jobs
       • Hadoop daemons and batch systems can’t co-exist seamlessly
         • They will interfere with each other’s scheduling algorithms
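       To make the mismatch concrete, here is a minimal sketch of how a stock Hadoop 0.20.2 cluster is normally brought up (the install path is an assumption). The stock start-dfs.sh and start-mapred.sh helpers ssh into every host listed in conf/masters and conf/slaves to launch the daemons – exactly the kind of direct node access that HPC batch systems disallow.

          # Normal (non-HPC) Hadoop 0.20.2 startup; assumes passwordless ssh
          # to every host in conf/masters and conf/slaves.
          export HADOOP_HOME=$HOME/hadoop-0.20.2   # assumed install location
          export HADOOP_CONF_DIR=$HADOOP_HOME/conf

          $HADOOP_HOME/bin/start-dfs.sh     # NameNode + DataNodes (HDFS)
          $HADOOP_HOME/bin/start-mapred.sh  # JobTracker + TaskTrackers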
    9. myHadoop Requirements
       1. Enabling execution of Hadoop jobs on shared HPC resources via traditional batch systems
          a) Working with a variety of batch systems (PBS, SGE, etc.)
       2. Allowing users to run Hadoop jobs without needing root-level access
       3. Enabling multiple users to simultaneously execute Hadoop jobs on the shared resource
       4. Allowing users to either run a fresh Hadoop instance each time (a), or store HDFS state for future runs (b)
    10. myHadoop Architecture
        [Diagram: the batch processing system (PBS, SGE) allocates compute nodes, on which the Hadoop daemons are started [1, 2, 3]; HDFS lives on local disk in non-persistent mode [4(a)], or on the parallel file system in persistent mode [4(b)]; bracketed numbers correspond to the requirements on slide 9]
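       The two modes boil down to where HDFS keeps its blocks. A minimal sketch of the idea follows, assuming hypothetical variable names, paths, and a template file – this is not myHadoop’s actual code:

          # Choose the HDFS data directory based on the requested mode.
          # MY_HDFS_MODE and the /scratch and /oasis paths are hypothetical.
          if [ "$MY_HDFS_MODE" = "persistent" ]; then
              # 4(b): HDFS state lives on the parallel file system across jobs
              HDFS_DATA_DIR="/oasis/$USER/hdfs"
          else
              # 4(a): fresh HDFS on node-local disk, discarded at teardown
              HDFS_DATA_DIR="/scratch/$USER/$PBS_JOBID/hdfs"
          fi
          mkdir -p "$HDFS_DATA_DIR"

          # Substitute the chosen path into the per-job hdfs-site.xml
          sed "s|HDFS_DATA_DIR|$HDFS_DATA_DIR|" hdfs-site.xml.template \
              > "$HADOOP_CONF_DIR/hdfs-site.xml"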
    11. Implementation Details: PBS, SGE
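       Both schedulers expose the allocated nodes through an environment variable – $PBS_NODEFILE under PBS, and $PE_HOSTFILE under SGE parallel environments – so a configure script can derive Hadoop’s masters and slaves files from the allocation. A sketch of that step (illustrative, not myHadoop’s actual code):

          # Derive Hadoop's node lists from the batch allocation.
          # PBS: $PBS_NODEFILE has one line per allocated core.
          # SGE: $PE_HOSTFILE has "host slots queue ..." per line.
          if [ -n "$PBS_NODEFILE" ]; then
              sort -u "$PBS_NODEFILE" > "$HADOOP_CONF_DIR/slaves"
          elif [ -n "$PE_HOSTFILE" ]; then
              awk '{print $1}' "$PE_HOSTFILE" | sort -u > "$HADOOP_CONF_DIR/slaves"
          fi

          # Run the master daemons on the first allocated node
          head -1 "$HADOOP_CONF_DIR/slaves" > "$HADOOP_CONF_DIR/masters"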
    12. User Workflow
        [Diagram: the user’s job runs between a BOOTSTRAP phase and a TEARDOWN phase]
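       End to end, a user’s PBS submission might look like the sketch below: bootstrap, run the MapReduce job, tear down. The pbs-configure.sh and pbs-cleanup.sh names and options stand in for myHadoop’s bootstrap and teardown scripts – check the distribution on SourceForge for the actual interface.

          #!/bin/bash
          #PBS -N myhadoop-job
          #PBS -l nodes=4:ppn=1,walltime=01:00:00

          # BOOTSTRAP: generate per-job Hadoop configs and start the daemons
          $MY_HADOOP_HOME/bin/pbs-configure.sh -n 4 -c "$HOME/myhadoop-conf"
          $HADOOP_HOME/bin/start-all.sh

          # Run the MapReduce job against the per-job configuration
          $HADOOP_HOME/bin/hadoop --config "$HOME/myhadoop-conf" \
              jar my-application.jar MyJob input/ output/

          # TEARDOWN: stop the daemons and clean up (non-persistent mode)
          $HADOOP_HOME/bin/stop-all.sh
          $MY_HADOOP_HOME/bin/pbs-cleanup.sh -n 4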
    13. Performance Evaluation
        • Goals and non-goals
          • Study the performance overhead and implications of myHadoop
          • Not to optimize/improve existing Hadoop code
        • Software and hardware
          • Triton Compute Cluster (http://tritonresource.sdsc.edu/)
          • Triton Data Oasis (Lustre-based parallel file system) for data storage, and for HDFS in “persistent mode”
          • Apache Hadoop version 0.20.2, with various parameters tuned for performance on Triton
        • Applications
          • Compute-intensive: HadoopBlast (Indiana University)
            • Modest-sized inputs – 128 query sequences (70K each)
            • Compared against the NR database – 200MB in size
          • Data-intensive: Data Selections (OpenTopography Facility at SDSC)
            • Input sizes from 1GB to 100GB
            • Sub-selecting around 10% of the entire dataset
    14. HadoopBlast
    15. Data Selections
    16. Related Work
        • Recipe for running Hadoop over PBS in the blogosphere
          • http://jaliyacgl.blogspot.com/2008/08/hadoop-as-batch-job-using-pbs.html
          • myHadoop is “inspired” by their approach – but is more general-purpose and configurable
        • Apache Hadoop On Demand (HOD)
          • http://hadoop.apache.org/common/docs/r0.17.0/hod.html
          • Only PBS support, needs external HDFS, harder to use, and has trouble with multiple concurrent Hadoop instances
        • CloudBatch – a batch queuing system on clouds
          • Use of Hadoop to run batch systems like PBS
          • Exact opposite of our goals – but a similar approach
    17. Center for Large-Scale Data Systems Research (CLDS)
        [Diagram: CLDS organization – an Industry Advisory Board and an Academic Advisory Board; benchmarking, performance evaluation and systems development projects; industry forums and professional education; an industry-university consortium on software for large-scale data systems; the “How Much Information?” project (public, private, personal); visiting fellows; information metrology; data growth and information management; cloud storage architecture; cloud storage and performance benchmarking; industry interchange, management and technical forums]
        • Student internships
        • Joint collaborations
    18. Summary
        • myHadoop – an open source tool for running Hadoop jobs on HPC resources
          • Without the need for root-level access
          • Co-exists with traditional batch systems
          • Allows “persistent” and “non-persistent” modes to save HDFS state across runs
          • Tested on SDSC Triton, TeraGrid, and UC Grid resources
        • More information
          • Software: https://sourceforge.net/projects/myhadoop/
          • SDSC Tech Report: http://www.sdsc.edu/pub/techreports/SDSC-TR-2011-2-Hadoop.pdf
    19. Questions?
        • Email me at sriram@sdsc.edu
    20. Appendix
    21. Hadoop parameters tuned for Triton

        core-site.xml:
        • io.file.buffer.size = 131072 (size of read/write buffer)
        • fs.inmemory.size.mb = 650 (size of in-memory FS for merging outputs)
        • io.sort.mb = 650 (memory limit for sorting data)

        hdfs-site.xml:
        • dfs.replication = 2 (number of times data is replicated)
        • dfs.block.size = 134217728 (HDFS block size in bytes)
        • dfs.datanode.handler.count = 64 (number of handlers to serve block requests)

        mapred-site.xml:
        • mapred.reduce.parallel.copies = 4 (number of parallel copies run by reducers)
        • mapred.tasktracker.map.tasks.maximum = 4 (max map tasks to run simultaneously)
        • mapred.tasktracker.reduce.tasks.maximum = 2 (max reduce tasks to run simultaneously)
        • mapred.job.reuse.jvm.num.tasks = 1 (reuse the JVM between tasks)
        • mapred.child.java.opts = -Xmx1024m (large heap size for child JVMs)
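       Since the bootstrap step regenerates the configuration for each job, these values end up in per-job XML files. A sketch of how a bootstrap script could write the hdfs-site.xml entries above (the shell-function approach is illustrative, not myHadoop’s actual mechanism; HADOOP_CONF_DIR is assumed to be set):

          # Write the tuned HDFS parameters into a per-job hdfs-site.xml.
          prop() {
              printf '  <property><name>%s</name><value>%s</value></property>\n' "$1" "$2"
          }
          {
              echo '<?xml version="1.0"?>'
              echo '<configuration>'
              prop dfs.replication 2
              prop dfs.block.size 134217728
              prop dfs.datanode.handler.count 64
              echo '</configuration>'
          } > "$HADOOP_CONF_DIR/hdfs-site.xml"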
    22. Data SelectCounts on Dash