Apache Hadoop India Summit 2011 talk "Hadoop Simulation and Performance" by Ranjit Mathew


  1. Hadoop Simulation and Performance
     Apache Hadoop India Summit 2011
     Ranjit Mathew, Yahoo! R & D India
     Copyright © 2011 Yahoo! All rights reserved.
  2. Overview
     • Introduction
     • GridMix3
     • PigMix2
     • Tips
     • Plans
     • Q & A
  3. Introduction
  4. Why?
     • Capacity Planning
     • Benchmarking
     • Comparative evaluation of releases
     • Basis for improvements
     • Debugging
  5. Performance Evaluation Techniques
     • Analytical Modeling
       • Use statistics, queuing theory, etc. to model system
       • Use models to predict behavior
     • Simulation
       • Simulate work-load based on representation or traces
       • Benchmarking used to compare variants
     • Measurement
       • Use metrics gathered from tools and logs
       • Measure under peak, regular and light work-loads
     Ref.: “The Art of Computer Systems Performance Analysis”, Raj K. Jain (Wiley, 1991)
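As a small illustration of the analytical-modeling technique, the sketch below applies the textbook M/M/1 queuing formulas to predict utilization and mean response time from an arrival rate and a service rate. Treating a whole cluster as a single-server queue is an assumption made purely for illustration, and the function name and the rates are hypothetical; the talk does not prescribe any particular model.

```python
def mm1_metrics(arrival_rate: float, service_rate: float) -> dict:
    """Steady-state metrics of an M/M/1 queue.

    Both rates are in jobs per unit time. This is the simplest queuing
    model; a real Hadoop cluster would need a far richer one.
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")

    utilization = arrival_rate / service_rate            # rho = lambda / mu
    mean_response = 1.0 / (service_rate - arrival_rate)  # W = 1 / (mu - lambda)
    mean_in_system = utilization / (1.0 - utilization)   # L = rho / (1 - rho)
    return {
        "utilization": utilization,
        "mean_response": mean_response,
        "mean_in_system": mean_in_system,
    }


if __name__ == "__main__":
    # Hypothetical rates: watch response time grow as load nears capacity.
    for rate in (2.0, 8.0, 9.5):
        print(rate, mm1_metrics(rate, service_rate=10.0))
```

The model predicts that response time rises sharply as utilization approaches 1, which is the kind of behavior such models are used to anticipate before measuring a real system.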
  6. Hadoop Performance Evaluation Tools
     • GridMix3
     • PigMix2
     • TeraSort / GraySort
     • DFSIO, NNBench, S-Live
     • HiBench
     • etc.
  7. GridMix3
  8. GridMix Evolution
     • GridMix1 (HADOOP-2369):
       • Representative mix of Jobs
       • mapreduce/src/benchmarks/gridmix
     • GridMix2 (HADOOP-3770):
       • More configurable; uses JobControl
       • mapreduce/src/benchmarks/gridmix2
     • GridMix3 (MAPREDUCE-776):
       • Trace-based; better emulation-accuracy
       • mapreduce/src/contrib/gridmix
     • Rumen (MAPREDUCE-751):
       • Supporting tool for GridMix3 et al.
       • mapreduce/src/tools/org/apache/hadoop/tools/rumen
  9. GridMix3
     • Macro benchmark for Hadoop
     • Trace-based submission of synthetic Jobs
     • Traces based on production clusters
     • Traces generated by Rumen
     • No access to original Job’s code or data
     • Emulates I/O and other aspects
     • Highly configurable
  10. Rumen
     • Comprises:
       • TraceBuilder - Job Traces from Job History and Configuration
       • Folder - Scales Job Traces to a given time-window
     • Job Traces are in JSON format
     • Insulation for release-to-release changes in format and contents
     • Statistical information on Jobs in Trace
     • Provides API to access Job Traces
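To make the idea of scaling a Job Trace to a time-window concrete, here is a minimal sketch that linearly rescales submission times in a small JSON trace. It is not Rumen's Folder and does not claim to reproduce its algorithm; the JSON-array layout, the field names `jobID` and `submitTime`, and the function `scale_trace` are assumptions made for illustration.

```python
import json

# Hypothetical, heavily simplified trace: a real Rumen trace carries far
# more per-job detail, and this layout is assumed for illustration only.
SAMPLE_TRACE = """
[{"jobID": "job_0001", "submitTime": 1000},
 {"jobID": "job_0002", "submitTime": 3000},
 {"jobID": "job_0003", "submitTime": 5000}]
"""


def scale_trace(jobs, window_ms):
    """Return copies of the jobs with submit times rescaled to [0, window_ms].

    Relative ordering and proportional spacing are preserved; the input
    list and its dictionaries are left unmodified.
    """
    if not jobs:
        return []
    times = [job["submitTime"] for job in jobs]
    start, span = min(times), max(times) - min(times)
    scaled = []
    for job in jobs:
        # A trace whose jobs all share one submit time collapses to offset 0.
        offset = 0 if span == 0 else (job["submitTime"] - start) * window_ms // span
        scaled.append({**job, "submitTime": offset})
    return sorted(scaled, key=lambda job: job["submitTime"])


if __name__ == "__main__":
    for job in scale_trace(json.loads(SAMPLE_TRACE), window_ms=100):
        print(job["jobID"], job["submitTime"])
```

A linear rescale is only one way to fit a trace into a shorter window; it keeps the relative shape of the submission pattern while compressing the overall duration.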
  11. GridMix3 Flow
     [Diagram; labels: Production Cluster, Job Histories & Configuration, Rumen, Job Trace, GridMix3 (Data Generator, Job Submitter), Benchmark Cluster]
  12. GridMix3 Architecture
     [Diagram; labels: GridMix3, JobFactory, JobSubmitter, JobMonitor, GridmixJob, MapReduceJob, Job, JobStory, Status, JobTracker, Rumen]
  13. GridMix3 Emulation-Accuracy
  14. Submission Policies and Job Types
     • Submission policy determines when Jobs are submitted:
       • STRESS - Keep cluster under stress (but not overwhelm it)
       • REPLAY - Faithful emulation of inter-job submission times
       • SERIAL - Submit a Job only after the previous one finishes
     • Types of synthetic Jobs:
       • LOADJOB - Emulates work-load from Job Trace
       • SLEEPJOB - Do nothing for periods from Job Trace
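The difference between the three submission policies can be seen in a toy model that computes when each Job would be submitted. This is a sketch under stated assumptions, not GridMix3's implementation: job runtimes are taken as fixed, STRESS is approximated by a hypothetical cap on Jobs in flight (`max_in_flight`), and the function name is invented for illustration.

```python
import heapq


def submission_times(policy, trace, max_in_flight=2):
    """Toy model of when each job in a trace gets submitted.

    trace: list of (original_submit_time, runtime) pairs, sorted by
    original submit time. Returns submission times relative to the
    start of the run.
    """
    if policy == "REPLAY":
        # Preserve the original gaps between submissions.
        first = trace[0][0] if trace else 0
        return [submit - first for submit, _ in trace]

    if policy == "SERIAL":
        # Each job waits for the previous one to finish.
        times, clock = [], 0
        for _, runtime in trace:
            times.append(clock)
            clock += runtime
        return times

    if policy == "STRESS":
        # Assumption: "stress" is modeled as keeping max_in_flight jobs
        # running and submitting the next one as soon as a slot frees up.
        times, finishing, clock = [], [], 0
        for _, runtime in trace:
            if len(finishing) >= max_in_flight:
                clock = max(clock, heapq.heappop(finishing))
            times.append(clock)
            heapq.heappush(finishing, clock + runtime)
        return times

    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    demo = [(100, 30), (110, 10), (160, 20)]  # hypothetical trace
    for name in ("REPLAY", "SERIAL", "STRESS"):
        print(name, submission_times(name, demo))
```

In this model REPLAY ignores how busy the cluster is, SERIAL ignores the original timing, and STRESS ignores both in favor of keeping the cluster loaded.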
  15. PigMix2
  16. PigMix Evolution
     • PigMix1:
       • Representative mix of 12 Pig scripts and Java programs
       • http://wiki.apache.org/pig/PigMix
       • http://wiki.apache.org/pig/DataGeneratorHadoop
     • PigMix2 (PIG-200):
       • Added 5 Pig scripts and Java programs
       • Re-factored data-generation
  17. PigMix2
     • Benchmark for Pig
     • Representative mix of 17 Pig scripts
     • Corresponding native MapReduce Java programs
     • Specifications-based input-data generator
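Pairing each Pig script with a corresponding native MapReduce Java program allows a per-script comparison of the two. The sketch below computes a Pig-to-MapReduce runtime ratio per script; the script names, the timings, and the function name are hypothetical and are not results from the talk.

```python
def pig_to_mapreduce_ratios(pig_secs, mr_secs):
    """Map each script name to (Pig runtime / MapReduce runtime).

    Only scripts timed in both runs are compared. A ratio above 1.0 means
    the Pig script was slower than its hand-written counterpart.
    """
    ratios = {}
    for name in sorted(pig_secs):
        if name not in mr_secs:
            continue  # no counterpart timing available
        if mr_secs[name] <= 0:
            raise ValueError(f"non-positive MapReduce runtime for {name}")
        ratios[name] = pig_secs[name] / mr_secs[name]
    return ratios


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds, for illustration only.
    pig = {"script_a": 120.0, "script_b": 90.0, "script_c": 75.0}
    mapreduce = {"script_a": 100.0, "script_b": 100.0}
    for name, ratio in pig_to_mapreduce_ratios(pig, mapreduce).items():
        print(f"{name}: {ratio:.2f}x")
```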
  18. PigMix2 Flow
     [Diagram; labels: Data Generator, Input Data, PigMix2, Benchmark Cluster]
  19. Tips
  20. Minimize Variance
     • Check hardware, especially for failing hard-drives
     • Use large data-sets to minimize effects of overheads
     • Beware of speculative execution
     • Set ipc.ping.interval to 5000 (HADOOP-5380)
     • Use appropriate PARALLEL clause in PigMix2 Pig scripts
     • Several runs needed for proper analysis
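One way to act on the advice about several runs is to summarize repeated timings of the same benchmark and look at their spread before comparing anything. The sketch below reports the mean, sample standard deviation, and coefficient of variation; the function name and the timings are hypothetical, and what counts as an acceptable spread is left to the reader.

```python
import statistics


def summarize_runs(run_secs):
    """Summarize repeated timings (in seconds) of one benchmark.

    Returns the run count, mean, sample standard deviation, and the
    coefficient of variation (stdev / mean) as a unit-free spread measure.
    """
    if len(run_secs) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(run_secs)
    stdev = statistics.stdev(run_secs)
    return {
        "runs": len(run_secs),
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,
    }


if __name__ == "__main__":
    timings = [100.0, 110.0, 90.0]  # hypothetical run times
    print(summarize_runs(timings))
```

A high coefficient of variation suggests the variance sources listed above should be investigated before two releases or configurations are compared.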
  21. Apples to Apples Comparison
     • Benchmarking versus Production Cluster:
       • Same hardware
       • Same software stack
       • Same configuration
       • Similar networking
       • Same size (might not be feasible)
     • Extrapolating results can be tricky
  22. Plans
  23. Future Work
     • Greater emulation-accuracy in GridMix3:
       • Distributed Cache
       • Compression
       • CPU usage
       • Memory usage
     • More comprehensive Job Traces from Rumen
     • Integration of PigMix2 with Pig Statistics
  24. Q & A
  25. Ranjit Mathew
     Senior Principal Engineer
     ranjit@yahoo-inc.com
