



Thoughts on improving Hadoop benchmarking



  1. Benchmarking. Steve Loughran, Julio Guijarro. © 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  2. Benchmarks
  3. Some Problems
     •  Estimating Hadoop performance of hardware
     •  Estimating Hadoop performance of a cluster
     •  Designing Hadoop-ready servers
     •  Designing Hadoop-ready clusters
     •  Optimising the network for Hadoop
     •  Optimising Hadoop/HDFS for specific applications
  4. Recent customer request: "They want data for Hadoop Sort for 100GB."
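A request like that usually maps onto the stock TeraSort benchmark shipped in the Hadoop examples jar. A minimal sketch of such a run; the jar name and HDFS paths are assumptions and vary by Hadoop release:

```shell
# TeraGen writes 100-byte rows, so 10^9 rows gives 100 GB of input
hadoop jar hadoop-examples.jar teragen 1000000000 /benchmarks/terasort-input

# The sort itself: this is the timed step
hadoop jar hadoop-examples.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output

# Verify the output is globally sorted
hadoop jar hadoop-examples.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-report
```

The elapsed time of the terasort step is the headline number, though as the following slides argue, a single sort figure says little about CPU-bound or seek-heavy workloads.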
  5. Terasort: what else?
     •  PageRank: CPU-intensive, small (static) input dataset
     •  Something that stresses RAM and CPU
     •  Something that seeks in the files?
  6. Test Datasets
     •  Wikipedia: 5-10 TB of XML data with changes; user relationships have to be inferred
     •  SpamAssassin: 70+ GB of spam
     •  Physics? Something small?
  7. Network Measurement: What to add to Hadoop/Avro/Thrift to monitor network traffic, and how to relate it to specific jobs?
  8. Predicting performance: Can an MR job on small datasets predict performance on full-size datasets? What extra instrumentation can help?
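One way to frame the prediction question: time the same job at several small input sizes, fit a simple model, and extrapolate. A minimal sketch; the sample timings are invented for illustration, and real jobs rarely scale this cleanly once shuffle and network effects dominate:

```python
# Fit runtime = a * size + b from small-scale runs, then extrapolate.
# The (size_gb, seconds) points below are hypothetical measurements.
samples = [(1, 65.0), (2, 118.0), (4, 230.0), (8, 451.0)]

n = len(samples)
sum_x = sum(x for x, _ in samples)
sum_y = sum(y for _, y in samples)
sum_xx = sum(x * x for x, _ in samples)
sum_xy = sum(x * y for x, y in samples)

# Ordinary least squares for a straight line
a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - a * sum_x) / n

def predict_seconds(size_gb):
    """Extrapolated runtime; only trustworthy if the job really scales linearly."""
    return a * size_gb + b

print("predicted runtime for 100 GB: %.0f s" % predict_seconds(100))
```

A large residual between the fit and a held-out medium-size run is itself useful data: it signals the point where the job stops scaling linearly, which is exactly where extra instrumentation would pay off.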
  9. Hardware Q: What should a Hadoop-ready server look like? What about a rack? Or a container?