Benchmarking
Steve Loughran
Julio Guijarro




© 2009 Hewlett-Packard Development Company, L.P.
The information contained ...
Benchmarks
Some Problems

•  Estimating Hadoop performance of hardware
•  Estimating Hadoop performance of a cluster
•  Designing Had...
Recent customer request




     "They want data for
        Hadoop Sort
        for 100GB."
Terasort: what else?

•    PageRank: CPU intensive, small (static) input
     dataset
•  Something that stresses RAM and C...
Test Datasets

•    Wikipedia: 5-10 TB of XML data with changes;
     user relationships have to be inferred
•  SpamAssass...
Network Measurement



 What to add to Hadoop/Avro/Thrift to
 monitor network traffic -and relate to
            specific ...
Predicting performance



  Can an MR job on small datasets predict
    performance on full size datasets?

   What extra ...
Hardware Q


  What should a Hadoop-ready server
              look like?

         What about a rack?


             Or a...
Benchmarking
Upcoming SlideShare
Loading in...5
×

Benchmarking

2,414

Published on

Thoughts on improving Hadoop benchmarking

Published in: Business, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,414
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
84
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Benchmarking

  1. 1. Benchmarking Steve Loughran Julio Guijarro © 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
  2. 2. Benchmarks
  3. 3. Some Problems •  Estimating Hadoop performance of hardware •  Estimating Hadoop performance of a cluster •  Designing Hadoop-ready servers •  Designing Hadoop-ready clusters •  Optimising the network for Hadoop •  Optimising Hadoop/HDFS for specific applications
  4. 4. Recent customer request "They want data for Hadoop Sort for 100GB."
  5. 5. Terasort: what else? •  PageRank: CPU intensive, small (static) input dataset •  Something that stresses RAM and CPU •  Something that seeks in the files?
  6. 6. Test Datasets •  Wikipedia: 5-10 TB of XML data with changes; user relationships have to be inferred •  SpamAssassin: 70+ GB of SPAM •  Physics? Something Small?
  7. 7. Network Measurement What to add to Hadoop/Avro/Thrift to monitor network traffic -and relate to specific jobs?
  8. 8. Predicting performance Can an MR job on small datasets predict performance on full size datasets? What extra instrumentation can help?
  9. 9. Hardware Q What should a Hadoop-ready server look like? What about a rack? Or a container?
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×