The Performance of MapReduce: An In-depth Study

Published in: Technology

  1. Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu (School of Computing, NUS). Presented by Tang Kai.
  2. Introduction; Factors affecting performance of MR; Pruning search space; Implementation; Benchmark
  3. MapReduce-based systems are increasingly being used.
       ◦ Simple yet impressive interface: Map(), Reduce()
       ◦ Flexible: storage system independence
       ◦ Scalable
       ◦ Fine-grain fault tolerance
  4. Previous study
       ◦ Fundamental differences: schema support, data access, fault tolerance
       ◦ Benchmark: parallel DB >> MR-based
  5. Is it not possible to have a flexible, scalable, and efficient MapReduce-based system?
     Work
       ◦ Identify several performance bottlenecks
       ◦ Manage the bottlenecks and tune performance using well-known engineering and database techniques
     Conclusion
       ◦ 2.5x-3.5x
  6. Introduction; Factors affecting performance of MR; Pruning search space; Implementation; Benchmark
  7. 7 steps of a MapReduce job: 1) Map 2) Parse 3) Process 4) Sort 5) Shuffle 6) Merge 7) Reduce
  8. I/O mode; Indexing; Parsing; Sorting
  9. Direct I/O
       ◦ Reads data from the disk directly
       ◦ Local
     Streaming I/O
       ◦ Streams data from the storage system through an inter-process communication scheme, such as TCP/IP or JDBC
       ◦ Local and remote
     Direct I/O is faster than streaming I/O by 10%-15%.
  10. Input of a MapReduce job
       ◦ A set of files stored in a distributed file system, i.e. HDFS -> range indexes
       ◦ Input HDFS files are not sorted, but each data chunk in the files is indexed by keys -> block-level indexes
       ◦ Tables stored in database servers -> database indexed tables
     Indexing boosts the selection task by 2x-10x, depending on the selectivity.
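The selection speedup from indexing comes from skipping data chunks that cannot contain matching keys. A minimal sketch of that idea, assuming a hypothetical in-memory index that records the minimum and maximum key of each chunk. The names `ChunkRange` and `chunks_to_scan` and the toy key ranges are illustrative only; they are not taken from the slides or from Hadoop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRange:
    """Hypothetical index entry: the key range covered by one data chunk."""
    chunk_id: int
    min_key: int
    max_key: int


def chunks_to_scan(index, low, high):
    """Return ids of chunks whose key range overlaps the query range [low, high]."""
    return [c.chunk_id for c in index if c.max_key >= low and c.min_key <= high]


# Assumed toy index over four chunks.
index = [
    ChunkRange(0, 0, 99),
    ChunkRange(1, 100, 199),
    ChunkRange(2, 200, 299),
    ChunkRange(3, 300, 399),
]

# A selective query touches one chunk instead of all four.
selected = chunks_to_scan(index, 120, 180)
print(selected)
```

The fewer chunks a query overlaps, the more work is skipped, which is consistent with the slide's remark that the gain depends on the selectivity.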
  11. Raw data -> <k,v> pair
       ◦ Immutable decoding: read-only records (set once)
       ◦ Mutable decoding
     The mutable decoder is 10x faster.
       ◦ Boosts the selection task 2x overall
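The difference between the two decoding schemes can be sketched as follows. This is an illustration of the object-reuse pattern only, assuming a hypothetical two-field record and a `|` delimiter; it is not the decoder from the study, and Python timings would not reproduce the 10x figure measured on the JVM.

```python
class Record:
    """Hypothetical <k,v> record with two fields."""
    __slots__ = ("key", "value")

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value


def decode_immutable(lines):
    """Allocate a fresh Record per input line (fields are set once)."""
    for line in lines:
        key, value = line.split("|", 1)
        yield Record(key, value)


def decode_mutable(lines):
    """Reuse one Record and overwrite its fields for each input line."""
    record = Record()
    for line in lines:
        record.key, record.value = line.split("|", 1)
        yield record  # the caller must consume it before the next iteration


raw = ["a|1", "b|2", "c|3"]

# Keeping all immutable records alive shows that each one is a distinct object.
immutable_ids = {id(r) for r in list(decode_immutable(raw))}
# The mutable decoder hands back the same object every time.
mutable_ids = {id(r) for r in decode_mutable(raw)}
keys = [r.key for r in decode_mutable(raw)]
```

The mutable variant avoids one allocation per record, at the price that a caller may not hold on to a record across iterations.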
  12. Map-side sorting affects the performance of aggregation.
       ◦ Cost of key comparison is non-trivial.
     Example
       ◦ sourceIP in the UserVisits table
       ◦ Sort intermediate records.
       ◦ sourceIP is a variable-length string: string compare (byte-by-byte) vs. fingerprint compare (integer)
     Fingerprint-based sorting is 4x-5x faster.
       ◦ 20%-25% overall
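A minimal sketch of fingerprint-based sorting, under the assumption that the fingerprint is a 32-bit hash of the key. The slide does not say how the fingerprint is computed, so CRC32 here is a stand-in, and the function names and sample records are hypothetical.

```python
import zlib


def fingerprint(key: str) -> int:
    """Map a string key to a 32-bit integer (CRC32 is an assumed choice)."""
    return zlib.crc32(key.encode("utf-8"))


def sort_by_fingerprint(records):
    """Sort (key, value) records by (fingerprint, key).

    The integer comparison decides most orderings; the string is compared
    only when two fingerprints are equal, so equal keys always end up adjacent.
    """
    return sorted(records, key=lambda kv: (fingerprint(kv[0]), kv[0]))


records = [("10.0.0.7", 3), ("192.168.1.1", 5), ("10.0.0.7", 4), ("172.16.0.9", 1)]
ordered = sort_by_fingerprint(records)

# Aggregation only needs equal keys to be adjacent, not lexicographic order.
totals = {}
for key, value in ordered:
    totals[key] = totals.get(key, 0) + value
```

The resulting order is not lexicographic, which is acceptable for grouping and aggregation but not for jobs that need keys in sorted string order.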
  13. Why
       ◦ 4 factors, resulting in a large search space (2*2*3*2)
       ◦ Budget limit on Amazon EC2
     Greedy
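The arithmetic behind the pruning can be sketched as follows. The option counts (2, 2, 3, 2) come from the slide; the factor names, option names, and cost function are placeholders, and tuning one factor at a time is only one plausible reading of "Greedy", not a confirmed description of the authors' procedure.

```python
from itertools import product

# Option counts from the slide: 2 * 2 * 3 * 2. The factor and option names
# below are placeholders, not the actual settings from the study.
factors = {
    "factor_a": ["a1", "a2"],
    "factor_b": ["b1", "b2"],
    "factor_c": ["c1", "c2", "c3"],
    "factor_d": ["d1", "d2"],
}


def exhaustive_configs(factors):
    """Every combination of options: the full search space."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def greedy_search(factors, cost):
    """Tune one factor at a time, keeping the best option found so far."""
    config = {name: options[0] for name, options in factors.items()}
    evaluations = 0
    for name, options in factors.items():
        best_option, best_cost = None, None
        for option in options:
            trial = dict(config, **{name: option})
            trial_cost = cost(trial)
            evaluations += 1
            if best_cost is None or trial_cost < best_cost:
                best_option, best_cost = option, trial_cost
        config[name] = best_option
    return config, evaluations


def toy_cost(config):
    """Made-up cost (lower is better); stands in for a benchmark run."""
    return sum(int(option[1:]) for option in config.values())


full_space = exhaustive_configs(factors)
best, runs = greedy_search(factors, toy_cost)
```

With these counts the exhaustive space has 24 configurations, while the greedy pass needs 2 + 2 + 3 + 2 = 9 evaluations. The trade-off is that a greedy pass can miss the best combination when factors interact.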
  14. Greedy strategy (diagram)
       ◦ Benchmark: 3 datasets, 4 queries
       ◦ I/O mode: direct I/O, stream I/O
       ◦ Parser: Hadoop Writable, Google's ProtocolBuffer, Berkeley DB
       ◦ Different sort schemes
       ◦ In various architectures
  15. Introduction; Factors affecting performance of MR; Pruning search space; Implementation; Benchmark
  16. Hadoop 0.19.2 as code base
     Direct I/O
       ◦ Modification of the data node implementation
     Text decoder
       ◦ Immutable: same as DeWitt
       ◦ Mutable: by ourselves
     Binary decoder
       ◦ Hadoop: immutable Writable decoder; mutable using the Hadoop API, by ourselves
       ◦ Google Protocol Buffers: built-in compiler -> mutable; immutable by ourselves
       ◦ Berkeley DB: BDB binding API (mutable)
  17. Amazon EC2 (Elastic Compute Cloud)
       ◦ 7.5 GB memory
       ◦ 2 virtual cores
       ◦ 64-bit Fedora 8
     Tuning EC2 disk I/O by shifting peak time
     Hadoop settings
       ◦ HDFS block size: 512 MB
       ◦ JVM heap size: 1024 MB
  18. Introduction; Factors affecting performance of MR; Pruning search space; Implementation; Benchmark
  19. Results for different I/O modes
       ◦ Single node
       ◦ No-op job with map, without reduce
  20. Results for record parsing
       ◦ Run in a Java process instead of a MapReduce job
       ◦ Timing starts after the data is loaded into memory
     Mutable > immutable
       ◦ Mutable text > mutable binary
  21. Among Hadoop-based systems
       ◦ Cache factor
     Between Hadoop-based systems and parallel DB
       ◦ Close
  22. Selection task -> scan -> index
       ◦ Caching
       ◦ Indexing
  23. UserVisits GROUP BY SUBSTR(so
       ◦ Parsing: 2x faster
       ◦ Sorting: 20%-25% faster; not significant in the small-size aggregation task
  24. On decoding scheme; Comparison of tuned MR-based systems and parallel DB
  25. Cons
       ◦ Needs to be committed/forked to the Hadoop source code tree
       ◦ A complete framework is needed instead of miscellaneous patches.
       ◦ Various API support: CLI, Web rather than Java
     Future work
       ◦ Provide a query parser, optimizer, etc. to build a complete solution
       ◦ Elastic power-aware data-intensive Cloud
     http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz
     Tenzing: A SQL Implementation On The MapReduce Framework
