The Performance of MapReduce: An In-depth Study

Presentation Transcript

  • Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu (School of Computing, NUS). Presented by Tang Kai.
  •  Introduction Factors affecting Performance of MR Pruning search space Implementation Benchmark
  •  MapReduce-based systems are increasingly being used ◦ Simple yet expressive interface: Map() and Reduce() ◦ Flexible: storage-system independence ◦ Scalable ◦ Fine-grained fault tolerance
  •  Previous studies ◦ Fundamental differences: schema support, data access, fault tolerance ◦ Benchmark results: parallel DBs significantly outperform MR-based systems
  •  Is it not possible to have a flexible, scalable and efficient MapReduce-based system? This work: ◦ identifies several performance bottlenecks ◦ removes the bottlenecks and tunes performance using well-known engineering and database techniques. Conclusion ◦ overall performance improves by 2.5x-3.5x
  •  Introduction Factors affecting Performance of MR Pruning search space Implementation Benchmark
  •  7 steps of a MapReduce job 1) Map 2) Parse 3) Process 4) Sort 5) Shuffle 6) Merge 7) Reduce
  •  I/O mode Indexing Parsing Sorting
  •  I/O mode ◦ Direct I/O: read data from the disk directly (local only) ◦ Streaming I/O: stream data from the storage system through an inter-process communication scheme such as TCP/IP or JDBC (local and remote) ◦ Direct I/O outperforms streaming I/O by 10%-15%
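
To make the distinction concrete, here is a minimal Java sketch of the two I/O modes, assuming a local block file for direct I/O and a TCP connection to a storage daemon for streaming I/O; the class, method, and buffer choices are illustrative, not the actual HDFS code.

    // Minimal sketch of the two I/O modes compared in the talk; names and
    // layout are illustrative, not the actual Hadoop/HDFS implementation.
    import java.io.*;
    import java.net.Socket;

    public class IoModes {
        // Direct I/O: the reader opens the local block file itself and reads
        // straight from disk, avoiding an extra process hop (local data only).
        static long directRead(File blockFile) throws IOException {
            long bytes = 0;
            try (InputStream in = new BufferedInputStream(new FileInputStream(blockFile))) {
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) != -1; ) bytes += n;
            }
            return bytes;
        }

        // Streaming I/O: the data is served by a storage daemon and shipped to
        // the reader over an inter-process channel such as TCP/IP (works for
        // both local and remote data, but adds copying and protocol overhead).
        static long streamingRead(String host, int port) throws IOException {
            long bytes = 0;
            try (Socket s = new Socket(host, port);
                 InputStream in = new BufferedInputStream(s.getInputStream())) {
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) != -1; ) bytes += n;
            }
            return bytes;
        }
    }

The extra process hop and protocol framing on the streaming path are consistent with the reported 10%-15% gap for local reads.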
  •  Indexing ◦ The input of a MapReduce job is a set of files stored in a distributed file system, i.e. HDFS ◦ Range indexes ◦ Block-level indexes: the input HDFS files are not sorted, but each data chunk in the files is indexed by keys ◦ Database indexed tables: tables stored in database servers ◦ Indexing boosts the selection task by 2x-10x, depending on the selectivity
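
A minimal sketch of the block-level index idea, assuming each chunk records the minimum and maximum key it contains (the ChunkMeta layout is hypothetical): a selection query then scans only the chunks whose key range overlaps the predicate.

    // Illustrative sketch of block-level index pruning for a selection scan.
    import java.util.ArrayList;
    import java.util.List;

    public class BlockLevelIndex {
        static class ChunkMeta {
            final long offset;   // byte offset of the chunk in the HDFS file
            final long length;   // chunk length in bytes
            final long minKey;   // smallest key stored in the chunk
            final long maxKey;   // largest key stored in the chunk
            ChunkMeta(long offset, long length, long minKey, long maxKey) {
                this.offset = offset; this.length = length;
                this.minKey = minKey; this.maxKey = maxKey;
            }
        }

        // Return only the chunks that may contain keys in [lo, hi]; the map
        // task reads just these byte ranges instead of scanning the whole file.
        static List<ChunkMeta> chunksToScan(List<ChunkMeta> index, long lo, long hi) {
            List<ChunkMeta> hits = new ArrayList<>();
            for (ChunkMeta c : index) {
                if (c.maxKey >= lo && c.minKey <= hi) hits.add(c);
            }
            return hits;
        }
    }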
  •  Parsing: raw data -> <k,v> pairs ◦ Immutable decoding: produces read-only records (set once), a new object per tuple ◦ Mutable decoding: reuses a record object across tuples (see the sketch below) ◦ The mutable decoder is 10x faster and boosts the selection task by 2x overall
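
The following sketch contrasts the two decoding schemes on a hypothetical comma-separated text record (the Record class and the record format are assumptions): the immutable decoder allocates a fresh object per tuple, while the mutable decoder refills a single caller-supplied object.

    // Hedged sketch of immutable vs. mutable record decoding.
    public class Decoders {
        static class Record {
            long key;
            String value;
            void set(long key, String value) { this.key = key; this.value = value; }
        }

        // Immutable decoding: every input line allocates a fresh, read-only record.
        static Record decodeImmutable(String line) {
            int comma = line.indexOf(',');
            Record r = new Record();
            r.set(Long.parseLong(line.substring(0, comma)), line.substring(comma + 1));
            return r;
        }

        // Mutable decoding: the caller passes one record that is reused for
        // every line, so there is no per-tuple object allocation.
        static void decodeMutable(String line, Record reuse) {
            int comma = line.indexOf(',');
            reuse.set(Long.parseLong(line.substring(0, comma)), line.substring(comma + 1));
        }
    }

Avoiding the per-tuple allocation and the garbage collection it triggers is what makes the mutable path so much cheaper.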
  •  Sorting: map-side sorting affects the performance of aggregation ◦ The cost of key comparison is non-trivial ◦ Example: sorting intermediate records on sourceIP of the UserVisits table, a variable-length string ◦ String compare: byte-to-byte ◦ Fingerprint compare: compare integer fingerprints, falling back to the full key only on ties (see the sketch below) ◦ Fingerprint-based comparison is 4x-5x faster, 20%-25% faster overall
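
A minimal sketch of fingerprint-based comparison, assuming a cached integer hash per key (the Key class and the hash choice are illustrative):

    // Fingerprint-based key comparison: compare cheap integer fingerprints
    // first, and only fall back to byte-to-byte comparison when they collide.
    import java.util.Arrays;

    public class FingerprintCompare {
        static class Key {
            final byte[] bytes;       // e.g. the sourceIP string as bytes
            final int fingerprint;    // cached integer fingerprint
            Key(byte[] bytes) {
                this.bytes = bytes;
                this.fingerprint = Arrays.hashCode(bytes); // any cheap hash works
            }
        }

        static int compare(Key a, Key b) {
            // Fast path: a single integer comparison decides most pairs.
            if (a.fingerprint != b.fingerprint) {
                return Integer.compare(a.fingerprint, b.fingerprint);
            }
            // Slow path: equal fingerprints, compare the raw bytes.
            int n = Math.min(a.bytes.length, b.bytes.length);
            for (int i = 0; i < n; i++) {
                int d = (a.bytes[i] & 0xff) - (b.bytes[i] & 0xff);
                if (d != 0) return d;
            }
            return Integer.compare(a.bytes.length, b.bytes.length);
        }
    }

Note that this order is not lexicographic; it only guarantees that equal keys end up adjacent, which is sufficient for grouping and aggregation.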
  •  Why prune the search space? ◦ The 4 factors result in a large search space (2*2*3*2 = 24 combinations) ◦ Budget limit on Amazon EC2 ◦ Hence a greedy strategy
  •  Greedy strategy ◦ Benchmark: 3 datasets, 4 queries ◦ I/O mode: direct I/O vs. streaming I/O ◦ Different sort schemes ◦ Parsers: Hadoop Writable, Google's Protocol Buffer, Berkeley DB ◦ Evaluated on various architectures
  •  Introduction Factors affecting Performance of MR Pruning search space Implementation Benchmark
  •  Hadoop 0.19.2 as the code base ◦ Direct I/O: modification of the data node implementation ◦ Text decoder: immutable (same as DeWitt's), mutable (implemented by ourselves) ◦ Binary decoders: Hadoop (immutable Writable decoder; mutable implemented by ourselves on the Hadoop API, see the sketch below), Google Protocol Buffer (built-in compiler -> mutable; immutable by ourselves), Berkeley DB (BDB binding API, mutable)
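
As an illustration of the mutable binary decoding path on the Hadoop API, the sketch below reuses one Writable record across reads; the UserVisit fields are assumed for illustration, not the actual benchmark schema.

    // Hedged sketch of a mutable decoder built on Hadoop's Writable interface
    // (0.19-era API): readFields() overwrites the same object in place instead
    // of allocating a new record per tuple.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class UserVisitWritable implements Writable {
        private final Text sourceIP = new Text();
        private final Text destURL = new Text();
        private float adRevenue;

        @Override
        public void readFields(DataInput in) throws IOException {
            // Mutable decoding: the existing Text buffers are refilled in place.
            sourceIP.readFields(in);
            destURL.readFields(in);
            adRevenue = in.readFloat();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            sourceIP.write(out);
            destURL.write(out);
            out.writeFloat(adRevenue);
        }
    }

SequenceFile.Reader.next(key, value) follows the same pattern: the caller supplies the objects to be refilled.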
  •  Amazon EC2 (Elastic Compute Cloud) nodes ◦ 7.5GB memory ◦ 2 virtual cores ◦ 64-bit Fedora 8 ◦ EC2 disk I/O tuned by shifting peak time ◦ Hadoop settings: HDFS block size 512MB, JVM heap size 1024MB
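
The two Hadoop settings map to the 0.19-era configuration keys dfs.block.size and mapred.child.java.opts; the snippet below is a sketch of setting them programmatically (they would normally live in hadoop-site.xml), not the authors' exact setup.

    // Sketch of applying the reported Hadoop settings to a job configuration
    // (Hadoop 0.19-era property names).
    import org.apache.hadoop.mapred.JobConf;

    public class BenchmarkConf {
        public static JobConf configure(JobConf conf) {
            // HDFS block size: 512MB (dfs.block.size takes bytes).
            conf.setLong("dfs.block.size", 512L * 1024 * 1024);
            // Heap size of each task JVM: 1024MB.
            conf.set("mapred.child.java.opts", "-Xmx1024m");
            return conf;
        }
    }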
  •  Introduction Factors affecting Performance of MR Pruning search space Implementation Benchmark
  •  Results for different I/O modes ◦ Single node ◦ No-op job with a map phase but no reduce phase
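
Such a no-op, map-only job can be expressed in the old (0.19-era) Hadoop API roughly as follows; this is a sketch, not the authors' benchmark code. IdentityMapper passes records through unchanged, and zero reduce tasks disables the shuffle and reduce phases, so run time is dominated by reading and parsing the input.

    // Sketch of a no-op, map-only job in the Hadoop 0.19-era API.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class NoOpJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(NoOpJob.class);
            conf.setJobName("noop-scan");
            conf.setMapperClass(IdentityMapper.class);
            conf.setNumReduceTasks(0);                 // map-only: no shuffle, no reduce
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }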
  •  Results for record parsing ◦ Run in a plain Java process instead of a MapReduce job ◦ Timing starts after the data is loaded into memory ◦ Mutable decoders outperform immutable ones ◦ Mutable text decoding outperforms mutable binary decoding
  •  Comparison among Hadoop-based configurations: differences due to the cache factor ◦ Comparison between the Hadoop-based system and the parallel DB: results are close
  •  Selection task: from full scan to index access ◦ Caching ◦ Indexing
  • Aggregation on UserVisits, GROUP BY SUBSTR(sourceIP, ...) ◦ Parsing: 2x faster ◦ Sorting: 20%-25% faster ◦ Gains are not significant for the small-size aggregation task
  •  Results by decoding scheme ◦ Comparison of the tuned MR-based system and the parallel DB
  •  Cons ◦ Changes need to be committed or forked into the Hadoop source code tree ◦ A complete framework is needed instead of miscellaneous patches ◦ Broader API support is needed: CLI and Web interfaces rather than Java only. Future work ◦ Provide a query parser, optimizer, etc. to build a complete solution ◦ Elastic power-aware data-intensive Cloud ◦ Benchmark code: http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz ◦ See also: Tenzing: A SQL Implementation On The MapReduce Framework