MapReduce: A useful parallel tool that still has room for improvement


January 5, 2012

Kyong-Ha Lee
bart7449@gmail.com

Copyright © KAIST Database Lab. All Rights Reserved.

Outline
Three topics that I will discuss:
♦ Anatomy of the MapReduce framework
– Basic principles of the MapReduce framework
– Not much discussion of implementation details, but I will be
happy to discuss them if there are any questions.
♦ A brief survey of studies on improving the
conventional MapReduce framework
♦ Research projects ongoing at KAIST


Big Data
♦ A data set too large to work with using an on-hand
DBMS on a single node

♦ Data growth challenges are defined as*
– Volume (amount of data)
– Velocity (speed of data in/out)
– Variety (range of data types, sources)

* Doug Laney, “3D Data Management: Controlling Data Volume, Velocity and Variety”, 2001


Importance and Impact
♦ “Data center is the computer. If MapReduce is the first
instruction of the data center computer, I can’t wait to
see the rest of the instruction set, as well as the data
center programming language, the data center operating
system, the data center storage systems, and more.”
- David A. Patterson. Technical perspective: the data center is the
computer. CACM, 51(1):105, 2008.
♦ A list of institutions that are using Hadoop, an open-source
Java implementation of MapReduce
♦ Its scholastic impact!

as of Dec 31, 2011

Usage Statistics Over Time at Google
                                 Aug ’04   Mar ’06   Sep ’07   Sep ’09
Number of jobs                   29K       171K      2,217K    3,467K
Average completion time (secs)   634       874       395       475
Machine years used               217       2,002     11,081    25,562
Input data read (TB)             3,288     52,254    403,152   544,130
Intermediate data (TB)           758       6,743     34,774    90,120
Output data written (TB)         193       2,970     14,018    57,520
Average worker machines          157       268       394       488
* Source: J. Dean, Design, Lessons, Advices from Building Large Distributed System, Keynote, LADIS 2009.

* Hadoop took first place in the GraySort benchmark, sorting 100 TB with over
3,800 nodes – Winning a 60 Second Dash with a Yellow Elephant, http://sortbenchmark.org/Yahoo2009.pdf

Single Node Architecture

[Figure: a single node consisting of CPU, memory, and disk]


Commodity Clusters
♦ Web data sets can be very large
– Tens to hundreds of terabytes
– At Facebook, almost 6TB of new log data is collected every day,
with 1.7PB of log data accumulated over time*
*source: A comparison of join algorithms for log processing in MapReduce, SIGMOD’10

♦ We cannot store and process that size of data on a single
machine in time
♦ Standard architecture emerging:
– Cluster of commodity Linux nodes
– Gigabit Ethernet interconnects
♦ How to organize computations on this architecture?
– Mask issues such as hardware failure

Cluster Architecture
[Figure: nodes (CPU, memory, disk) grouped into racks, each rack behind a switch; 1 Gbps between any pair of nodes in a rack; 8 Gbps backbone between racks]

Yahoo cluster used for GraySort:
• Each rack contains 40 nodes
• 2 quad-core Xeons @ 2.5 GHz per node
• 8GB RAM, 4 SATA HDD

The Need for Stable Storage
♦ Problem: if nodes can fail, how can we store data
persistently?
– Cheap nodes fail frequently when you have many of them
» MTBF for 1 node = 3 years
» MTBF for 1000 nodes = 1 day on average
– Fault tolerance must be built into the system
♦ Answer: Distributed File System
– Provides global file namespace
– Google GFS; Hadoop HDFS
– Typical usage pattern
» Huge files (100s of GB to TB)
» Data is rarely updated in place
» Reads and appends are common I/O patterns
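
The MTBF figures on this slide can be checked with a quick calculation. The sketch below assumes that nodes fail independently and at a constant rate, which the slide does not state; the function name `cluster_mtbf_days` is illustrative only.

```python
# Back-of-the-envelope cluster MTBF, assuming independent nodes with a
# constant failure rate (an illustrative assumption, not from the slides).
DAYS_PER_YEAR = 365


def cluster_mtbf_days(node_mtbf_years: float, num_nodes: int) -> float:
    """Expected number of days between failures anywhere in the cluster."""
    return node_mtbf_years * DAYS_PER_YEAR / num_nodes


if __name__ == "__main__":
    print(cluster_mtbf_days(3, 1))     # a single node: about 1095 days
    print(cluster_mtbf_days(3, 1000))  # 1000 nodes: roughly one day
```

Under these assumptions, a 1000-node cluster sees a failure about once a day, which is why fault tolerance has to live in the storage layer rather than in each application.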

GFS Design

♦ Master manages metadata
♦ Data transfers happen directly between clients/chunk servers
♦ Files broken into chunks (typically 64 MB)
♦ Data replication (typically 3 replicas, Primary copy)
♦ Immutable data blocks
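
To get a feel for the chunking and replication numbers above, here is a small sketch that computes how many chunks a file occupies and how much raw storage it consumes. It uses only the typical values quoted on the slide (64 MB chunks, 3 replicas); the helper names are hypothetical and this is not how GFS itself exposes these quantities.

```python
import math

CHUNK_SIZE_MB = 64  # typical chunk size quoted on the slide
REPLICAS = 3        # typical replication factor quoted on the slide


def chunk_count(file_size_mb: int, chunk_size_mb: int = CHUNK_SIZE_MB) -> int:
    """Number of chunks needed to hold a file (the last chunk may be partial)."""
    return math.ceil(file_size_mb / chunk_size_mb)


def raw_storage_mb(file_size_mb: int, replicas: int = REPLICAS) -> int:
    """Disk space consumed across the cluster once every chunk is replicated."""
    return file_size_mb * replicas


if __name__ == "__main__":
    one_tb_in_mb = 1024 * 1024
    print(chunk_count(one_tb_in_mb))     # chunks the master must track for 1 TB
    print(raw_storage_mb(one_tb_in_mb))  # raw MB used by a 1 TB file
```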

Google Cluster Environment
♦ A cluster is 1000s of machines, typically in one or a handful of
configurations
♦ File system (GFS) + cluster scheduling system are core services
♦ Typically 100s to 1000s of active jobs (some w/1 task, some
w/1000s)
♦ Mix of batch and low-latency, user-facing production jobs


Motivation of MapReduce’s Design
♦ Large-Scale Data Processing
– Want to use 1,000s of CPUs
» But don’t want hassle of managing things

♦ MapReduce Architecture provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates


What is MapReduce?
♦ Both a programming model and a framework for
massively parallel processing of large datasets across
many low-end nodes
– Popularized and controversially patented by Google Inc.
– Analogous to Group-By-Aggregation in DBMS
♦ Easy to distribute a job across nodes
– Implements data parallelism
♦ No hassle of managing jobs across nodes
♦ Nice retry/failure semantics
♦ Runtime scheduling with speculative execution


Programming model : Map/Reduce
♦ Input: a set of key/value pairs
♦ A user implements two functions:
– map(key1, value1) → (key2, value2)
– reduce(key2, list(value2)) → (key3, value3)
♦ (key2, value2) is an intermediate key/value pair
♦ Output is the set of (k3,v3) pairs

♦ Many problems can be phrased in this way
– but not all of them.
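
The model can be sketched as a tiny single-process driver. This is only an illustration of the two user-supplied functions and the grouping step between them, not a distributed implementation; `run_mapreduce` and the sales example are hypothetical names and data chosen to echo the group-by-aggregation analogy.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (key1, value1), group by key2, then reduce."""
    groups = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in map_fn(key1, value1):
            groups[key2].append(value2)  # intermediate (key2, value2) pairs
    output = []
    for key2 in sorted(groups):  # one reduce call per distinct key2
        output.extend(reduce_fn(key2, groups[key2]))
    return output


# Group-by-aggregation example: total sales per department.
def map_sale(order_id, sale):
    department, amount = sale
    yield department, amount


def reduce_total(department, amounts):
    yield department, sum(amounts)


if __name__ == "__main__":
    sales = [(1, ("books", 12)), (2, ("toys", 5)), (3, ("books", 8))]
    print(run_mapreduce(sales, map_sale, reduce_total))
    # [('books', 20), ('toys', 5)]
```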


Data
♦ Input and final output are stored on DFS
– Scheduler tries to schedule map tasks “close” to physical
storage location of input data
♦ Intermediate results are stored on local disks of map
and reduce workers
♦ Outputs of a MR job often become inputs of another
MR job
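
Chaining can be illustrated with two small jobs where the second consumes the first one's output. This is a single-process sketch with hypothetical names (`run_job`, `count_map`, and so on); in a real deployment the intermediate result would be written to and read back from the DFS rather than passed in memory.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Minimal in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for key, value in records:
        for key2, value2 in map_fn(key, value):
            groups[key2].append(value2)
    return [pair for key2 in sorted(groups) for pair in reduce_fn(key2, groups[key2])]


# Job 1: count occurrences of each word.
def count_map(doc_name, text):
    for word in text.split():
        yield word, 1


def count_reduce(word, ones):
    yield word, sum(ones)


# Job 2: group words by how often they occur.
def invert_map(word, count):
    yield count, word


def invert_reduce(count, words):
    yield count, sorted(words)


if __name__ == "__main__":
    docs = [("d1", "a b a"), ("d2", "b c")]
    counts = run_job(docs, count_map, count_reduce)        # output of job 1
    by_count = run_job(counts, invert_map, invert_reduce)  # input of job 2
    print(counts)    # [('a', 2), ('b', 2), ('c', 1)]
    print(by_count)  # [(1, ['c']), (2, ['a', 'b'])]
```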


Parallel Execution across Nodes
1. Partition input key/value pairs into chunks and then
run map() tasks in parallel
2. After all map()s are complete, consolidate all emitted
values for each unique emitted key
3. Now partition space of output map keys, and run
reduce() in parallel
4. In reduce(), values for each key are grouped together and
then aggregated; the reduced output is stored on DFS
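
Step 3 requires that all values for a given key reach the same reduce task. One common way to achieve this is to hash each key into one of R partitions; the sketch below assumes hash partitioning, which the slide itself does not specify, and the names `partition` and `shuffle` are illustrative.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index.

    crc32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings, which is salted per interpreter run.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from all map tasks to its reducer.

    map_outputs holds one list of (key, value) pairs per map task.
    Returns one {key: [values]} dictionary per reducer.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for task_output in map_outputs:
        for key, value in task_output:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


if __name__ == "__main__":
    outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
    for index, bucket in enumerate(shuffle(outputs, 2)):
        print(index, dict(bucket))
```

Because the partition function depends only on the key, each reducer can process its bucket independently of the others.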


Example : Word Count
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
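
For readers who want to run the Word Count example, the following is a minimal single-process Python rendering of it. The grouping loop in `word_count` stands in for the framework's shuffle; that helper and the sample document are assumptions added for illustration.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    # key: a word; values: an iterator over counts
    result = 0
    for count in counts:
        result += count
    yield word, result


def word_count(documents):
    """Run map, group intermediate pairs by word, then reduce."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, one in map_fn(name, text):
            grouped[word].append(one)
    return dict(pair for word in grouped for pair in reduce_fn(word, grouped[word]))


if __name__ == "__main__":
    print(word_count({"a.txt": "to be or not to be"}))
    # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```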


Execution: The Map Step
Input Intermediate
key-value pairs key-value pairs

k v
map
k v
k v
map
k v
k v

… …

k v k v


Execution: The Reduce Step
Output
Intermediate Key-value groups key-value pairs
key-value pairs
reduce
k v k v v v k v
reduce
k v k v v k v
group

k v
… …
…

k v k v k v


Combiner
♦ Often a map task will produce many pairs of the form
(k,v1), (k,v2), … for the same key k
– E.g., popular words in Word Count

♦ A combiner can save network time by pre-aggregating at
the mapper
– combine(k1, list(v1)) → v2
– Usually the same as the reduce function

♦ Works only if the reduce function is commutative and
associative
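
The effect of a combiner can be seen by counting how many pairs a single map task would send over the network with and without local pre-aggregation. This is a sketch with hypothetical helper names (`map_words`, `combine`), using the Word Count case mentioned above.

```python
from collections import Counter


def map_words(text):
    """Map step of Word Count: one (word, 1) pair per occurrence."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Pre-aggregate one map task's output locally by summing per key."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


if __name__ == "__main__":
    raw = map_words("to be or not to be")
    combined = combine(raw)
    print(len(raw), len(combined))  # 6 pairs shrink to 4
    print(combined)                 # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Summation is commutative and associative, so the reducer gets the same totals either way. Averaging is the usual counterexample: a mean of partial means is not the overall mean unless the combiner ships (sum, count) pairs instead.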


Example: Building an Inverted Index
♦ Input: (filename, text) records
♦ Output: list of files containing each word

♦ Map:
    foreach word in text.split():
        emit(word, filename)

♦ Combine: uniquify filenames for each word

♦ Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))


Input files:
    hamlet.txt: to be or not to be
    12th.txt: be not afraid of greatness

Map output:
    to, hamlet.txt
    be, hamlet.txt
    or, hamlet.txt
    not, hamlet.txt
    be, 12th.txt
    not, 12th.txt
    afraid, 12th.txt
    of, 12th.txt
    greatness, 12th.txt

Reduce output:
    afraid, (12th.txt)
    be, (12th.txt, hamlet.txt)
    greatness, (12th.txt)
    not, (12th.txt, hamlet.txt)
    of, (12th.txt)
    or, (hamlet.txt)
    to, (hamlet.txt)

*source: PARLab Parallel Boot Camp, Matei Zaharia
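
The inverted-index example can be reproduced with a short single-process sketch. The map, combine, and reduce functions follow the slide; the `inverted_index` driver that groups pairs by word is an assumption standing in for the framework.

```python
from collections import defaultdict


def map_fn(filename, text):
    for word in text.split():
        yield word, filename


def combine_fn(pairs):
    # Uniquify (word, filename) pairs produced by one map task.
    return sorted(set(pairs))


def reduce_fn(word, filenames):
    return word, sorted(set(filenames))


def inverted_index(files):
    """Map and combine each file, group by word, then reduce."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for word, name in combine_fn(map_fn(filename, text)):
            grouped[word].append(name)
    return [reduce_fn(word, grouped[word]) for word in sorted(grouped)]


if __name__ == "__main__":
    files = {
        "hamlet.txt": "to be or not to be",
        "12th.txt": "be not afraid of greatness",
    }
    for word, names in inverted_index(files):
        print(word, names)
```

Running it lists seven words, with "be" and "not" mapped to both files, matching the example output on the slide.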


Distributed Execution Review
[Figure: input blocks 1 to n are read by mappers (map, local sort, combiner), which write intermediate results; after a barrier, reducers pull the intermediate results (copy/shuffle), merge them, and run reduce to produce the output]


System Behavior on a Single Node

*Source: A comparison of join algorithms for log processing in MR, SIGMOD’10


Experimental Results

*Source: A platform for scalable one-pass analytics using MapReduce, SIGMOD’11

Fault Tolerance
♦ If tasks fail, they are executed again on another node
– Detect failure via periodic heartbeats
– Re-execute in-progress map tasks
– Re-execute in-progress reduce tasks
♦ If a node crashes:
– Re-launch its current tasks on other nodes
– Re-run any maps the node previously ran
» Necessary because their output files were lost along with the
crashed node
♦ If a task is going slowly (straggler):
– Launch a second copy of the task on another node (“speculative
execution”)
– Take the output of whichever copy finishes first, and kill the
other
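
The node-crash rules above can be sketched as two small functions: one that detects dead nodes from missed heartbeats, and one that decides which tasks to re-run. The data structure, names, and timeout are hypothetical. The sketch also assumes that completed reduce output is not re-run because it is already on the DFS, which follows from the earlier slide stating that final output is stored there.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    kind: str   # "map" or "reduce"
    node: str   # node the task ran (or is running) on
    state: str  # "in_progress" or "completed"


def failed_nodes(last_heartbeat, now, timeout):
    """Nodes whose most recent heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def tasks_to_rerun(tasks, dead):
    """Select tasks that must be rescheduled after the given nodes crash."""
    rerun = []
    for task in tasks:
        if task.node not in dead:
            continue
        if task.state == "in_progress":
            rerun.append(task.task_id)  # re-launch current tasks elsewhere
        elif task.kind == "map":
            # Completed map output lived on the dead node's local disk.
            rerun.append(task.task_id)
        # Completed reduce output is assumed to be safe on the DFS.
    return rerun


if __name__ == "__main__":
    tasks = [
        Task("m1", "map", "n1", "completed"),
        Task("m2", "map", "n2", "completed"),
        Task("r1", "reduce", "n1", "completed"),
        Task("r2", "reduce", "n1", "in_progress"),
    ]
    dead = failed_nodes({"n1": 0.0, "n2": 95.0}, now=100.0, timeout=10.0)
    print(dead, tasks_to_rerun(tasks, dead))
```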


Criticism
♦ D. DeWitt and M. Stonebraker harshly criticized MapReduce
as “a major step backwards” [5].
– They first regarded it as a simple Extract-Transform-Load tool.
♦ A technical comparison was done by Pavlo et al. [6]
– Compared MapReduce with a commercial row-wise DBMS and Vertica
– After that, technical debates between researchers and
practitioners were triggered
♦ CACM welcomed this technical debate, inviting both
sides in Communications of the ACM, Jan 2010 [7,8]


*Source: A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD’09


Advantages
♦ Simple and easy to use
– Users code only Map() and Reduce()
– Users need not consider how to distribute their job
♦ Flexible
– No data model, no schema
– Users can treat any irregular data with MapReduce
♦ Independent of the storage
♦ Fault tolerance
– Users need not worry about faults while a job is running
– After a failure, a job does not have to restart from Map()
♦ High scalability
– Easy to scale-out


Caveats
♦ A single fixed dataflow
♦ Lack of schema, index, and high-level language
– Requires data parsing and full scans
– No separation of the schema from applications
♦ Sacrifice of disk I/O for fault-tolerance
– Materialization of intermediate results on local disks
– Three replicas on DFS
– I/O inefficient!
♦ Blocking operators
– Caused by merge-sort for grouping values
– Reduce begins after all map tasks end
♦ A simple heuristic runtime scheduling with speculative execution
♦ Very young!
– Few third party tools and low efficiency

A Short List of Related Studies
♦ Sacrifice of disk I/O for fault-tolerance
– Main difference against DBMS
♦ A single fixed dataflow
– Dryad, SCOPE, Nephele/PACT
– Map-Reduce-Merge for binary operators
– Twister and HaLoop for iterative workload
– Map-Join-Reduce and some join techniques
♦ No schema
– Protocol buffer, JSON, XML, …
♦ No indexing
– HadoopDB, Hadoop++
♦ No high-level language
– Hive, Sawzall, SCOPE, Pig Latin, …, Jaql, Dryad/LINQ
♦ Blocking operators
– MapReduce Online, Mortar
♦ A simple heuristic scheduling
– LATE, …
♦ Relatively poor performance
– Adaptive and automatic performance tuning
– Work sharing/Multiple jobs
» MRShare: Multi query processing
» Hive, Pig Latin
» fair/capacity sharing, ParaTimer
– Map-Join-Reduce
– Join algorithms in MapReduce [Blanas-SIGMOD’10]
♦ Cowork with other tools
– SQL/MapReduce, HadoopDB, Teradata EDW’s Hadoop integration, …
♦ DBMS based on MR
– Cheetah, Osprey, RICARDO (analytic tool)
♦ Other complements
– DREMEL, …


A Brief Bibliographic Survey

• We intend to assist DB and
open source communities in
understanding various technical
aspects of the MapReduce
framework

• SIGMOD Record 40(4):11-20, Dec 2011


Summary
♦ MR is simple, but provides good scalability and fault-
tolerance for massive data processing
♦ MR is unlikely to replace DBMSs
♦ MR complements DBMS with scalable and flexible
parallel processing for various data analysis
♦ I/O efficiency of MapReduce still needs to be
addressed for it to be applied more successfully
– sort-merge based grouping and frequent checkpoints
♦ Many application domains and room for improvement


Other Research Challenges and
Issues
♦ Parallelizing conventional algorithms
– that require filtering-then-aggregation.
» But, not good for ad-hoc queries
♦ Performance Improvements
– Does not utilize modern HW features well
» Multi-core, GPGPU, SSD, etc
– Some caveats still exist in the model
» iterative and incremental processing
– Self-tuning
» 150+ tuning knobs in Hadoop
» Long-running analysis and batch processing


Thank you!
Questions or comments?


References
1. David A. Patterson. Technical perspective: the data center is the computer. Communications of the ACM, 51(1):105, 2008.
2. Hadoop users list, http://wiki.apache.org/hadoop/PoweredBy
3. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI 2004, and Communications of the ACM, Vol. 51, No. 1, pp. 107-113, 2008.
4. S. Ghemawat et al. The Google File System. ACM SIGOPS Operating Systems Review, Vol. 37, No. 5, pp. 29-43, 2003.
5. David J. DeWitt and Michael Stonebraker. MapReduce: a major step backwards. Database column blog, 2008.
6. Andrew Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. In Proceedings of SIGMOD 2009.
7. Michael Stonebraker et al. MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, Vol. 53, No. 1, pp. 64-71, Jan 2010.
8. Jeffrey Dean and Sanjay Ghemawat. MapReduce: A Flexible Data Processing Tool. Communications of the ACM, Vol. 53, No. 1, pp. 72-77, Jan 2010.
9. M. Stonebraker. The case for shared-nothing. Data Engineering Bulletin, 9(1):4-9, 1986.
10. D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6):85-98, 1992.
11. B. Schroeder et al. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), pages 1–16, 2007.
12. B. Schroeder et al. DRAM errors in the wild: a large-scale field study. In Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, pages 193–204. ACM New York, NY, USA, 2009.


13. G.M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pages 483–485. ACM, 1967.
14. J.L. Gustafson. Reevaluating Amdahl’s law. Communications of the ACM, 31(5):532–533, 1988.
15. A.H. Karp and H.P. Flatt. Measuring parallel processor performance. Communications of the ACM, 33(5):539–543, 1990.
16. Apache Foundation. MapReduce V0.21.0 Tutorial, http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html, 2010.
17. Incremental MapReduce. TV’s cobweb blog, http://eagain.net/articles/incremental-mapreduce/
18. Y. Bu et al. HaLoop: Efficient Iterative Data Processing on Large Clusters. In Proceedings of VLDB’10.
19. J. Ekanayake et al. Twister: A Runtime for Iterative MapReduce. In Proceedings of ACM HPDC’10, pp. 810-818, 2010.
20. M. Isard et al. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, page 72. ACM, 2007.
21. R. Chaiken et al. Scope: easy and efficient parallel processing of massive data sets. PVLDB: Proceedings of Very Large Data Base Endowment, 1(2):1265–1276, 2008.
22. C. Olston et al. Pig Latin: a not-so-foreign language for data processing. In SIGMOD ’08: Proceedings of ACM SIGMOD Conference, pages 1099–1110, 2008.
23. A. Gates et al. Building a high level dataflow system on top of MapReduce: The pig experience. PVLDB: Proceedings of Very Large Data Base Endowment, 2(2):1414–1425, 2009.
24. R. Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming, Vol. 13, No. 4, pp. 277-298, 2005.
25. A. Thusoo et al. Hive - A Warehousing Solution over a Map-Reduce Framework. PVLDB: Proceedings of Very Large Data Base Endowment, 2009.
26. A. Thusoo et al. Hive - a petabyte scale data warehouse using hadoop. In Proceedings of ICDE 2010.


27. Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI ’08: Proceedings of Symposium on Operating System Design and Implementation, 2008.
28. M. Isard et al. Distributed Data-Parallel Computing Using a High-Level Programming Language. In Proceedings of SIGMOD 2009.
29. D. Logothetis et al. Ad-Hoc Data Processing in the Cloud. In Proceedings of VLDB’08.
30. T. Condie et al. MapReduce Online. In Proceedings of USENIX NSDI, 2010.
31. A. Alexandrov et al. Massively Parallel Data Analysis with PACTs on Nephele. In Proceedings of VLDB, Vol. 3, No. 2, 2010.
32. D. Battré et al. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In Proceedings of SoCC 2010.
33. Eric Friedman et al. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user defined functions. PVLDB: Proceedings of Very Large Data Base Endowment, 2(2):1402–1413, 2009.
34. A. Abouzeid et al. HadoopDB: An architectural hybrid of mapreduce and dbms technologies for analytical workloads. VLDB’09: Proceedings of Very Large Data Base Endowment, pages 1084–1095, 2009.
35. Y. Xu et al. Integrating Hadoop and Parallel DBMS. In Proceedings of ACM SIGMOD, pp. 969-974, 2010.
36. S. Das et al. Ricardo: Integrating R and Hadoop. In Proceedings of ACM SIGMOD, pp. 987-998, 2010.
37. J. Dittrich et al. Hadoop++: Making a Yellow Elephant Run like a Cheetah (Without it Even Noticing). In Proceedings of VLDB’10.
38. S. Chen. Cheetah: A High Performance Custom Data Warehouse on top of MapReduce. In Proceedings of VLDB, Vol. 3, No. 2, 2010.
39. S. Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. In Proceedings of VLDB, Vol. 3, No. 1, 2010.


40. C. Yang et al. Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database. In Proceedings of IEEE ICDE, pp. 657-668, 2010.
41. M. Zaharia et al. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of USENIX OSDI’08.
42. H. Yang et al. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In Proceedings of SIGMOD’07.
43. D. Jiang et al. Map-Join-Reduce: Towards Scalable and Efficient Data Analysis on Large Clusters. IEEE Transactions on Knowledge and Data Engineering, preprint.
44. S. Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce. In Proceedings of SIGMOD’10.
45. F. N. Afrati et al. Optimizing Joins in a Map-Reduce Environment. In Proceedings of EDBT 2010.
46. R. Vernica et al. Efficient Parallel Set-Similarity Joins Using MapReduce. In Proceedings of SIGMOD’10.
47. T. Nykiel et al. MRShare: Sharing Across Multiple Queries in MapReduce. In Proceedings of VLDB’10.
48. K. Morton et al. Estimating the progress of MapReduce Pipelines. In Proceedings of IEEE ICDE, pp. 681-684, 2010.
49. K. Morton et al. ParaTimer: A Progress Indicator for MapReduce DAGs. In Proceedings of ACM SIGMOD, pp. 507-518, 2010.
50. S. Papadimitriou et al. DisCo: Distributed Co-clustering with Map-Reduce. In Proceedings of IEEE ICDM, pp. 512-521, 2009.
51. C. Wang et al. MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of ACM SIGMOD, pp. 1119-1122, 2010.
52. S. Babu. Towards Automatic Optimization of MapReduce Programs. In Proceedings of ACM SoCC’10.
53. D. Jiang et al. The Performance of MapReduce: An In-depth Study. In Proceedings of VLDB’10.
54. E. Jahani et al. Automatic Optimization for MapReduce Programs. Proceedings of VLDB, Vol. 4, No. 6, 2011.


55. B. Catanzaro et al. A Map Reduce Framework for Programming Graphic Processors. In Proceedings of Workshop on Software Tools for Multicore Systems, 2008.
56. B. He et al. Mars: A MapReduce framework on graphic processors. In Proceedings of PACT’08, pp. 260-269, 2008.
57. W. Jiang et al. A Map-Reduce System with an Alternate API for Multi-Core Environments. In Proceedings of 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
58. Jeff Dean. Design, Lessons, Advices from Building Large Distributed System. Keynote, LADIS 2009.
59. Willis Lang et al. Energy Management for MapReduce Clusters. Proceedings of VLDB, Vol. 3, No. 1, 2010.
60. W. Xiong et al. Energy Efficient Data Intensive Distributed Computing. Data Engineering Bulletin, Vol. 34, No. 1, pp. 24-33, March 2011.
61. E. Anderson et al. Efficiency Matters! ACM SIGOPS Operating Systems Review, 44(1):40-45, 2010.
62. Jimmy Lin and Chris Dyer. Data-Intensive Text Processing. Book.
63. G. Malewicz et al. Pregel: A System for Large-Scale Graph Processing. In Proceedings of PODC’09.
64. J. Ekanayake et al. MapReduce for Data Intensive Scientific Analyses. In Proceedings of IEEE eScience’08.
65. K. B. Hall et al. MapReduce/BigTable for Distributed Optimization. NIPS LCCC Workshop 2010.
66. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
67. M. C. Schatz. CloudBurst: Highly sensitive read mapping with MapReduce. Bioinformatics, Vol. 25, No. 11.
68. B. Fan et al. DiskReduce: RAID for data-intensive scalable computing. In Proceedings of the 4th Annual workshop on Petascale Data Storage, pp. 6-10, 2009.
69. K. Lee et al. Parallel data processing with MapReduce: a survey. SIGMOD Record, Vol. 40, No. 4, pp. 11-20, 2011.

