Hadoop uses the MapReduce programming model to process and generate large data sets across clusters of machines. The model splits each job into independent map and reduce tasks: mappers process input key/value pairs to produce intermediate key/value pairs, and reducers then merge all intermediate values associated with the same key. This lets computations over large datasets be parallelized and distributed automatically.
7. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.
To appear in OSDI 2004

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system.
11. MapReduce history
"A programming model and implementation for processing and generating large data sets"
34. Server Funerals
No pagers go off when machines die
Report of dead machines once a week
Clean out the carcasses
35. Robustness attributes prevented from bleeding into application code
Data redundancy
Node death
Retries
Data geography
Parallelism
Scalability
67. Not right now...
Do you expect to tackle a very
large problem before you:
change jobs
change industries
retire
die
see the heat death of the universe
68. In the next decade, the class (scale) of
problems we are aiming to solve will
grow exponentially.
76. The process
Every item in the dataset is a parallel candidate for Map
Map(k1,v1) -> list(k2,v2)
The framework collects and groups pairs from all lists by key
Reduce runs in parallel on each group
Reduce(k2, list(v2)) -> list(v3)
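To make the two signatures concrete, here is a minimal word-count sketch in Java against the Hadoop Mapper/Reducer API. It is an illustrative example rather than code from the deck, and the class names (WordCount, TokenMapper, SumReducer) are my own.

// Word-count sketch: map(offset, line) -> list((word, 1)); reduce(word, [1, 1, ...]) -> (word, total)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: turn each input line into (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit an intermediate (k2, v2) pair
                }
            }
        }
    }

    // Reduce phase: sum all the 1s grouped under the same word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                 // combine every value grouped under k2
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

The middle step of the list above is done by the framework itself: after the map tasks finish, it shuffles and sorts the (word, 1) pairs so each reducer sees one key together with all of its values.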
77. FP For the Grid
MapReduce: functional programming on a distributed processing platform
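The same map-then-group-then-reduce shape exists in ordinary functional code. The sketch below (my own example, not from the deck) counts pages per host on a single JVM with java.util.stream, which is the local analogue of what Hadoop distributes across a cluster.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FunctionalAnalogy {
    public static void main(String[] args) {
        // Each entry stands in for one web request log line: "host path".
        List<String> requestLog = List.of(
                "example.com /index.html",
                "example.com /about.html",
                "other.org /index.html");

        Map<String, Long> pagesPerHost = requestLog.parallelStream()
                .map(line -> line.split("\\s+")[0])                             // map: line -> host
                .collect(Collectors.groupingBy(h -> h, Collectors.counting())); // group by key and reduce

        System.out.println(pagesPerHost);   // e.g. {example.com=2, other.org=1}
    }
}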
110. Pig Basics
Yahoo-authored add-on DSL & tool
Origin: Pig Latin
Analyzes large data sets
High-level language for expressing data
analysis programs
111. Pig Questions
Ask big questions of unstructured data
How many ___?
Should we ____?
Decide on the questions you want to ask long after you've collected the data.
112. Pig Sample
A = load 'passwd' using PigStorage(':');  -- split each line of 'passwd' on ':'
B = foreach A generate $0 as id;          -- keep only the first field as 'id'
dump B;                                   -- print the ids to the console
store B into 'id.out';                    -- also write them to 'id.out'
133. EMR Functions
RunJobFlow: Creates a job flow request, starts EC2 instances, and begins processing.
DescribeJobFlows: Provides the status of your job flow request(s).
AddJobFlowSteps: Adds additional steps to an already running job flow.
TerminateJobFlows: Terminates a running job flow and shuts down all of its instances.
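A sketch of driving those four actions from the AWS SDK for Java (v1) might look like the following. The client and request class names reflect my understanding of that SDK, and the credentials, bucket, jar, and instance settings are placeholders, so treat this as an outline rather than a verified recipe.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.model.TerminateJobFlowsRequest;

public class EmrJobFlowSketch {
    public static void main(String[] args) {
        // Placeholder credentials; real code would load these securely.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // RunJobFlow: create the job flow, start EC2 instances, run the first step.
        StepConfig firstStep = new StepConfig()
                .withName("word-count")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://my-bucket/wordcount.jar")   // placeholder jar and paths
                        .withArgs("s3://my-bucket/input", "s3://my-bucket/output"));
        RunJobFlowRequest run = new RunJobFlowRequest()
                .withName("demo-job-flow")
                .withLogUri("s3://my-bucket/logs")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(4)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small"))
                .withSteps(firstStep);
        String jobFlowId = emr.runJobFlow(run).getJobFlowId();

        // DescribeJobFlows: check the status of the job flow we just started.
        emr.describeJobFlows(new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId));

        // AddJobFlowSteps: append another step to the running job flow.
        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId(jobFlowId)
                .withSteps(new StepConfig()
                        .withName("follow-up-step")
                        .withHadoopJarStep(new HadoopJarStepConfig()
                                .withJar("s3://my-bucket/followup.jar"))));

        // TerminateJobFlows: stop the job flow and shut down all of its instances.
        emr.terminateJobFlows(new TerminateJobFlowsRequest().withJobFlowIds(jobFlowId));
    }
}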
145. Hadoop
Divide and conquer gigantic data
Matthew McCullough
Email matthewm@ambientideas.com
Twitter @matthewmccull
Blog http://ambientideas.com/blog