The document discusses how data is growing faster than individual machines can scale, requiring work to be spread across many machines. It introduces MapReduce as an approach to tackle large datasets by moving computation to where the data is located. MapReduce provides a way to break problems into independent parallelizable tasks by splitting data, processing it in a map phase where computation is done locally, and then shuffling and reducing the data in a parallel fashion. The document advocates building higher-level functions on top of MapReduce as a basic building block.
16. We need to put the picture back together again
Better-informed decisions
Understand and improve the health of populations
Reduce systematic friction
32. Population Health
- Securely bring together health data
- Identify opportunities to improve care
- Support application of improvements
- Close the loop
38. Relational Databases are Awesome
Atomic, transactional updates
Declarative queries
Guaranteed consistency
Easy to reason about
Long track record of success
50. “Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part
51. The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
52. Amdahl’s Law
$S = \frac{1}{(1 - P) + \frac{P}{N}}$
S: speed improvement
P: fraction of the problem that can be parallelized
N: number of processors
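To make the cost of the serial part concrete, here is a minimal sketch that evaluates Amdahl's Law in its standard form for a few processor counts. The function name `amdahl_speedup` and the sample values of P and N are illustrative assumptions, not taken from the slides.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical speedup S = 1 / ((1 - p) + p / n).

    p is the fraction of the work that can be parallelized (0..1),
    n is the number of processors.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    # Compare how much extra processors help at different parallel fractions.
    for p in (0.50, 0.95, 0.99):
        for n in (10, 100, 1000):
            print(f"P={p:.2f} N={n:>4} -> speedup {amdahl_speedup(p, n):.1f}x")
```

As N grows, the speedup approaches 1 / (1 - P), so a job that is 95% parallel can never run more than 20 times faster no matter how many machines are added. That ceiling is why the serial part deserves most of the design attention.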
53. MapReduce Primer
[Diagram: Input Data -> Split 1, Split 2, Split 3, ..., Split N -> Mapper 1, Mapper 2, Mapper 3, ..., Mapper N (Map Phase) -> Shuffle -> Reducer 1, Reducer 2, ..., Reducer N (Reduce Phase)]
54. MapReduce Example: Word Count
[Diagram: Books -> Count words per book, one mapper per book (Map Phase) -> Shuffle -> Sum words A-C, Sum words D-E, ..., Sum words W-Z (Reduce Phase)]
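The word-count flow on this slide can be sketched in a single process as follows. This is an in-memory illustration of the three phases, not Hadoop code; the function names (`map_book`, `shuffle`, `reduce_word`, `word_count`) are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_book(text: str) -> Iterator[Tuple[str, int]]:
    """Map phase: count words within one book and emit (word, count) pairs."""
    yield from Counter(text.lower().split()).items()


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: collect every count that was emitted for the same word."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_word(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the per-book counts for one word."""
    return word, sum(counts)


def word_count(books: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a list of book texts."""
    mapped = (pair for book in books for pair in map_book(book))
    return dict(reduce_word(word, counts) for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["the cat sat", "the dog sat down"]))
```

In a real cluster each call to `map_book` would run on the machine that holds that book, and the shuffle would route words to reducers by key range (the "A-C", "D-E", ..., "W-Z" boxes in the diagram). The sketch keeps everything in one dictionary to show only the data flow.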
64. Grouping and Aggregating
[Diagram: Books -> Count words per book, one mapper per book (Map Phase) -> Shuffle -> Sum words A-C, Sum words D-E, ..., Sum words W-Z (Reduce Phase)]
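Word count is one instance of a more general group-and-aggregate pattern: the map step picks a key and a value, the shuffle groups values by key, and the reduce step collapses each group. A minimal single-process sketch, assuming a hypothetical helper named `group_and_aggregate` and made-up sample records:

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

Record = TypeVar("Record")
Value = TypeVar("Value")
Result = TypeVar("Result")


def group_and_aggregate(
    records: Iterable[Record],
    key_fn: Callable[[Record], Hashable],
    value_fn: Callable[[Record], Value],
    aggregate_fn: Callable[[List[Value]], Result],
) -> Dict[Hashable, Result]:
    """Group records by key, then reduce each group to a single result."""
    groups: Dict[Hashable, List[Value]] = defaultdict(list)
    for record in records:
        # Map: emit (key, value). The append stands in for the shuffle.
        groups[key_fn(record)].append(value_fn(record))
    # Reduce: collapse each group's values with the supplied aggregate.
    return {key: aggregate_fn(values) for key, values in groups.items()}


if __name__ == "__main__":
    # Hypothetical (city, temperature) readings; find the maximum per city.
    readings = [("oslo", 3), ("cairo", 31), ("oslo", 7), ("cairo", 35)]
    print(group_and_aggregate(readings, lambda r: r[0], lambda r: r[1], max))
```

Swapping `max` for `sum`, `len`, or a custom function changes the aggregation without touching the grouping logic, which is what makes the pattern reusable.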
65. Joins
[Diagram: Data Set 1 and Data Set 2 are each divided into Split 1, Split 2, Split 3; every split is grouped by key (Map Phase) -> Shuffle -> Reducer 1, Reducer 2, ..., Reducer N (Reduce Phase)]
66. Joins
[Diagram: Persons and Visits are each divided into Split 1, Split 2, Split 3; every split is grouped by person id (Map Phase) -> Shuffle -> Reducer 1, Reducer 2, ..., Reducer N (Reduce Phase)]
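The persons-and-visits join can be sketched as a reduce-side join: both data sets are mapped to the same key, tagged with their origin, and matched up inside the reducer. This is a single-process illustration; the field names (`person_id`, `name`, `reason`) and function names are hypothetical, and it performs an inner join.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple

Tagged = Tuple[str, dict]  # (source tag, record)


def map_persons(persons: Iterable[dict]) -> Iterator[Tuple[int, Tagged]]:
    """Map phase for Persons: key by person id, tag the record's origin."""
    for person in persons:
        yield person["person_id"], ("person", person)


def map_visits(visits: Iterable[dict]) -> Iterator[Tuple[int, Tagged]]:
    """Map phase for Visits: same key, different tag."""
    for visit in visits:
        yield visit["person_id"], ("visit", visit)


def shuffle(pairs: Iterable[Tuple[int, Tagged]]) -> Dict[int, List[Tagged]]:
    """Shuffle: bring all records with the same person id together."""
    grouped: Dict[int, List[Tagged]] = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    return grouped


def reduce_join(tagged: List[Tagged]) -> Iterator[dict]:
    """Reduce phase: pair every person record with every visit record."""
    persons = [record for tag, record in tagged if tag == "person"]
    visits = [record for tag, record in tagged if tag == "visit"]
    for person in persons:
        for visit in visits:
            yield {**person, **visit}


def join_persons_visits(persons: Iterable[dict], visits: Iterable[dict]) -> List[dict]:
    grouped = shuffle(chain(map_persons(persons), map_visits(visits)))
    return [row for tagged in grouped.values() for row in reduce_join(tagged)]


if __name__ == "__main__":
    people = [{"person_id": 1, "name": "Ann"}, {"person_id": 2, "name": "Bob"}]
    seen = [{"person_id": 1, "reason": "checkup"}, {"person_id": 1, "reason": "flu"}]
    print(join_persons_visits(people, seen))
```

Because the matching happens only after the shuffle, neither data set has to fit in memory on a single machine, at the cost of moving both across the network.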
67. Map-Side Joins
[Diagram: Data Set 1 is divided into Split 1, Split 2, Split 3; Mapper 1, Mapper 2, and Mapper 3 each read one split together with a copy of Data Set 2 (Map Phase) -> Shuffle -> Reducer 1, Reducer 2, ... (Reduce Phase)]
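A map-side join avoids shuffling the large data set by giving every mapper its own in-memory copy of the small one. A minimal sketch of that idea, assuming the smaller data set fits in memory; the helper names and the `clinic_id`, `city`, and `visit_id` fields are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def build_lookup(small_dataset: Iterable[dict], key: str) -> Dict[object, dict]:
    """Load the small data set into memory; every mapper gets its own copy."""
    return {row[key]: row for row in small_dataset}


def map_side_join(
    split: Iterable[dict], lookup: Dict[object, dict], key: str
) -> Iterator[dict]:
    """Mapper: join each row of one split against the in-memory lookup."""
    for row in split:
        match = lookup.get(row[key])
        if match is not None:  # inner join: rows with no match are dropped
            yield {**match, **row}


if __name__ == "__main__":
    clinics = [
        {"clinic_id": "c1", "city": "Austin"},
        {"clinic_id": "c2", "city": "Boise"},
    ]
    visit_splits = [
        [{"visit_id": 1, "clinic_id": "c1"}],
        [{"visit_id": 2, "clinic_id": "c2"}, {"visit_id": 3, "clinic_id": "c9"}],
    ]
    lookup = build_lookup(clinics, "clinic_id")
    for split in visit_splits:  # each split could run on a different machine
        print(list(map_side_join(split, lookup, "clinic_id")))
```

The join itself needs no shuffle, so it is cheap. The trade-off is that it only works while one side stays small enough to replicate to every mapper.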
70. Chain Jobs Together
Large-scale joins must have a reduce phase
Multiple joins or group-by operations mean multiple jobs
[Diagram: Normalize Data -> Join Related Items -> Compute Summary -> Output]
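The normalize, join, and summarize stages can be sketched as three functions where each one consumes the previous one's output, mirroring a chain of jobs. Everything here runs in one process for illustration; the function names and the `person_id`, `cost`, and `region` fields are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def normalize(raw_visits: Iterable[dict]) -> List[dict]:
    """Job 1 (map-only): clean up ids and parse costs."""
    return [
        {"person_id": int(str(v["person_id"]).strip()), "cost": float(v["cost"])}
        for v in raw_visits
    ]


def join_with_persons(visits: Iterable[dict], persons: Iterable[dict]) -> List[dict]:
    """Job 2 (needs a reduce phase): group both inputs by person id and join."""
    groups: Dict[int, Dict[str, List[dict]]] = defaultdict(
        lambda: {"persons": [], "visits": []}
    )
    for person in persons:
        groups[person["person_id"]]["persons"].append(person)
    for visit in visits:
        groups[visit["person_id"]]["visits"].append(visit)
    return [
        {**person, **visit}
        for group in groups.values()
        for person in group["persons"]
        for visit in group["visits"]
    ]


def summarize(joined: Iterable[dict]) -> Dict[str, float]:
    """Job 3 (second group-by, so a separate job): total cost per region."""
    totals: Dict[str, float] = defaultdict(float)
    for row in joined:
        totals[row["region"]] += row["cost"]
    return dict(totals)


def run_pipeline(raw_visits: Iterable[dict], persons: Iterable[dict]) -> Dict[str, float]:
    """Chain the jobs: each stage reads the output of the one before it."""
    return summarize(join_with_persons(normalize(raw_visits), persons))


if __name__ == "__main__":
    people = [{"person_id": 1, "region": "north"}, {"person_id": 2, "region": "south"}]
    raw = [{"person_id": " 1 ", "cost": "10.5"}, {"person_id": "2", "cost": "4"}]
    print(run_pipeline(raw, people))
```

The join groups by person id while the summary groups by region. Those are two different keys and therefore two separate shuffles, which is why the slide says multiple group-by operations mean multiple jobs.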
71. Codified in High-Level Libraries
Hive, Pig, Cascading, and Crunch provide simple means to use these patterns
Apache Crunch
The era of writing MapReduce by hand is over
81. Don’t be afraid to re-process the world
Something’s wrong, we’re above 95% usage!
-Traditional System Administrator
Something’s wrong, we’re below 95% usage!
-Hadoop System Administrator
84. From Databases to Dataspaces
(Also referred to as Data Lakes)
Franklin, Halevy, Maier, http://homes.cs.washington.edu/~alon/files/dataspacesDec05.pdf
88. Bring all of your data together...
...structured or unstructured...
...transform it with unlimited computation...
...at any time for any new need.