2. logistics
assignment 1
currently marking
might upload best student solutions?
will provide annotated pdfs with marks
virtualbox has same path & filename as previous one
4. previously... on modern database systems
generalization of mapreduce
more suitable for iterative workflows: iterative algorithms and repetitive querying
rdd: resilient distributed dataset
read-only, lazily evaluated, easily re-created
ephemeral, unless we need to keep them in memory
5. example: text search
suppose that a web service is experiencing errors
you want to search over terabytes of logs to find the cause
the logs are stored in the Hadoop Filesystem (HDFS)
errors are written in the logs as lines that start with the keyword "ERROR"
6. example: text search
in Scala...

lines = spark.textFile("hdfs://...")            // rdd from a file
errors = lines.filter(_.startsWith("ERROR"))    // transformation -> another rdd
errors.persist()                                // hint: keep in memory!
// no work on the cluster so far
errors.count()                                  // action! (lines is not loaded to ram)

[screenshot of Section 2.2.1 and Figure 1 from the RDD paper: line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), line 2 derives a filtered RDD from it, and line 3 asks for errors to persist in memory so that it can be shared across queries; the argument to filter is Scala syntax for a closure; at this point, no work has been performed on the cluster, but the user can now use the RDD in actions, e.g., to count the number of messages.]
7. example - text search ctd.
let us find errors related to "MySQL"
8. example - text search ctd.

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// filter is a transformation, count is an action

[screenshot of Section 2.2.1 from the RDD paper: the user can also perform further transformations on the RDD and use their results.]
9. example - text search ctd. again
let us find errors related to "HDFS" and extract their time field
assuming time is field no. 3 in tab-separated format
10. example - text search ctd. again

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))    // transformation
      .map(_.split('\t')(3))         // transformation
      .collect()                     // action

after the first action involving errors runs, Spark will store the partitions of errors in memory,
greatly speeding up subsequent computations on it
note that the base RDD, lines, is not loaded into RAM; this is desirable...

[screenshot of Section 2.2.1 from the RDD paper]
11. example: text search
lineage of time fields

lines --filter(_.startsWith("ERROR"))--> errors (cached)
errors --filter(_.contains("HDFS"))--> HDFS errors
HDFS errors --map(_.split('\t')(3))--> time fields
(the last two transformations are pipelined)

Figure 1 (RDD paper): lineage graph for the third query in the example; boxes represent RDDs and arrows represent transformations.

if a partition of errors is lost, filter is applied only to the corresponding partition of lines
12. representing rdds
internal information about rdds (see the sketch below)
partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
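As a rough conceptual sketch (the names below are hypothetical, not Spark's internal classes), the information above can be pictured as a small record per RDD:

from collections import namedtuple

# conceptual sketch only: an rdd is described by its partitions & partitioning scheme,
# its dependencies on parent rdds, and a function that computes a partition from its parents
RDDInfo = namedtuple("RDDInfo", ["partitions", "partitioner", "parents", "compute"])

errors_info = RDDInfo(
    partitions=["part-0", "part-1"],   # partitions & partitioning scheme
    partitioner=None,                  # e.g., a hash partitioner for pair rdds
    parents=["lines"],                 # dependencies on parent rdds
    compute=lambda parent_partition: [l for l in parent_partition if l.startswith("ERROR")],
)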
13. rdd dependencies
narrow dependencies
each partition of the parent rdd is used by at most one partition of the child rdd
otherwise, wide dependencies
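For instance, a minimal PySpark sketch (assuming a SparkContext sc, as in the programming part later): map keeps a narrow dependency, while groupByKey introduces a wide one.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
doubled = pairs.map(lambda kv: (kv[0], 2 * kv[1]))  # narrow: each child partition depends on one parent partition
grouped = pairs.groupByKey()                        # wide: a child partition may depend on all parent partitions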
14. rdd dependencies
[Figure 4 from the RDD paper: examples of narrow and wide dependencies.]
narrow dependencies: map, filter, union, join with inputs co-partitioned
wide dependencies: groupByKey, join with inputs not co-partitioned
15. scheduling
when an action is performed... (e.g., count() or save())
...the scheduler examines the lineage graph
and builds a DAG of stages to execute
each stage is a maximal pipeline of transformations over narrow dependencies
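A small sketch of how to inspect this (assuming sc as before; the exact format of toDebugString's output depends on the Spark version):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
result = pairs.map(lambda kv: (kv[0], kv[1] + 1)).groupByKey().mapValues(list)
# the lineage shows map/mapValues pipelined within a stage,
# with groupByKey marking a shuffle (stage) boundary
print(result.toDebugString())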
16. scheduling
[Figure 5 from the RDD paper: example of how Spark computes job stages for a job with map, union, groupBy, and join (stages 1-3). Boxes with solid outlines are RDDs; partitions are shaded rectangles, in black if the partition is already in ram.]
17. memory management
when there is not enough memory, apply an LRU eviction policy at the rdd level
evict a partition from the least recently used rdd
18. performance
logistic regression and k-means
amazon EC2
10 iterations on 100GB datasets
100-node clusters
22. spark programming
creating rdds
transformations & actions
lazy evaluation & persistency
passing custom functions
working with key-value pairs
data partitioning
accumulators & broadcast variables
pyspark
23. driver program
contains the main function of the application
defines rdds and applies operations on them
e.g., the spark shell itself is a driver program
driver programs access spark through a SparkContext object
24. example

import pyspark
sc = pyspark.SparkContext(master = "local", appName = "tour")  # create the SparkContext; this part is assumed from now on
text = sc.textFile("myfile.txt")  # load data - creates an rdd
text.count()  # count lines - an operation

if we are running on a cluster of machines, different machines might count different parts of the file
the SparkContext is created automatically in the spark shell
25. example
text = sc.textFile("myfile.txt") # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count() # count lines
an operation with a custom function
on a cluster, Spark ships the function to all workers
26. lambda functions in python

f = (lambda line: 'Spark' in line)
f("we are learning Spark")

# equivalent:
def f(line):
    return 'Spark' in line

f("we are learning Spark")
27. stopping
text = sc.textFile("myfile.txt") # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count() # count lines
sc.stop()
28. rdds
resilient distributed datasets
resilient: easy to recover
distributed: different partitions materialize on different nodes
read-only (immutable), but can be transformed to other rdds
29. creating rdds
loading an external dataset
text = sc.textFile("myfile.txt")
distributing a collection of objects
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9])
transforming other rdds
text_spark = text.filter(lambda line: 'Spark' in line)
data_length = data.map(lambda num: num ** 2)
30. rdd operations
transformations return a new rdd
actions extract information from an rdd or save it to disk

inputRDD = sc.textFile("logfile.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badlinesRDD = errorsRDD.union(warningsRDD)
print("Input had", badlinesRDD.count(), "concerning lines.")
print("Here are some of them:")
for line in badlinesRDD.take(10):
    print(line)
31. rdd operations
transformations return a new rdd
actions extract information from an rdd or save it to disk

import numpy as np

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))  # create RDD
primesRDD = numbersRDD.filter(is_prime)               # transformation
primes = primesRDD.collect()                          # action
print(primes[:100])                                   # operation in driver

what if primes does not fit in memory?
32. rdd operations
transformations return a new rdd
actions extract information from an rdd or save it to disk

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))  # create RDD
primesRDD = numbersRDD.filter(is_prime)               # transformation
primesRDD.saveAsTextFile("primes.txt")                # action: save to disk
33. rdds
evaluated lazily
ephemeral
can persist in memory (or on disk) if we ask
34. lazy evaluation

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
# no cluster activity until here
primesRDD.saveAsTextFile("primes.txt")

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
# no cluster activity until here
primes = primesRDD.collect()
print(primes[:100])
35. persistence
RDDs can persist in memory, if we ask politely

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.persist()
primesRDD.count()   # causes the RDD to materialize
primesRDD.take(10)  # RDD already in memory
37. persistence
we can ask Spark to maintain rdds on disk, or even keep replicas on different nodes

# keep on disk and in memory, with 2 replicas
data.persist(pyspark.StorageLevel(useDisk = True, useMemory = True,
                                  useOffHeap = False, deserialized = False, replication = 2))

to cease persistence:
data.unpersist()  # removes the rdd from memory and disk
38. passing functions
lambda functions or function references

text = sc.textFile("myfile.txt")
text_spark = text.filter(lambda line: 'Spark' in line)

def f(line):
    return 'Spark' in line

text = sc.textFile("myfile.txt")
text_spark = text.filter(f)
39. passing functions
warning!
if the function is a member of an object (self.method) or references fields of an object (e.g., self.field)...
Spark serializes and sends the entire object to worker nodes
this can be very inefficient
40. passing functions
where is the problem in the code below?

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)
41. passing functions
where is the problem in the code below?

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)                # reference to an object method
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)    # reference to an object field
42. passing functions
better implementation

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd(self, rdd):
        query = self.query  # copy the field to a local variable, so only that is shipped
        return rdd.filter(lambda x: query in x)
43. common rdd operations
element-wise transformations: map and filter

inputRDD {1,2,3,4} --.map(lambda x: x**2)--> mappedRDD {1,4,9,16}
inputRDD {1,2,3,4} --.filter(lambda x: x!=1)--> filteredRDD {2,3,4}

map's return type can be different than its input's
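The diagram above as runnable PySpark (assuming sc as before):

inputRDD = sc.parallelize([1, 2, 3, 4])
mappedRDD = inputRDD.map(lambda x: x**2)         # {1, 4, 9, 16}
filteredRDD = inputRDD.filter(lambda x: x != 1)  # {2, 3, 4}
mappedRDD.collect(), filteredRDD.collect()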
44. common rdd operations
element-wise transformations that produce multiple elements per input element: flatMap
phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.count()
9
45. common rdd operations
how is the result different?

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.collect()
['hello', 'world', 'how', 'are', 'you', 'how', 'do', 'you', 'do']

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.map(lambda phrase: phrase.split(" "))
words.collect()
[['hello', 'world'], ['how', 'are', 'you'], ['how', 'do', 'you', 'do']]
46. common rdd operations
(pseudo) set operations

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
52. common rdd operations
(pseudo) set operations: union, subtraction, duplicate removal, intersection, cartesian product
big difference in implementation (and efficiency): some require partition shuffling, others do not
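For reference, a small sketch of these operations on oneRDD and otherRDD from the earlier slide (assuming sc as before):

oneRDD.union(otherRDD).collect()         # union, keeps duplicates; no shuffle
oneRDD.distinct().collect()              # duplicate removal; shuffles
oneRDD.intersection(otherRDD).collect()  # intersection; shuffles
oneRDD.subtract(otherRDD).collect()      # subtraction; shuffles
oneRDD.cartesian(otherRDD).count()       # cartesian product; very expensive on large rdds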
53. common rdd operations
sortBy
how is sortBy implemented? we'll see later...

data = sc.parallelize(np.random.rand(10))
data.sortBy(lambda x: x)
54. common rdd operations
actions: reduce
successively operates on two elements of the rdd
returns a new element of the same type
works with commutative & associative functions

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x + y)
181
data.reduce(lambda x, y: x * y)
3188536
55. commutative & associative function
function f(x, y), e.g., add(x, y) = x + y

commutative: f(x, y) = f(y, x)
e.g., add(x, y) = x + y = y + x = add(y, x)

associative: f(x, f(y, z)) = f(f(x, y), z)
e.g., add(x, add(y, z)) = x + (y + z) = (x + y) + z = add(add(x, y), z)
56. common rdd operations
actions: reduce
successively operates on two elements of the rdd, produces a single aggregate

compute the sum of squares of data
is the following correct?

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x**2 + y**2)
137823683725010149883130929

no - why?
57. common rdd operations
actions: reduce
successively operates on two elements of the rdd, produces a single aggregate

compute the sum of squares of data
is the following correct?

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2
8927.0

yes - why?
58. common rdd operations
actions: aggregate
aggregate generalizes reduce
the user provides
a zero value: the identity element for the aggregation
a sequential operation (function): updates the aggregation with one more element within one partition
a combining operation (function): combines aggregates from different partitions
59. common rdd operations
actions: aggregate
aggregate generalizes reduce
what does the following compute?

data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue = (0,0),                                    # zero value
                      seqOp = (lambda x, y: (x[0] + y, x[1] + 1)),          # sequential operation
                      combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1])))   # combining operation
aggr[0] / aggr[1]

answer: the average value of data
60. common rdd operations
actions

operation     | returns
collect       | all elements
take(num)     | num elements; tries to minimize disk access (e.g., by accessing one partition)
takeSample    | a random sample of elements
count         | number of elements
countByValue  | number of times each element appears; first on each partition, then combines partition results
top(num)      | num maximum elements; sorts partitions and merges
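A brief usage sketch of these actions (assuming sc as before; takeSample's first argument says whether to sample with replacement):

data = sc.parallelize([5, 1, 5, 3, 2, 5])
data.collect()             # [5, 1, 5, 3, 2, 5]
data.take(2)               # [5, 1]
data.takeSample(False, 3)  # 3 elements sampled without replacement
data.count()               # 6
data.countByValue()        # {5: 3, 1: 1, 3: 1, 2: 1}
data.top(2)                # [5, 5]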
61. common rdd operations
all operations we have described so far apply to all rdds
that's why the word "common" has been in the title
62. pair rdds
elements are key-value pairs

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.collect()[:5]
[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

they come from the mapreduce model
practical in many cases
spark provides operations tailored to pair rdds
63. transformations on pair rdds
keys and values

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.keys().collect()[:5]
[0, 1, 2, 3, 4]
pairRDD.values().collect()[:5]
[0, 1, 4, 9, 16]
64. transformations on pair rdds
reduceByKey

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
                           ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
volumePerKey.collect()
[('$APPL', 201.16), ('$AMZN', 1104.64), ('$GOOG', 706.2)]

reduceByKey is a transformation, unlike reduce (which is an action)
65. transformations on pair rdds
combineByKey generalizes reduceByKey
the user provides
createCombiner function: provides the zero value for each key
mergeValue function: combines the current aggregate in one partition with a new value
mergeCombiners function: combines aggregates from different partitions
66. transformations on pair rdds
combineByKey generalizes reduceByKey
what does the following produce?

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
                           ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
aggr = pairRDD.combineByKey(createCombiner = lambda x: (x, 1),
                            mergeValue = lambda x, y: (x[0] + y, x[1] + 1),
                            mergeCombiners = lambda x, y: (x[0] + y[0], x[1] + y[1]))
avgPerKey = aggr.map(lambda x: (x[0], x[1][0]/x[1][1]))
avgPerKey.collect()
67. transformations on pair rdds
sortByKey
samples values from the rdd to estimate sorted partition boundaries
shuffles the data
sorts by external sorting
used to implement the common sortBy
idea: create a pair RDD with (sort-key, item) elements, then apply sortByKey on that (see the sketch below)
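A minimal sketch of that idea (not Spark's actual implementation of sortBy):

data = sc.parallelize([("bob", 3), ("ann", 1), ("cid", 2)])
# pair each item with its sort key, sort by key, then drop the key
by_count = data.map(lambda item: (item[1], item)).sortByKey().values()
by_count.collect()   # [('ann', 1), ('cid', 2), ('bob', 3)]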
68. transformations on pair rdds
(inner) join
implemented with a variant of hash join
spark also has functions for left outer join, right outer join, full outer join
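For illustration, a small sketch of joins on pair rdds (the data here is made up):

addresses = sc.parallelize([("Anu", "Chem. A143"), ("Michael", "OIH, B253.2")])
office_hours = sc.parallelize([("Michael", "Tue 10-12"), ("Orestis", "Thu 14-16")])
addresses.join(office_hours).collect()           # inner join: only keys present on both sides
addresses.leftOuterJoin(office_hours).collect()  # keeps every key of addresses
addresses.fullOuterJoin(office_hours).collect()  # keeps keys from both sides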
72. shared variables
accumulators: write-only for workers
broadcast variables: read-only for workers
73. accumulators

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1
    return length

llengthRDD = text.map(line_len)
llengthRDD.count()
95
long_lines.value
45
# lazy! the accumulator is only updated once an action forces the map to run
74. accumulators
fault tolerance
spark executes updates in actions only once
e.g., foreach() (foreach: a special action)
this is not guaranteed for transformations
in transformations, use accumulators only for debugging purposes!
75. accumulators + foreach

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1

text.foreach(line_len)
long_lines.value
45
76. broadcast variables
sent to workers only once
read-only: even if you change its value on a worker, the change does not propagate to other workers
(actually, the broadcast object is written to a file and read from there by each worker)
release with unpersist()
77. broadcast variables

def load_address_table():
    return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
            "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}

address_table = sc.broadcast(load_address_table())

def find_address(name):
    res = None
    if name in address_table.value:
        res = address_table.value[name]
    return res

data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = data.map(lambda name: (name, find_address(name)))
pairRDD.collectAsMap()
78. partitioning
certain operations take advantage of partitioning
e.g., reduceByKey, join
anRDD.partitionBy(numPartitions, partitionFunc)
users can set the number of partitions and the partitioning function
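A small sketch (assuming sc as before): partitionBy is defined on pair rdds, and pre-partitioning by key can let a later reduceByKey or join avoid reshuffling the data.

pairRDD = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
# hash-partition into 4 partitions by key and keep the partitioned rdd around
partitioned = pairRDD.partitionBy(4, lambda key: hash(key)).persist()
partitioned.reduceByKey(lambda x, y: x + y).collect()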
79. working on a per-partition basis
spark provides operations that operate at the partition level
e.g., mapPartitions
used in the implementation of Spark itself

rdd = sc.parallelize(range(100), 4)
def f(iterator): yield sum(iterator)
rdd.mapPartitions(f).collect()
80. see the implementation of spark at https://github.com/apache/spark/
81. references
1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10.
2. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
3. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.
4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
5. Spark implementation: https://github.com/apache/spark/
6. "Making Big Data Processing Simple with Spark," Matei Zaharia, https://youtu.be/d9D-Z3-44F8