2. logistics
assignment 1
currently marking
might upload best student solutions?
will provide annotated pdfs with marks
virtualbox has same path & filename as previous one
4. previously... on modern database systems
generalization of mapreduce
more suitable for iterative workflows: iterative algorithms and repetitive querying
rdd: resilient distributed dataset
read-only, lazily evaluated, easily re-created
ephemeral, unless we need to keep them in memory
5. example: text search
suppose that a web service is experiencing errors
you want to search over terabytes of logs to find the cause
the logs are stored in the Hadoop Filesystem (HDFS)
errors are written in the logs as lines that start with the keyword "ERROR"
6. example: text search
in Scala...

lines = spark.textFile("hdfs://...")            // rdd from a file
errors = lines.filter(_.startsWith("ERROR"))    // transformation -> another rdd
errors.persist()                                // hint: keep in memory!
// no work on the cluster so far
errors.count()                                  // action! (lines is not loaded to ram)

[screenshot of Section 2.2.1 and Figure 1 from the RDD paper: line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), line 2 derives a filtered RDD from it, and line 3 asks for errors to persist in memory so that it can be shared across queries; the argument to filter is Scala syntax for a closure; at this point, no work has been performed on the cluster, but the user can now use the RDD in actions, e.g., to count the number of messages.]
7. example - text search ctd.
let us find errors related to "MySQL"
8. example - text search ctd.

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// filter is a transformation, count is an action

[screenshot of Section 2.2.1 from the RDD paper: the user can also perform further transformations on the RDD and use their results.]
9. example - text search ctd. again
let us find errors related to "HDFS" and extract their time field
assuming time is field no. 3 in tab-separated format
10. example - text search ctd. again

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))    // transformation
      .map(_.split('\t')(3))         // transformation
      .collect()                     // action

after the first action involving errors runs, Spark will store the partitions of errors in memory,
greatly speeding up subsequent computations on it
note that the base RDD, lines, is not loaded into RAM; this is desirable...

[screenshot of Section 2.2.1 from the RDD paper]
11. example: text search
lineage of time fields

lines --filter(_.startsWith("ERROR"))--> errors (cached)
errors --filter(_.contains("HDFS"))--> HDFS errors
HDFS errors --map(_.split('\t')(3))--> time fields
(the last two transformations are pipelined)

Figure 1 (RDD paper): lineage graph for the third query in the example; boxes represent RDDs and arrows represent transformations.

if a partition of errors is lost, filter is applied only to the corresponding partition of lines
12. representing rdds
internal information about rdds (see the sketch below)
partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
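As a rough conceptual sketch (the names below are hypothetical, not Spark's internal classes), the information above can be pictured as a small record per RDD:

from collections import namedtuple

# conceptual sketch only: an rdd is described by its partitions & partitioning scheme,
# its dependencies on parent rdds, and a function that computes a partition from its parents
RDDInfo = namedtuple("RDDInfo", ["partitions", "partitioner", "parents", "compute"])

errors_info = RDDInfo(
    partitions=["part-0", "part-1"],   # partitions & partitioning scheme
    partitioner=None,                  # e.g., a hash partitioner for pair rdds
    parents=["lines"],                 # dependencies on parent rdds
    compute=lambda parent_partition: [l for l in parent_partition if l.startswith("ERROR")],
)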
13. rdd dependencies
narrow dependencies
each partition of the parent rdd is used by at most one partition of the child rdd
otherwise, wide dependencies
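For instance, a minimal PySpark sketch (assuming a SparkContext sc, as in the programming part later): map keeps a narrow dependency, while groupByKey introduces a wide one.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
doubled = pairs.map(lambda kv: (kv[0], 2 * kv[1]))  # narrow: each child partition depends on one parent partition
grouped = pairs.groupByKey()                        # wide: a child partition may depend on all parent partitions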
14. rdd dependencies
[Figure 4 from the RDD paper: examples of narrow and wide dependencies.]
narrow dependencies: map, filter, union, join with inputs co-partitioned
wide dependencies: groupByKey, join with inputs not co-partitioned
15. scheduling
when an action is performed... (e.g., count() or save())
...the scheduler examines the lineage graph
and builds a DAG of stages to execute
each stage is a maximal pipeline of transformations over narrow dependencies
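A small sketch of how to inspect this (assuming sc as before; the exact format of toDebugString's output depends on the Spark version):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
result = pairs.map(lambda kv: (kv[0], kv[1] + 1)).groupByKey().mapValues(list)
# the lineage shows map/mapValues pipelined within a stage,
# with groupByKey marking a shuffle (stage) boundary
print(result.toDebugString())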
16. scheduling
[Figure 5 from the RDD paper: example of how Spark computes job stages for a job with map, union, groupBy, and join (stages 1-3). Boxes with solid outlines are RDDs; partitions are shaded rectangles, in black if the partition is already in ram.]
17. memory management
when there is not enough memory, apply an LRU eviction policy at the rdd level
evict a partition from the least recently used rdd
18. performance
logistic regression and k-means
amazon EC2
10 iterations on 100GB datasets
100-node clusters
22. spark programming
creating rdds
transformations & actions
lazy evaluation & persistency
passing custom functions
working with key-value pairs
data partitioning
accumulators & broadcast variables
pyspark
23. driver program
contains the main function of the application
defines rdds and applies operations on them
e.g., the spark shell itself is a driver program
driver programs access spark through a SparkContext object
24. example

import pyspark
sc = pyspark.SparkContext(master = "local", appName = "tour")  # create the SparkContext; this part is assumed from now on
text = sc.textFile("myfile.txt")  # load data - creates an rdd
text.count()  # count lines - an operation

if we are running on a cluster of machines, different machines might count different parts of the file
the SparkContext is created automatically in the spark shell
25. example
text = sc.textFile("myfile.txt") # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count() # count lines
an operation with a custom function
on a cluster, Spark ships the function to all workers
26. lambda functions in python

f = (lambda line: 'Spark' in line)
f("we are learning Spark")

# equivalent:
def f(line):
    return 'Spark' in line

f("we are learning Spark")
27. stopping
text = sc.textFile("myfile.txt") # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count() # count lines
sc.stop()
28. rdds
resilient distributed datasets
resilient: easy to recover
distributed: different partitions materialize on different nodes
read-only (immutable), but can be transformed to other rdds
29. creating rdds
loading an external dataset
text = sc.textFile("myfile.txt")
distributing a collection of objects
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9])
transforming other rdds
text_spark = text.filter(lambda line: 'Spark' in line)
data_length = data.map(lambda num: num ** 2)
30. rdd operations
transformations return a new rdd
actions extract information from an rdd or save it to disk

inputRDD = sc.textFile("logfile.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badlinesRDD = errorsRDD.union(warningsRDD)
print("Input had", badlinesRDD.count(), "concerning lines.")
print("Here are some of them:")
for line in badlinesRDD.take(10):
    print(line)
31. rdd operations
transformations return a new rdd
actions extract information from an rdd or save it to disk

import numpy as np

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))  # create RDD
primesRDD = numbersRDD.filter(is_prime)               # transformation
primes = primesRDD.collect()                          # action
print(primes[:100])                                   # operation in driver

what if primes does not fit in memory?
32. rdd operations
transformations return a new rdd
actions extract information from an rdd or save it to disk

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))  # create RDD
primesRDD = numbersRDD.filter(is_prime)               # transformation
primesRDD.saveAsTextFile("primes.txt")                # action: save to disk
33. rdds
evaluated lazily
ephemeral
can persist in memory (or on disk) if we ask
34. lazy evaluation

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
# no cluster activity until here
primesRDD.saveAsTextFile("primes.txt")

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
# no cluster activity until here
primes = primesRDD.collect()
print(primes[:100])
35. persistence
RDDs can persist in memory, if we ask politely

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.persist()
primesRDD.count()   # causes the RDD to materialize
primesRDD.take(10)  # RDD already in memory
37. persistence
we can ask Spark to maintain rdds on disk, or even keep replicas on different nodes

# keep on disk and in memory, with 2 replicas
data.persist(pyspark.StorageLevel(useDisk = True, useMemory = True,
                                  useOffHeap = False, deserialized = False, replication = 2))

to cease persistence:
data.unpersist()  # removes the rdd from memory and disk
38. passing functions
lambda functions or function references

text = sc.textFile("myfile.txt")
text_spark = text.filter(lambda line: 'Spark' in line)

def f(line):
    return 'Spark' in line

text = sc.textFile("myfile.txt")
text_spark = text.filter(f)
39. passing functions
warning!
if the function is a member of an object (self.method) or references fields of an object (e.g., self.field)...
Spark serializes and sends the entire object to worker nodes
this can be very inefficient
40. passing functions
where is the problem in the code below?

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)
41. passing functions
where is the problem in the code below?

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)                # reference to an object method
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)    # reference to an object field
42. passing functions
better implementation

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd(self, rdd):
        query = self.query  # copy the field to a local variable, so only that is shipped
        return rdd.filter(lambda x: query in x)
43. common rdd operations
element-wise transformations: map and filter

inputRDD {1,2,3,4} --.map(lambda x: x**2)--> mappedRDD {1,4,9,16}
inputRDD {1,2,3,4} --.filter(lambda x: x!=1)--> filteredRDD {2,3,4}

map's return type can be different than its input's
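The diagram above as runnable PySpark (assuming sc as before):

inputRDD = sc.parallelize([1, 2, 3, 4])
mappedRDD = inputRDD.map(lambda x: x**2)         # {1, 4, 9, 16}
filteredRDD = inputRDD.filter(lambda x: x != 1)  # {2, 3, 4}
mappedRDD.collect(), filteredRDD.collect()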
44. common rdd operations
element-wise transformations that produce multiple elements per input element: flatMap
phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.count()
9
45. common rdd operations
how is the result different?

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.collect()
['hello', 'world', 'how', 'are', 'you', 'how', 'do', 'you', 'do']

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.map(lambda phrase: phrase.split(" "))
words.collect()
[['hello', 'world'], ['how', 'are', 'you'], ['how', 'do', 'you', 'do']]
46. common rdd operations
(pseudo) set operations

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
52. common rdd operations
(pseudo) set operations: union, subtraction, duplicate removal, intersection, cartesian product
big difference in implementation (and efficiency): some require partition shuffling, others do not
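For reference, a small sketch of these operations on oneRDD and otherRDD from the earlier slide (assuming sc as before):

oneRDD.union(otherRDD).collect()         # union, keeps duplicates; no shuffle
oneRDD.distinct().collect()              # duplicate removal; shuffles
oneRDD.intersection(otherRDD).collect()  # intersection; shuffles
oneRDD.subtract(otherRDD).collect()      # subtraction; shuffles
oneRDD.cartesian(otherRDD).count()       # cartesian product; very expensive on large rdds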
53. common rdd operations
sortBy
how is sortBy implemented? we'll see later...

data = sc.parallelize(np.random.rand(10))
data.sortBy(lambda x: x)
54. common rdd operations
actions: reduce
successively operates on two elements of the rdd
returns a new element of the same type
works with commutative & associative functions

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x + y)
181
data.reduce(lambda x, y: x * y)
3188536
55. commutative & associative function
function f(x, y), e.g., add(x, y) = x + y

commutative: f(x, y) = f(y, x)
e.g., add(x, y) = x + y = y + x = add(y, x)

associative: f(x, f(y, z)) = f(f(x, y), z)
e.g., add(x, add(y, z)) = x + (y + z) = (x + y) + z = add(add(x, y), z)
56. common rdd operations
actions: reduce
successively operates on two elements of the rdd, produces a single aggregate

compute the sum of squares of data
is the following correct?

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x**2 + y**2)
137823683725010149883130929

no - why?
57. common rdd operations
actions: reduce
successively operates on two elements of the rdd, produces a single aggregate

compute the sum of squares of data
is the following correct?

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2
8927.0

yes - why?
58. common rdd operations
actions: aggregate
aggregate generalizes reduce
the user provides
a zero value: the identity element for the aggregation
a sequential operation (function): updates the aggregation with one more element within one partition
a combining operation (function): combines aggregates from different partitions
59. common rdd operations
actions: aggregate
aggregate generalizes reduce
what does the following compute?

data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue = (0,0),                                    # zero value
                      seqOp = (lambda x, y: (x[0] + y, x[1] + 1)),          # sequential operation
                      combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1])))   # combining operation
aggr[0] / aggr[1]

answer: the average value of data
60. common rdd operations
actions

operation     | returns
collect       | all elements
take(num)     | num elements; tries to minimize disk access (e.g., by accessing one partition)
takeSample    | a random sample of elements
count         | number of elements
countByValue  | number of times each element appears; first on each partition, then combines partition results
top(num)      | num maximum elements; sorts partitions and merges
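A brief usage sketch of these actions (assuming sc as before; takeSample's first argument says whether to sample with replacement):

data = sc.parallelize([5, 1, 5, 3, 2, 5])
data.collect()             # [5, 1, 5, 3, 2, 5]
data.take(2)               # [5, 1]
data.takeSample(False, 3)  # 3 elements sampled without replacement
data.count()               # 6
data.countByValue()        # {5: 3, 1: 1, 3: 1, 2: 1}
data.top(2)                # [5, 5]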
61. common rdd operations
all operations we have described so far apply to all rdds
that's why the word "common" has been in the title
62. pair rdds
elements are key-value pairs

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.collect()[:5]
[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

they come from the mapreduce model
practical in many cases
spark provides operations tailored to pair rdds
63. transformations on pair rdds
keys and values

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.keys().collect()[:5]
[0, 1, 2, 3, 4]
pairRDD.values().collect()[:5]
[0, 1, 4, 9, 16]
64. transformations on pair rdds
reduceByKey

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
                           ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
volumePerKey.collect()
[('$APPL', 201.16), ('$AMZN', 1104.64), ('$GOOG', 706.2)]

reduceByKey is a transformation, unlike reduce (which is an action)
65. transformations on pair rdds
combineByKey generalizes reduceByKey
the user provides
createCombiner function: provides the zero value for each key
mergeValue function: combines the current aggregate in one partition with a new value
mergeCombiners function: combines aggregates from different partitions
66. transformations on pair rdds
combineByKey generalizes reduceByKey
what does the following produce?

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
                           ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
aggr = pairRDD.combineByKey(createCombiner = lambda x: (x, 1),
                            mergeValue = lambda x, y: (x[0] + y, x[1] + 1),
                            mergeCombiners = lambda x, y: (x[0] + y[0], x[1] + y[1]))
avgPerKey = aggr.map(lambda x: (x[0], x[1][0]/x[1][1]))
avgPerKey.collect()
67. transformations on pair rdds
sortByKey
samples values from the rdd to estimate sorted partition boundaries
shuffles the data
sorts by external sorting
used to implement the common sortBy
idea: create a pair RDD with (sort-key, item) elements, then apply sortByKey on that (see the sketch below)
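A minimal sketch of that idea (not Spark's actual implementation of sortBy):

data = sc.parallelize([("bob", 3), ("ann", 1), ("cid", 2)])
# pair each item with its sort key, sort by key, then drop the key
by_count = data.map(lambda item: (item[1], item)).sortByKey().values()
by_count.collect()   # [('ann', 1), ('cid', 2), ('bob', 3)]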
68. transformations on pair rdds
(inner) join
implemented with a variant of hash join
spark also has functions for left outer join, right outer join, full outer join
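For illustration, a small sketch of joins on pair rdds (the data here is made up):

addresses = sc.parallelize([("Anu", "Chem. A143"), ("Michael", "OIH, B253.2")])
office_hours = sc.parallelize([("Michael", "Tue 10-12"), ("Orestis", "Thu 14-16")])
addresses.join(office_hours).collect()           # inner join: only keys present on both sides
addresses.leftOuterJoin(office_hours).collect()  # keeps every key of addresses
addresses.fullOuterJoin(office_hours).collect()  # keeps keys from both sides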
72. shared variables
accumulators: write-only for workers
broadcast variables: read-only for workers
73. accumulators

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1
    return length

llengthRDD = text.map(line_len)
llengthRDD.count()
95
long_lines.value
45
# lazy! the accumulator is only updated once an action forces the map to run
74. accumulators
fault tolerance
spark executes updates in actions only once
e.g., foreach() (foreach: a special action)
this is not guaranteed for transformations
in transformations, use accumulators only for debugging purposes!
75. accumulators + foreach

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1

text.foreach(line_len)
long_lines.value
45
76. broadcast variables
sent to workers only once
read-only: even if you change its value on a worker, the change does not propagate to other workers
(actually, the broadcast object is written to a file and read from there by each worker)
release with unpersist()
77. broadcast variables

def load_address_table():
    return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
            "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}

address_table = sc.broadcast(load_address_table())

def find_address(name):
    res = None
    if name in address_table.value:
        res = address_table.value[name]
    return res

data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = data.map(lambda name: (name, find_address(name)))
pairRDD.collectAsMap()
78. partitioning
certain operations take advantage of partitioning
e.g., reduceByKey, join
anRDD.partitionBy(numPartitions, partitionFunc)
users can set the number of partitions and the partitioning function
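A small sketch (assuming sc as before): partitionBy is defined on pair rdds, and pre-partitioning by key can let a later reduceByKey or join avoid reshuffling the data.

pairRDD = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
# hash-partition into 4 partitions by key and keep the partitioned rdd around
partitioned = pairRDD.partitionBy(4, lambda key: hash(key)).persist()
partitioned.reduceByKey(lambda x, y: x + y).collect()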
79. working on a per-partition basis
spark provides operations that operate at the partition level
e.g., mapPartitions
used in the implementation of Spark itself

rdd = sc.parallelize(range(100), 4)
def f(iterator): yield sum(iterator)
rdd.mapPartitions(f).collect()
80. see the implementation of spark at https://github.com/apache/spark/
81. references
1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10.
2. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
3. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.
4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
5. Spark implementation: https://github.com/apache/spark/
6. "Making Big Data Processing Simple with Spark," Matei Zaharia, https://youtu.be/d9D-Z3-44F8