Modern Database Systems
Lecture 7
Aristides Gionis
Michael Mathioudakis

Spring 2016

logistics
assignment 1
  currently marking
  might upload best student solutions?
  will provide annotated pdfs with marks

virtualbox has same path & filename as previous one

continuing from last lecture
original paper on spark

previously... on modern database systems
generalization of mapreduce
  more suitable for iterative workflows
  iterative algorithms and repetitive querying

rdd: resilient distributed dataset
  read-only, lazily evaluated, easily re-created
  ephemeral, unless we need to keep them in memory

example: text search
suppose that a web service is experiencing errors
you want to search over terabytes of logs to find the cause
the logs are stored in Hadoop Filesystem (HDFS)
errors are written in the logs as lines that start with the keyword “ERROR”
  
example: text search

in Scala...

lines = spark.textFile("hdfs://...")             // rdd, from a file
errors = lines.filter(_.startsWith("ERROR"))     // rdd, transformation
errors.persist()                                 // hint: keep in memory!
// no work on the cluster so far
errors.count()                                   // action! -- lines is not loaded to ram!

[excerpt from the Spark paper] Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure. At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages: errors.count()

example - text search ctd.
let us find errors related to “MySQL”

example - text search ctd.

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
              transformation        action

example - text search ctd. again
let us find errors related to “HDFS” and extract their time field
assuming time is field no. 3 in tab-separated format

example - text search ctd. again

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))        // transformations
      .map(_.split('\t')(3))
      .collect()                         // action

[excerpt from the Spark paper] After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM.
  
example: text search
lineage of time fields

lines --[filter(_.startsWith("ERROR"))]--> errors (cached)
errors --[filter(_.contains("HDFS"))]--> HDFS errors
HDFS errors --[map(_.split('\t')(3))]--> time fields
(the last two transformations are pipelined)

Figure 1 of the Spark paper: lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations.

if a partition of errors is lost,
filter is applied only to the
corresponding partition of lines
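the same example in PySpark (a hedged sketch, not from the original slides; "hdfs://..." is a placeholder path):

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
errors.persist()                                     # hint: keep in memory
errors.count()                                       # action: triggers the computation
errors.filter(lambda line: "MySQL" in line).count()
(errors.filter(lambda line: "HDFS" in line)
       .map(lambda line: line.split("\t")[3])
       .collect())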
  
representing rdds
internal information about rdds

  partitions & partitioning scheme
  dependencies on parent RDDs
  function to compute it from parents
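a small PySpark sketch (an addition, not on the slide) of how some of this internal information can be inspected:

rdd = sc.parallelize(range(10), 4)        # 4 partitions
rdd.getNumPartitions()                    # partitioning: 4 partitions
child = rdd.filter(lambda x: x % 2 == 0)
child.toDebugString()                     # textual lineage: child and its parent(s)
                                          # (may come back as bytes, depending on the PySpark version)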
  
rdd dependencies

narrow dependencies
  each partition of the parent rdd is used by
  at most one partition of the child rdd

otherwise, wide dependencies
  
rdd dependencies

narrow dependencies: map, filter; union; join with inputs co-partitioned
wide dependencies: groupByKey; join with inputs not co-partitioned
(Figure 4 of the Spark paper: examples of narrow and wide dependencies)
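a hedged PySpark illustration (not from the slides): map and filter create narrow dependencies, while groupByKey creates a wide dependency and hence a shuffle:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
narrow = pairs.map(lambda kv: (kv[0], kv[1] * 10))   # narrow: computed partition-locally
wide = narrow.groupByKey()                           # wide: a key's values may come from many parent partitions
wide.toDebugString()                                 # the indentation in the lineage string marks the shuffle boundary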
  
scheduling
when an action is performed...
  (e.g., count() or save())
... the scheduler examines the lineage graph
builds a DAG of stages to execute

each stage is a maximal pipeline of
transformations over narrow dependencies

scheduling

[Figure 5 of the Spark paper: example of how Spark computes job stages for a job with map, union, groupBy, and join. Boxes with solid outlines are rdds; partitions are shaded rectangles, in black if they are already in ram.]
  
memory management
when not enough memory
apply LRU eviction policy at rdd level
  evict partition from least recently used rdd
  
performance
logistic regression and k-means
amazon EC2
10 iterations on 100GB datasets
100-node clusters
  
performance

Figure 7 of the Spark paper: duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster.

iteration time (s)     logistic regression            k-means
                       Hadoop  HadoopBM  Spark        Hadoop  HadoopBM  Spark
first iteration        80      139       46           115     182       82
later iterations       76      62        3            106     87        33
  
performance
logistic regression (2015)

[chart: running time (s) vs. number of iterations, Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
  
spark programming
with python

spark programming
  creating rdds
  transformations & actions
  lazy evaluation & persistence
  passing custom functions
  working with key-value pairs
  data partitioning
  accumulators & broadcast variables
  pyspark
  
driver program
contains the main function of the application
  defines rdds
  applies operations on them

e.g., the spark shell itself is a driver program

driver programs access spark through a SparkContext object
  
example

import pyspark
sc = pyspark.SparkContext(master = "local", appName = "tour")   # this part is assumed from now on
text = sc.textFile("myfile.txt")   # load data -- create rdd
text.count()                       # count lines -- operation

SparkContext is automatically created in the spark shell
if we are running on a cluster of machines,
different machines might count different parts of the file
  
example

text = sc.textFile("myfile.txt") # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)   # operation with custom function
spark_lines.count() # count lines

on a cluster, Spark ships the function to all workers
  
lambda functions in python

f = lambda line: 'Spark' in line
f("we are learning Spark")

# equivalently:
def f(line):
    return 'Spark' in line
f("we are learning Spark")
  
stopping

text = sc.textFile("myfile.txt") # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count() # count lines
sc.stop()
  
rdds
resilient distributed datasets

resilient
  easy to recover
distributed
  different partitions materialize on different nodes

read-only (immutable)
  but can be transformed to other rdds
  
creating rdds
loading an external dataset
  text = sc.textFile("myfile.txt")
distributing a collection of objects
  data = sc.parallelize( [0,1,2,3,4,5,6,7,8,9] )
transforming other rdds
  text_spark = text.filter(lambda line: 'Spark' in line)
  data_length = data.map(lambda num: num ** 2)
  
rdd operations

transformations
  return a new rdd
actions
  extract information from
  an rdd or save it to disk

inputRDD = sc.textFile("logfile.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badlinesRDD = errorsRDD.union(warningsRDD)
print("Input had", badlinesRDD.count(), "concerning lines.")
print("Here are some of them:")
for line in badlinesRDD.take(10):
    print(line)
  
rdd operations

transformations
  return a new rdd
actions
  extract information from
  an rdd or save it to disk

import numpy as np   # needed for np.sqrt below

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))   # create RDD

primesRDD = numbersRDD.filter(is_prime)                # transformation

primes = primesRDD.collect()                           # action

print(primes[:100])                                    # operation in driver

what if primes does not fit in memory?
  
rdd operations

transformations
  return a new rdd
actions
  extract information from
  an rdd or save it to disk

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))

primesRDD = numbersRDD.filter(is_prime)

primesRDD.saveAsTextFile("primes.txt")
  
rdds
evaluated lazily
ephemeral
can persist in memory (or disk) if we ask
  
lazy evaluation

numbersRDD = sc.parallelize(range(1, 1000000))

primesRDD = numbersRDD.filter(is_prime)
# no cluster activity until here
primesRDD.saveAsTextFile("primes.txt")

numbersRDD = sc.parallelize(range(1, 1000000))

primesRDD = numbersRDD.filter(is_prime)
# no cluster activity until here
primes = primesRDD.collect()

print(primes[:100])
  
persistence
RDDs can persist in memory,
if we ask politely

numbersRDD = sc.parallelize(list(range(1, 1000000)))

primesRDD = numbersRDD.filter(is_prime)

primesRDD.persist()

primesRDD.count()    # causes RDD to materialize

primesRDD.take(10)   # RDD already in memory
  
persistence
why?
[screenshot from jupyter notebook]
  
persistence
we can ask Spark to maintain rdds on disk
or even keep replicas on different nodes

data.persist(pyspark.StorageLevel(useDisk = True, useMemory = True, replication=2))

to cease persistence
data.unpersist()   # removes rdd from memory and disk
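an equivalent sketch using one of PySpark's predefined storage levels (an addition; which level to pick depends on the application):

data.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
# other predefined levels include MEMORY_ONLY, DISK_ONLY, and MEMORY_AND_DISK_2 (2 replicas)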
  
passing functions
lambda functions
function references

text = sc.textFile("myfile.txt")
text_spark = text.filter(lambda line: 'Spark' in line)

def f(line):
    return 'Spark' in line
text = sc.textFile("myfile.txt")
text_spark = text.filter(f)
  
passing functions
warning!
if a function is a member of an object (self.method) or
references fields of an object (e.g., self.field)...
Spark serializes and sends the entire object
to worker nodes
this can be very inefficient
  
passing functions
where is the problem in the code below?

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)
  
passing functions
where is the problem in the code below?

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)                # reference to object method
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)    # reference to object field
  
passing functions
better implementation

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd(self, rdd):
        query = self.query                       # copy the field into a local variable
        return rdd.filter(lambda x: query in x)  # the closure references only the local variable
  
common rdd operations
element-wise transformations
map and filter

inputRDD {1,2,3,4}
  .map(lambda x: x**2)     -> mappedRDD {1,4,9,16}
  .filter(lambda x: x!=1)  -> filteredRDD {2,3,4}

map's return type can be different from its input's
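the diagram above as runnable PySpark (a sketch):

inputRDD = sc.parallelize([1, 2, 3, 4])
mappedRDD = inputRDD.map(lambda x: x**2)          # [1, 4, 9, 16]
filteredRDD = inputRDD.filter(lambda x: x != 1)   # [2, 3, 4]
mappedRDD.collect(), filteredRDD.collect()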
  
common rdd operations
element-wise transformations
produce multiple elements per input element
flatMap

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.count()
9
  
common rdd operations
how is the result different?

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.collect()
['hello', 'world', 'how', 'are', 'you', 'how', 'do', 'you', 'do']

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.map(lambda phrase: phrase.split(" "))
words.collect()
[['hello', 'world'], ['how', 'are', 'you'], ['how', 'do', 'you', 'do']]
  
common rdd operations
(pseudo) set operations

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
  
common rdd operations
(pseudo) set operations
union

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

oneRDD.union(otherRDD).collect()
[1, 1, 1, 2, 3, 3, 4, 4, 1, 4, 4, 7]
  
common rdd operations
(pseudo) set operations
subtraction

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

oneRDD.subtract(otherRDD).collect()
[2, 3, 3]
  
common rdd operations
(pseudo) set operations
duplicate removal

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

oneRDD.distinct().collect()
[1, 2, 3, 4]
  
common rdd operations
(pseudo) set operations
intersection

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

oneRDD.intersection(otherRDD).collect()
[1, 4]
removes duplicates!
  
common rdd operations
(pseudo) set operations
cartesian product

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

oneRDD.cartesian(otherRDD).collect()[:5]
[(1, 1), (1, 4), (1, 4), (1, 7), (1, 1)]
  
common rdd operations
(pseudo) set operations
union, subtraction, duplicate removal, intersection, cartesian product

big difference in implementation (and efficiency):
partition shuffling -- no for union; yes for subtraction, duplicate removal, and intersection
  
common rdd operations
sortBy

how is sortBy implemented?
we'll see later...

data = sc.parallelize(np.random.rand(10))
data.sortBy(lambda x: x)
  
common rdd operations
actions

reduce
  successively operates on two elements of rdd
  returns new element of same type
  commutative & associative functions

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x + y)
181

data.reduce(lambda x, y: x * y)
3188536
  
commutative & associative

function f(x, y)
  e.g., add(x, y) = x + y

commutative
  f(x, y) = f(y, x)
  e.g., add(x, y) = x + y = y + x = add(y, x)

associative
  f(x, f(y, z)) = f(f(x, y), z)
  e.g., add(x, add(y, z)) = x + (y + z) = (x + y) + z = add(add(x, y), z)
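a minimal plain-Python check (an addition) of why this matters for reduce: the function used on the next slide, f(x, y) = x**2 + y**2, is commutative but not associative, so the result depends on how elements are grouped:

f = lambda x, y: x**2 + y**2
f(f(1, 2), 3)   # (1**2 + 2**2)**2 + 3**2 = 34
f(1, f(2, 3))   # 1**2 + (2**2 + 3**2)**2 = 170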
  
common rdd operations
actions

reduce
  successively operates on two elements of rdd
  produces single aggregate

compute sum of squares of data
is the following correct?

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x**2 + y**2)
137823683725010149883130929

no - why?
  
common rdd operations
actions

reduce
  successively operates on two elements of rdd
  produces single aggregate

compute sum of squares of data
is the following correct?

data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2
8927.0

yes - why?
  
common rdd operations
actions

aggregate
generalizes reduce
the user provides
  a zero value
    the identity element for aggregation
  a sequential operation (function)
    to update aggregation for one more element in one partition
  a combining operation (function)
    to combine aggregates from different partitions
  
common rdd operations
actions

aggregate
generalizes reduce

what does the following compute?

data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue = (0,0),                                    # zero value
                      seqOp = (lambda x, y: (x[0] + y, x[1] + 1)),          # sequential operation
                      combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1])))   # combining operation
aggr[0] / aggr[1]

the average value of data
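a plain-Python sanity check (an addition): the aggregate above builds a (sum, count) pair and divides, i.e. it computes the mean:

data = [1, 43, 62, 23, 52]
sum(data) / len(data)   # 36.2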
  
common rdd operations
actions

operation: return
  collect: all elements
  take(num): num elements; tries to minimize disk access (e.g., by accessing one partition)
  takeSample: a random sample of elements
  count: number of elements
  countByValue: number of times each element appears; first on each partition, then combines partition results
  top(num): num maximum elements; sorts partitions and merges
  
common rdd operations

all operations we have described
so far apply to all rdds

that's why the word “common” has been in the title
  
pair rdds
elements are key-value pairs

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.collect()[:5]
[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

they come from the mapreduce model
practical in many cases
spark provides operations tailored to pair rdds
  
transformations on pair rdds
keys and values

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))

pairRDD.keys().collect()[:5]
[0, 1, 2, 3, 4]

pairRDD.values().collect()[:5]
[0, 1, 4, 9, 16]
  
transformations on pair rdds
reduceByKey

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
    ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
volumePerKey.collect()
[('$APPL', 201.16), ('$AMZN', 1104.64), ('$GOOG', 706.2)]

reduceByKey is a transformation
unlike reduce
  
transformations on pair rdds

combineByKey
generalizes reduceByKey
user provides
  createCombiner function
    provides the zero value for each key
  mergeValue function
    combines current aggregate in one partition with new value
  mergeCombiners function
    to combine aggregates from partitions
  
transformations on pair rdds

combineByKey
generalizes reduceByKey

what does the following produce?

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
    ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
aggr = pairRDD.combineByKey(createCombiner = lambda x: (x, 1),
    mergeValue = lambda x, y: (x[0] + y, x[1] + 1),
    mergeCombiners = lambda x, y: (x[0] + y[0], x[1] + y[1]))
avgPerKey = aggr.map(lambda x: (x[0], x[1][0]/x[1][1]))
avgPerKey.collect()
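(a hedged answer, not printed on the slide: the average price per key, roughly [('$APPL', 100.58), ('$AMZN', 552.32), ('$GOOG', 706.2)]; key order and floating-point formatting may differ)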
  
transformations on pair rdds
sortByKey

  samples values from rdd to
  estimate sorted partition boundaries
  shuffles data
  sorts by external sorting

used to implement common sortBy
  idea: create a pair RDD with (sort-key, item) elements
  apply sortByKey on that
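a minimal sketch of the idea above (not Spark's actual implementation): build (sort-key, item) pairs, sort by key, then drop the keys:

items = sc.parallelize(["banana", "apple", "cherry"])
(items.keyBy(lambda item: item)   # (sort-key, item) pairs
      .sortByKey()
      .values()
      .collect())                 # ['apple', 'banana', 'cherry']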
  
transformations on pair rdds
(inner) join

implemented with a variant of hash join
spark also has functions for
left outer join, right outer join, full outer join
  
transformations on pair rdds
(inner) join

course_a = sc.parallelize([ ("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
course_b = sc.parallelize([ ("Leena", 10), ("Tuukka", 10)])
result = course_a.join(course_b)
result.collect()
[('Tuukka', (10, 10)), ('Leena', (9, 10))]
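a hedged sketch of one of the outer-join variants on the same data: leftOuterJoin keeps every key of course_a and fills missing grades with None:

course_a.leftOuterJoin(course_b).collect()
# e.g. [('Antti', (8, None)), ('Tuukka', (10, 10)), ('Leena', (9, 10))]  (key order may differ)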
  
transformations on pair rdds
other transformations
groupByKey

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
    ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
pairRDD.groupByKey().collect()
[('$APPL', < values >), ('$AMZN', < values >), ('$GOOG', < values >)]

for grouping together multiple rdds
cogroup and groupWith
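in PySpark the grouped values come back as an iterable per key; a common follow-up (a sketch) is to materialize them as lists:

pairRDD.groupByKey().mapValues(list).collect()
# e.g. [('$APPL', [100.64, 100.52]), ('$AMZN', [552.32, 552.32]), ('$GOOG', [706.2])]  (key order may differ)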
  
actions on pair rdds

pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2),
    ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])

countByKey
pairRDD.countByKey()
{'$AMZN': 2, '$APPL': 2, '$GOOG': 1}

collectAsMap
pairRDD.collectAsMap()
{'$AMZN': 552.32, '$APPL': 100.52, '$GOOG': 706.2}

lookup(key)
pairRDD.lookup("$APPL")
[100.64, 100.52]
  
shared variables

accumulators
  write-only for workers

broadcast variables
  read-only for workers
  
accumulators

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1
    return length

llengthRDD = text.map(line_len)
llengthRDD.count()
95

long_lines.value
45

lazy!
  
accumulators
fault tolerance

spark executes updates in actions only once
  e.g., foreach()
  foreach: special action

this is not guaranteed for transformations
in transformations, use accumulators
only for debugging purposes!
  
accumulators + foreach

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1

text.foreach(line_len)
long_lines.value
45
  
broadcast variables

sent to workers only once
read-only
  even if you change its value on a worker,
  the change does not propagate to other workers
  (actually the broadcast object is written to a file,
  read from there by each worker)

release with unpersist()
  
broadcast variables

def load_address_table():
    return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
            "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}

address_table = sc.broadcast(load_address_table())

def find_address(name):
    res = None
    if name in address_table.value:
        res = address_table.value[name]
    return res

data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = data.map(lambda name: (name, find_address(name)))
pairRDD.collectAsMap()
  
partitioning

certain operations take advantage of partitioning
  e.g., reduceByKey, join

anRDD.partitionBy(numPartitions, partitionFunc)
users can set number of partitions and partitioning function
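a minimal sketch (an addition): hash-partition a pair rdd so that subsequent per-key operations find each key's records co-located:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = pairs.partitionBy(4).persist()   # the default partitionFunc hashes the key
partitioned.getNumPartitions()                 # 4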
  
working on per-partition basis

spark provides operations that operate at partition level
  e.g., mapPartitions

rdd = sc.parallelize(range(100), 4)
def f(iterator): yield sum(iterator)
rdd.mapPartitions(f).collect()

used in implementation of Spark
  
see implementation of spark on
https://github.com/apache/spark/
  
references

1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10.
2. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
3. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.
4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
5. Spark implementation: https://github.com/apache/spark/
6. "Making Big Data Processing Simple with Spark," Matei Zaharia, https://youtu.be/d9D-Z3-44F8
  

How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 

Lecture 07 - CS-5040 - modern database systems

  • 8. example - text search ctd.
    // Count errors mentioning MySQL:
    errors.filter(_.contains("MySQL")).count()
    a transformation followed by an action
    Michael Mathioudakis 8
  • 9. example - text search ctd. again
    let us find errors related to "HDFS" and extract their time field,
    assuming time is field no. 3 in tab-separated format
    Michael Mathioudakis 9
  • 10. example - text search ctd. again
    // Return the time fields of errors mentioning
    // HDFS as an array (assuming time is field
    // number 3 in a tab-separated format):
    errors.filter(_.contains("HDFS"))
          .map(_.split('\t')(3))
          .collect()
    transformations followed by an action
    Michael Mathioudakis 10
  • 11. example: text search - lineage of time fields
    lines -> filter(_.startsWith("ERROR")) -> errors (cached)
    errors -> filter(_.contains("HDFS")) -> HDFS errors -> map(_.split('\t')(3)) -> time fields
    the last two transformations are pipelined
    if a partition of errors is lost, the filter is applied only to the corresponding partition of lines
    Michael Mathioudakis 11
  • 12. representing rdds
    internal information about rdds:
    partitions & partitioning scheme
    dependencies on parent RDDs
    function to compute it from parents
    Michael Mathioudakis 12
  • 13. rdd dependencies
    narrow dependencies: each partition of the parent rdd is used by at most one partition of the child rdd
    otherwise, wide dependencies
    Michael Mathioudakis 13
  • 14. rdd dependencies
    Figure 4: examples of narrow and wide dependencies
    narrow dependencies: map, filter, union, join with co-partitioned inputs
    wide dependencies: groupByKey, join with inputs not co-partitioned
    Michael Mathioudakis 14
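To make the distinction concrete, here is a minimal PySpark sketch (the RDD names and the appName are illustrative, not from the slides): map and filter create narrow dependencies, so each child partition reads a single parent partition, while groupByKey creates a wide dependency and triggers a shuffle.

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="dependencies-sketch")
    nums = sc.parallelize(range(10), 2)          # 2 partitions

    # narrow dependencies: each child partition reads exactly one parent partition
    squares = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

    # wide dependency: values with the same key may live in different parent
    # partitions, so Spark has to shuffle them before grouping
    groups = nums.map(lambda x: (x % 3, x)).groupByKey()

    print(squares.collect())
    print(groups.mapValues(list).collect())
    sc.stop()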
  • 15. scheduling
    when an action is performed... (e.g., count() or save())
    ...the scheduler examines the lineage graph and builds a DAG of stages to execute
    each stage is a maximal pipeline of transformations over narrow dependencies
    Michael Mathioudakis 15
  • 16. scheduling
    Figure 5: example of how Spark computes job stages; boxes with solid outlines are RDDs,
    partitions are shaded rectangles, in black if they are already in memory (RAM)
    Michael Mathioudakis 16
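A quick way to see where the scheduler will cut stage boundaries is to print an RDD's lineage. This is a sketch, assuming only the standard toDebugString method; the wide dependency introduced by groupByKey shows up as a separate level in the printed lineage.

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="stages-sketch")
    pairs = sc.parallelize(range(100), 4).map(lambda x: (x % 10, x))
    grouped = pairs.groupByKey().mapValues(list)   # wide dependency -> new stage

    # toDebugString describes the RDD and its recursive dependencies;
    # depending on the PySpark version it may return bytes, so decode defensively
    s = grouped.toDebugString()
    print(s.decode() if isinstance(s, bytes) else s)
    sc.stop()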
  • 17. memory management
    when not enough memory: apply an LRU eviction policy at the rdd level,
    i.e., evict a partition from the least recently used rdd
    Michael Mathioudakis 17
  • 18. performance
    logistic regression and k-means on Amazon EC2
    10 iterations on 100GB datasets, 100-node clusters
    Michael Mathioudakis 18
  • 19. performance
    Figure 7: duration of the first and later iterations in Hadoop, HadoopBinMem and Spark
    for logistic regression and k-means using 100 GB of data on a 100-node cluster
    Michael Mathioudakis 19
  • 20. performance: logistic regression (2015)
    running time (s) vs. number of iterations, Hadoop vs. Spark
    Hadoop: 110 s / iteration; Spark: first iteration 80 s, further iterations 1 s
    Michael Mathioudakis 20
  • 21. spark  programming   with  python   Michael  Mathioudakis   21  
  • 22. spark programming
    creating rdds; transformations & actions; lazy evaluation & persistence;
    passing custom functions; working with key-value pairs; data partitioning;
    accumulators & broadcast variables; pyspark
    Michael Mathioudakis 22
  • 23. driver program
    contains the main function of the application, defines rdds, applies operations on them
    e.g., the spark shell itself is a driver program
    driver programs access spark through a SparkContext object
    Michael Mathioudakis 23
  • 24. example
    import pyspark
    sc = pyspark.SparkContext(master = "local", appName = "tour")   # SparkContext (created automatically in the spark shell); this part is assumed from now on
    text = sc.textFile("myfile.txt")   # load data (create rdd)
    text.count()   # count lines (operation)
    if we are running on a cluster of machines, different machines might count different parts of the file
    Michael Mathioudakis 24
  • 25. example
    text = sc.textFile("myfile.txt")   # load data
    # keep only lines that mention "Spark"
    spark_lines = text.filter(lambda line: 'Spark' in line)
    spark_lines.count()   # count lines
    an operation with a custom function: on a cluster, Spark ships the function to all workers
    Michael Mathioudakis 25
  • 26. lambda functions in python
    f = (lambda line: 'Spark' in line)
    f("we are learning Spark")
    def f(line):
        return 'Spark' in line
    f("we are learning Spark")
    Michael Mathioudakis 26
  • 27. stopping
    text = sc.textFile("myfile.txt")   # load data
    # keep only lines that mention "Spark"
    spark_lines = text.filter(lambda line: 'Spark' in line)
    spark_lines.count()   # count lines
    sc.stop()
    Michael Mathioudakis 27
  • 28. rdds: resilient distributed datasets
    resilient: easy to recover
    distributed: different partitions materialize on different nodes
    read-only (immutable), but can be transformed to other rdds
    Michael Mathioudakis 28
  • 29. creating rdds
    loading an external dataset: text = sc.textFile("myfile.txt")
    distributing a collection of objects: data = sc.parallelize([0,1,2,3,4,5,6,7,8,9])
    transforming other rdds:
    text_spark = text.filter(lambda line: 'Spark' in line)
    data_length = data.map(lambda num: num ** 2)
    Michael Mathioudakis 29
  • 30. rdd operations
    transformations: return a new rdd
    actions: extract information from an rdd or save it to disk
    inputRDD = sc.textFile("logfile.txt")
    errorsRDD = inputRDD.filter(lambda x: "error" in x)
    warningsRDD = inputRDD.filter(lambda x: "warning" in x)
    badlinesRDD = errorsRDD.union(warningsRDD)
    print("Input had", badlinesRDD.count(), "concerning lines.")
    print("Here are some of them:")
    for line in badlinesRDD.take(10):
        print(line)
    Michael Mathioudakis 30
  • 31. rdd operations
    transformations return a new rdd; actions extract information from an rdd or save it to disk
    import numpy as np   # needed for np.sqrt below
    def is_prime(num):
        if num < 1 or num % 1 != 0:
            raise Exception("invalid argument")
        for d in range(2, int(np.sqrt(num) + 1)):
            if num % d == 0:
                return False
        return True
    numbersRDD = sc.parallelize(list(range(1, 1000000)))   # create RDD
    primesRDD = numbersRDD.filter(is_prime)   # transformation
    primes = primesRDD.collect()   # action
    print(primes[:100])   # operation in the driver
    what if primes does not fit in memory?
    Michael Mathioudakis 31
  • 32. rdd operations
    transformations return a new rdd; actions extract information from an rdd or save it to disk
    def is_prime(num):
        if num < 1 or num % 1 != 0:
            raise Exception("invalid argument")
        for d in range(2, int(np.sqrt(num) + 1)):
            if num % d == 0:
                return False
        return True
    numbersRDD = sc.parallelize(list(range(1, 1000000)))
    primesRDD = numbersRDD.filter(is_prime)
    primesRDD.saveAsTextFile("primes.txt")
    Michael Mathioudakis 32
  • 33. rdds
    evaluated lazily; ephemeral
    can persist in memory (or on disk) if we ask
    Michael Mathioudakis 33
  • 34. lazy evaluation
    numbersRDD = sc.parallelize(range(1, 1000000))
    primesRDD = numbersRDD.filter(is_prime)
    primesRDD.saveAsTextFile("primes.txt")   # no cluster activity until here
    numbersRDD = sc.parallelize(range(1, 1000000))
    primesRDD = numbersRDD.filter(is_prime)
    primes = primesRDD.collect()   # no cluster activity until here
    print(primes[:100])
    Michael Mathioudakis 34
  • 35. persistence
    RDDs can persist in memory, if we ask politely
    numbersRDD = sc.parallelize(list(range(1, 1000000)))
    primesRDD = numbersRDD.filter(is_prime)
    primesRDD.persist()
    primesRDD.count()    # causes the RDD to materialize
    primesRDD.take(10)   # RDD already in memory
    Michael Mathioudakis 35
  • 36. persistence   why?   screenshot  from  jupyter  notebook   Michael  Mathioudakis   36  
  • 37. persistence
    we can ask Spark to maintain rdds on disk, or even keep replicas on different nodes
    data.persist(pyspark.StorageLevel(useDisk = True, useMemory = True, replication=2))
    to cease persistence: data.unpersist() removes the rdd from memory and disk
    Michael Mathioudakis 37
  • 38. passing functions
    lambda functions:
    text = sc.textFile("myfile.txt")
    text_spark = text.filter(lambda line: 'Spark' in line)
    function references:
    def f(line):
        return 'Spark' in line
    text = sc.textFile("myfile.txt")
    text_spark = text.filter(f)
    Michael Mathioudakis 38
  • 39. passing functions
    warning! if the function is a member of an object (self.method) or references fields of an object (e.g., self.field)...
    Spark serializes and sends the entire object to worker nodes
    this can be very inefficient
    Michael Mathioudakis 39
  • 40. passing functions
    where is the problem in the code below?
    class SearchFunctions(object):
        def __init__(self, query):
            self.query = query
        def is_match(self, s):
            return self.query in s
        def get_matches_in_rdd_v1(self, rdd):
            return rdd.filter(self.is_match)
        def get_matches_in_rdd_v2(self, rdd):
            return rdd.filter(lambda x: self.query in x)
    Michael Mathioudakis 40
  • 41. passing functions
    where is the problem in the code below?
    class SearchFunctions(object):
        def __init__(self, query):
            self.query = query
        def is_match(self, s):
            return self.query in s
        def get_matches_in_rdd_v1(self, rdd):
            return rdd.filter(self.is_match)               # reference to an object method
        def get_matches_in_rdd_v2(self, rdd):
            return rdd.filter(lambda x: self.query in x)   # reference to an object field
    Michael Mathioudakis 41
  • 42. passing functions
    better implementation:
    class SearchFunctions(object):
        def __init__(self, query):
            self.query = query
        def is_match(self, s):
            return self.query in s
        def get_matches_in_rdd(self, rdd):
            query = self.query
            return rdd.filter(lambda x: query in x)
    Michael Mathioudakis 42
  • 43. common rdd operations
    element-wise transformations: map and filter
    inputRDD {1,2,3,4} -- .map(lambda x: x**2) --> mappedRDD {1,4,9,16}
    inputRDD {1,2,3,4} -- .filter(lambda x: x!=1) --> filteredRDD {2,3,4}
    map's return type can be different from its input's
    Michael Mathioudakis 43
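As a runnable version of the diagram above (a minimal sketch; the variable names follow the slide, the appName is arbitrary):

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="map-filter-sketch")
    inputRDD = sc.parallelize([1, 2, 3, 4])
    mappedRDD = inputRDD.map(lambda x: x ** 2)       # {1, 4, 9, 16}
    filteredRDD = inputRDD.filter(lambda x: x != 1)  # {2, 3, 4}
    print(mappedRDD.collect())
    print(filteredRDD.collect())
    sc.stop()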
  • 44. common rdd operations
    element-wise transformations that produce multiple elements per input element: flatMap
    phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
    words = phrases.flatMap(lambda phrase: phrase.split(" "))
    words.count()   # 9
    Michael Mathioudakis 44
  • 45. common rdd operations
    how is the result different?
    phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
    words = phrases.flatMap(lambda phrase: phrase.split(" "))
    words.collect()
    # ['hello', 'world', 'how', 'are', 'you', 'how', 'do', 'you', 'do']
    phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
    words = phrases.map(lambda phrase: phrase.split(" "))
    words.collect()
    # [['hello', 'world'], ['how', 'are', 'you'], ['how', 'do', 'you', 'do']]
    Michael Mathioudakis 45
  • 46. common rdd operations
    (pseudo) set operations
    oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
    oneRDD.persist()
    otherRDD = sc.parallelize([1, 4, 4, 7])
    otherRDD.persist()
    Michael Mathioudakis 46
  • 47. common rdd operations
    (pseudo) set operations: union
    oneRDD.union(otherRDD).collect()
    # [1, 1, 1, 2, 3, 3, 4, 4, 1, 4, 4, 7]
    Michael Mathioudakis 47
  • 48. common rdd operations
    (pseudo) set operations: subtraction
    oneRDD.subtract(otherRDD).collect()
    # [2, 3, 3]
    Michael Mathioudakis 48
  • 49. common rdd operations
    (pseudo) set operations: duplicate removal
    oneRDD.distinct().collect()
    # [1, 2, 3, 4]
    Michael Mathioudakis 49
  • 50. common rdd operations
    (pseudo) set operations: intersection (removes duplicates!)
    oneRDD.intersection(otherRDD).collect()
    # [1, 4]
    Michael Mathioudakis 50
  • 51. common rdd operations
    (pseudo) set operations: cartesian product
    oneRDD.cartesian(otherRDD).collect()[:5]
    # [(1, 1), (1, 4), (1, 4), (1, 7), (1, 1)]
    Michael Mathioudakis 51
  • 52. common rdd operations
    (pseudo) set operations: union, subtraction, duplicate removal, intersection, cartesian product
    partition shuffling: yes for some, no for others - a big difference in implementation (and efficiency)
    (union does not shuffle; subtraction, duplicate removal and intersection do)
    Michael Mathioudakis 52
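One way to check which of these operations move data is to inspect the lineage again; a sketch, assuming only the standard toDebugString method (union simply concatenates partitions, while distinct repartitions elements by value):

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="set-ops-sketch")
    oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
    otherRDD = sc.parallelize([1, 4, 4, 7])

    def show(rdd):
        # print the lineage; a shuffle shows up as an extra level in the output
        s = rdd.toDebugString()
        print(s.decode() if isinstance(s, bytes) else s)

    show(oneRDD.union(otherRDD))   # no shuffle: partitions are just concatenated
    show(oneRDD.distinct())        # shuffle: elements are repartitioned by value
    sc.stop()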
  • 53. common rdd operations: sortBy
    data = sc.parallelize(np.random.rand(10))
    data.sortBy(lambda x: x)
    how is sortBy implemented? we'll see later...
    Michael Mathioudakis 53
  • 54. common rdd operations: actions
    reduce: successively operates on two elements of the rdd and returns a new element of the same type
    (use commutative & associative functions)
    data = sc.parallelize([1,43,62,23,52])
    data.reduce(lambda x, y: x + y)   # 181
    data.reduce(lambda x, y: x * y)   # 3188536
    Michael Mathioudakis 54
  • 55. commutative & associative
    function f(x, y), e.g., add(x, y) = x + y
    commutative: f(x, y) = f(y, x)
    e.g., add(x, y) = x + y = y + x = add(y, x)
    associative: f(x, f(y, z)) = f(f(x, y), z)
    e.g., add(x, add(y, z)) = x + (y + z) = (x + y) + z = add(add(x, y), z)
    Michael Mathioudakis 55
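A small illustration, in plain Python, of why reduce needs such a function: with an associative function the grouping of partial results does not matter, whereas the squaring combiner used on the next slide gives different answers for different groupings.

    add = lambda x, y: x + y
    bad = lambda x, y: x ** 2 + y ** 2   # not associative

    x, y, z = 2, 3, 4
    print(add(add(x, y), z) == add(x, add(y, z)))   # True: grouping does not matter
    print(bad(bad(x, y), z) == bad(x, bad(y, z)))   # False: 185 vs 629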
  • 56. common rdd operations: actions
    reduce: successively operates on two elements of the rdd and produces a single aggregate
    compute the sum of squares of data - is the following correct?
    data = sc.parallelize([1,43,62,23,52])
    data.reduce(lambda x, y: x**2 + y**2)   # 137823683725010149883130929
    no - why?
    56
  • 57. common rdd operations: actions
    reduce: successively operates on two elements of the rdd and produces a single aggregate
    compute the sum of squares of data - is the following correct?
    data = sc.parallelize([1,43,62,23,52])
    data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2   # 8927.0
    yes - why?
    Michael Mathioudakis 57
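An arguably simpler way to get the same sum of squares, and a common idiom, is to square inside a map and then reduce with plain addition; a sketch (the appName is arbitrary), equivalent in result to the slide's approach:

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="sum-of-squares-sketch")
    data = sc.parallelize([1, 43, 62, 23, 52])
    # square element-wise first, then add up with an associative, commutative function
    sum_of_squares = data.map(lambda x: x ** 2).reduce(lambda x, y: x + y)
    print(sum_of_squares)   # 8927
    sc.stop()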
  • 58. common rdd operations: actions
    aggregate generalizes reduce; the user provides
    a zero value: the identity element for the aggregation
    a sequential operation (function): to update the aggregation for one more element in one partition
    a combining operation (function): to combine aggregates from different partitions
    Michael Mathioudakis 58
  • 59. common rdd operations: actions
    aggregate generalizes reduce - what does the following compute?
    data = sc.parallelize([1,43,62,23,52])
    aggr = data.aggregate(zeroValue = (0,0),
                          seqOp = (lambda x, y: (x[0] + y, x[1] + 1)),
                          combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1])))
    aggr[0] / aggr[1]
    the average value of data
    Michael Mathioudakis 59
  • 60. common rdd operations: actions
    operation: what it returns
    collect: all elements
    take(num): num elements; tries to minimize disk access (e.g., by accessing one partition)
    takeSample: a random sample of elements
    count: number of elements
    countByValue: number of times each element appears; first on each partition, then combines partition results
    top(num): num maximum elements; sorts partitions and merges
    Michael Mathioudakis 60
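A quick sketch exercising these actions on a small RDD (the values are chosen arbitrarily for illustration):

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="actions-sketch")
    data = sc.parallelize([5, 3, 3, 1, 4, 4, 4])
    print(data.collect())                  # all elements
    print(data.take(3))                    # 3 elements, reading as few partitions as possible
    print(data.takeSample(False, 2))       # 2 random elements, without replacement
    print(data.count())                    # 7
    print(dict(data.countByValue()))       # {5: 1, 3: 2, 1: 1, 4: 3}
    print(data.top(2))                     # [5, 4]
    sc.stop()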
  • 61. common rdd operations
    all the operations we have described so far apply to all rdds
    that's why the word "common" has been in the title
    Michael Mathioudakis 61
  • 62. pair rdds
    elements are key-value pairs
    pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
    pairRDD.collect()[:5]
    # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
    they come from the mapreduce model and are practical in many cases
    spark provides operations tailored to pair rdds
    Michael Mathioudakis 62
  • 63. transformations on pair rdds: keys and values
    pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
    pairRDD.keys().collect()[:5]     # [0, 1, 2, 3, 4]
    pairRDD.values().collect()[:5]   # [0, 1, 4, 9, 16]
    Michael Mathioudakis 63
  • 64. transformations on pair rdds: reduceByKey
    pairRDD = sc.parallelize([
        ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
        ('$APPL', 100.52), ('$AMZN', 552.32)])
    volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
    volumePerKey.collect()
    # [('$APPL', 201.16), ('$AMZN', 1104.64), ('$GOOG', 706.2)]
    reduceByKey is a transformation, unlike reduce
    Michael Mathioudakis 64
  • 65. transformations on pair rdds: combineByKey
    combineByKey generalizes reduceByKey; the user provides
    a createCombiner function: provides the zero value for each key
    a mergeValue function: combines the current aggregate in one partition with a new value
    a mergeCombiners function: combines aggregates from different partitions
    Michael Mathioudakis 65
  • 66. transformations on pair rdds: combineByKey generalizes reduceByKey
    what does the following produce?
    pairRDD = sc.parallelize([
        ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
        ('$APPL', 100.52), ('$AMZN', 552.32)])
    aggr = pairRDD.combineByKey(createCombiner = lambda x: (x, 1),
                                mergeValue = lambda x, y: (x[0] + y, x[1] + 1),
                                mergeCombiners = lambda x, y: (x[0] + y[0], x[1] + y[1]))
    avgPerKey = aggr.map(lambda x: (x[0], x[1][0]/x[1][1]))
    avgPerKey.collect()
    Michael Mathioudakis 66
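The answer: the average value per key, since each value is turned into a (sum, count) pair and the final map divides sum by count. A sketch of an equivalent formulation using mapValues and reduceByKey, on the same data (variable names are illustrative):

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="avg-per-key-sketch")
    pairRDD = sc.parallelize([
        ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
        ('$APPL', 100.52), ('$AMZN', 552.32)])
    # (value, 1) pairs carry the running sum and count through the reduce
    sums_counts = pairRDD.mapValues(lambda v: (v, 1)) \
                         .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    avgPerKey = sums_counts.mapValues(lambda sc_pair: sc_pair[0] / sc_pair[1])
    print(avgPerKey.collect())
    # roughly [('$APPL', 100.58), ('$AMZN', 552.32), ('$GOOG', 706.2)] (order may vary)
    sc.stop()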
  • 67. transformations on pair rdds: sortByKey
    samples values from the rdd to estimate sorted partition boundaries
    shuffles the data, then sorts by external sorting
    used to implement the common sortBy
    idea: create a pair RDD with (sort-key, item) elements and apply sortByKey on it
    Michael Mathioudakis 67
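A minimal sketch of the idea described above (sort_by and keyfunc are illustrative names; the real sortBy implementation is more involved):

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="sortby-sketch")
    data = sc.parallelize([0.42, 0.07, 0.99, 0.51])

    def sort_by(rdd, keyfunc):
        # pair each item with its sort key, sort by the key, then drop the key
        return rdd.map(lambda item: (keyfunc(item), item)) \
                  .sortByKey() \
                  .values()

    print(sort_by(data, lambda x: x).collect())   # [0.07, 0.42, 0.51, 0.99]
    sc.stop()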
  • 68. transformations on pair rdds: (inner) join
    implemented with a variant of hash join
    spark also has functions for left outer join, right outer join, full outer join
    Michael Mathioudakis 68
  • 69. transformations on pair rdds: (inner) join
    course_a = sc.parallelize([("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
    course_b = sc.parallelize([("Leena", 10), ("Tuukka", 10)])
    result = course_a.join(course_b)
    result.collect()
    # [('Tuukka', (10, 10)), ('Leena', (9, 10))]
    Michael Mathioudakis 69
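For the outer-join variants mentioned on the previous slide, keys missing from one side are paired with None; a sketch on the same course data:

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="outer-join-sketch")
    course_a = sc.parallelize([("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
    course_b = sc.parallelize([("Leena", 10), ("Tuukka", 10)])
    # left outer join keeps every key of course_a; missing right values become None
    print(course_a.leftOuterJoin(course_b).collect())
    # e.g. [('Antti', (8, None)), ('Tuukka', (10, 10)), ('Leena', (9, 10))] (order may vary)
    print(course_a.fullOuterJoin(course_b).collect())
    sc.stop()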
  • 70. transformations on pair rdds: other transformations
    groupByKey:
    pairRDD = sc.parallelize([
        ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
        ('$APPL', 100.52), ('$AMZN', 552.32)])
    pairRDD.groupByKey().collect()
    # [('$APPL', <values>), ('$AMZN', <values>), ('$GOOG', <values>)]
    for grouping together multiple rdds: cogroup and groupWith
    Michael Mathioudakis 70
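The grouped values above are returned as iterables; to actually see them, one can materialize each group with mapValues(list). A small sketch on the same data:

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="groupbykey-sketch")
    pairRDD = sc.parallelize([
        ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
        ('$APPL', 100.52), ('$AMZN', 552.32)])
    grouped = pairRDD.groupByKey().mapValues(list)   # turn each iterable of values into a list
    print(grouped.collect())
    # e.g. [('$APPL', [100.64, 100.52]), ('$AMZN', [552.32, 552.32]), ('$GOOG', [706.2])]
    sc.stop()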
  • 71. actions on pair rdds
    pairRDD = sc.parallelize([
        ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
        ('$APPL', 100.52), ('$AMZN', 552.32)])
    countByKey: pairRDD.countByKey()
    # {'$AMZN': 2, '$APPL': 2, '$GOOG': 1}
    collectAsMap: pairRDD.collectAsMap()
    # {'$AMZN': 552.32, '$APPL': 100.52, '$GOOG': 706.2}
    lookup(key): pairRDD.lookup("$APPL")
    # [100.64, 100.52]
    Michael Mathioudakis 71
  • 72. shared variables
    accumulators: write-only for workers
    broadcast variables: read-only for workers
    Michael Mathioudakis 72
  • 73. accumulators
    text = sc.textFile("myfile.txt")
    long_lines = sc.accumulator(0)
    def line_len(line):
        global long_lines
        length = len(line)
        if length > 30:
            long_lines += 1
        return length
    llengthRDD = text.map(line_len)   # lazy!
    llengthRDD.count()                # 95
    long_lines.value                  # 45
    Michael Mathioudakis 73
  • 74. accumulators: fault tolerance
    spark executes updates in actions only once, e.g., foreach() (foreach: a special action)
    this is not guaranteed for transformations
    in transformations, use accumulators only for debugging purposes!
    Michael Mathioudakis 74
  • 75. accumulators + foreach
    text = sc.textFile("myfile.txt")
    long_lines = sc.accumulator(0)
    def line_len(line):
        global long_lines
        length = len(line)
        if length > 30:
            long_lines += 1
    text.foreach(line_len)
    long_lines.value   # 45
    Michael Mathioudakis 75
  • 76. broadcast variables
    sent to workers only once; read-only
    even if you change its value on a worker, the change does not propagate to other workers
    (actually, the broadcast object is written to a file and read from there by each worker)
    release with unpersist()
    Michael Mathioudakis 76
  • 77. broadcast variables
    def load_address_table():
        return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
                "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}
    address_table = sc.broadcast(load_address_table())
    def find_address(name):
        res = None
        if name in address_table.value:
            res = address_table.value[name]
        return res
    data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
    pairRDD = data.map(lambda name: (name, find_address(name)))
    pairRDD.collectAsMap()
    Michael Mathioudakis 77
  • 78. partitioning
    certain operations take advantage of partitioning, e.g., reduceByKey, join
    anRDD.partitionBy(numPartitions, partitionFunc)
    users can set the number of partitions and the partitioning function
    Michael Mathioudakis 78
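A sketch of partitionBy (the partition count and hash function are arbitrary choices here): pre-partitioning a pair RDD that is reused in reduceByKey or join can avoid repeated shuffles.

    import pyspark

    sc = pyspark.SparkContext(master="local", appName="partitioning-sketch")
    pairRDD = sc.parallelize([(x % 10, x) for x in range(100)])
    # hash-partition into 4 partitions; persist so the partitioned layout is reused
    partitioned = pairRDD.partitionBy(4, partitionFunc=hash).persist()
    print(partitioned.getNumPartitions())                         # 4
    print(partitioned.reduceByKey(lambda a, b: a + b).collect()[:3])
    sc.stop()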
  • 79. working on a per-partition basis
    spark provides operations that operate at the partition level, e.g., mapPartitions
    rdd = sc.parallelize(range(100), 4)
    def f(iterator):
        yield sum(iterator)
    rdd.mapPartitions(f).collect()
    used in the implementation of Spark
    Michael Mathioudakis 79
  • 80. see the implementation of spark at https://github.com/apache/spark/
    Michael Mathioudakis 80
  • 81. references
    1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10.
    2. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
    3. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.
    4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
    5. Spark implementation: https://github.com/apache/spark/
    6. "Making Big Data Processing Simple with Spark," Matei Zaharia, https://youtu.be/d9D-Z3-44F8
    Michael Mathioudakis 81