1. Spark: Making Big Data Analytics Interactive and Real-Time
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
spark-project.org | UC BERKELEY
2. Overview
Spark is a parallel framework that provides:
» Efficient primitives for in-memory data sharing
» Simple APIs in Scala, Java, SQL
» High generality (applicable to many emerging problems)
This talk will cover:
» What it does
» How people are using it (including some surprises)
» Current research
3. Motivation
MapReduce simplified data analysis on large, unreliable clusters. But as soon as it got popular, users wanted more:
» More complex, multi-pass applications (e.g. machine learning, graph algorithms)
» More interactive ad-hoc queries
» More real-time stream processing
One reaction: specialized models for some of these apps (e.g. Pregel, Storm)
4. Motivation
Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks: efficient primitives for data sharing.
5. Examples
[Diagram: an iterative job writes to and reads from HDFS between every iteration; interactive queries each re-read the same input from HDFS]
Slow due to replication and disk I/O, but necessary for fault tolerance
6. Goal: Sharing at Memory Speed
[Diagram: iterations and queries share the input in memory after one-time processing]
10-100× faster than network/disk, but how to get fault tolerance?
7. Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?
8. Existing Systems
Existing in-memory storage systems have interfaces based on fine-grained updates
» Reads and writes to cells in a table
» E.g. databases, key-value stores, distributed memory
Requires replicating data or logs across nodes for fault tolerance ⇒ expensive!
» 10-100× slower than a memory write
9. Solution: Resilient Distributed Datasets (RDDs)
Provide an interface based on coarse-grained operations (map, group-by, join, …)
Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
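To make the lineage idea concrete, here is a minimal, self-contained Scala sketch (toy classes, not Spark's internals): each dataset records the coarse-grained operation and the parent that produced it, so a lost partition is simply recomputed rather than restored from a replica.

    // Illustrative sketch of lineage-based recovery (not Spark's real classes).
    sealed trait ToyRDD[T] {
      // Recompute one partition purely from lineage; no replication needed.
      def compute(partition: Int): Seq[T]
      def map[U](f: T => U): ToyRDD[U] = Mapped(this, f)
    }

    case class Source[T](partitions: Vector[Seq[T]]) extends ToyRDD[T] {
      def compute(p: Int): Seq[T] = partitions(p)   // stable input (e.g. HDFS)
    }

    case class Mapped[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U] {
      // One logged operation applies to every element of the parent partition.
      def compute(p: Int): Seq[U] = parent.compute(p).map(f)
    }

    object LineageDemo extends App {
      val base    = Source(Vector(Seq(1, 2), Seq(3, 4)))
      val doubled = base.map(_ * 2)
      // If the node holding partition 1 fails, just recompute that partition:
      println(doubled.compute(1))   // List(6, 8)
    }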
10. RDD Recovery
[Diagram: the same iterative and interactive workloads as slide 6, with lost in-memory data rebuilt from the input via lineage]
11. Generality of RDDs
RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items
Capture many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …
Allow these models to be composed
12. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
13. Spark Programming Interface
Language-integrated API in Scala*
Provides:
» Resilient distributed datasets (RDDs)
  • Partitioned collections with controllable caching
» Operations on RDDs
  • Transformations (define RDDs), actions (compute results)
» Restricted shared variables (broadcast, accumulators)
*Also Java, SQL and soon Python
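As an aside, a small sketch of the two restricted shared variables, written against the classic Spark API; it assumes an existing SparkContext named sc, and the path and data are made up.

    // Broadcast: read-only value shipped once per node. Accumulator: workers
    // add to it, only the driver reads the total.
    val lookup = sc.broadcast(Map("INFO" -> 0, "ERROR" -> 1))
    val blanks = sc.accumulator(0)

    val lines = sc.textFile("hdfs://...")
    lines.foreach(l => if (l.isEmpty) blanks += 1)   // updated inside an action
    val levels = lines.map(l => lookup.value.getOrElse(l.takeWhile(_ != ' '), -1))
    println("blank lines: " + blanks.value + ", errors: " + levels.filter(_ == 1).count())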
14. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")          // base RDD
    errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.persist()
    cachedMsgs.filter(_.contains("foo")).count    // action
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Diagram: the driver ships tasks to workers; each worker reads a block of the input and caches its partition of the messages RDD]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
15. Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

Lineage chain: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
16. Fault Recovery Results
[Chart: iteration time (s) for iterations 1-10; the first iteration takes 119 s, the iteration where the failure happens takes 81 s, and the remaining iterations hold steady at 56-59 s]
17. Example: Logistic Regression
Goal: find the best line separating two sets of points
[Diagram: scattered + and - points, with a random initial line and the target separating line]
18. Example: Logistic Regression

    val data = spark.textFile(...).map(readPoint).persist()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)
19. Logistic Regression Performance
[Chart: running time (min) vs number of iterations (1-30); Hadoop takes 110 s per iteration, while Spark takes 80 s for the first iteration and 6 s for each further iteration]
20. Example: Collaborative Filtering
Goal: predict users' movie ratings based on past ratings of other movies
[Matrix R: movies × users, with a few known ratings (1-5) and many missing entries marked "?"]
21. Model and Algorithm
Model R as the product of user and movie feature matrices A and B of size U×K and M×K: R ≈ A·Bᵀ
Alternating Least Squares (ALS)
» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged
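For reference, the objective ALS minimizes can be written as follows (the standard formulation, with regularization omitted; a_u and b_m denote the rows of A and B):

    \min_{A,\,B} \; \sum_{(u,m)\ \text{observed}} \bigl( R_{um} - a_u^{\top} b_m \bigr)^2

Each alternating step holds one matrix fixed, which turns the problem into an independent least-squares solve per user (or per movie): exactly the updateUser/updateMovie calls on the next slide.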
22. Serial ALS

    var R = readRatingsMatrix(...)
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = (0 until U).map(i => updateUser(i, B, R))
      B = (0 until M).map(i => updateMovie(i, A, R))
    }

(0 until U) and (0 until M) are Range objects
23. Naïve Spark ALS

    var R = readRatingsMatrix(...)
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R))
               .collect()
    }

Problem: R is re-sent to all nodes in each iteration
24. Efficient Spark ALS

    var R = spark.broadcast(readRatingsMatrix(...))
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R.value))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R.value))
               .collect()
    }

Solution: mark R as a broadcast variable. Result: 3× performance improvement
25. Scaling Up Broadcast
[Charts: iteration time (s), split into communication and computation, vs number of machines (10-90), for the initial HDFS-based broadcast and for Cornet P2P broadcast; Cornet keeps communication time far lower as machines are added]
[Chowdhury et al, SIGCOMM 2011]
26. Other RDD Operations
Transformations (define a new RDD): map, flatMap, filter, sample, groupByKey, reduceByKey, sortByKey, union, join, cogroup, cross, …
Actions (return a result to the driver program): collect, reduce, count, save, …
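A short, hedged example chaining several of the operators above (classic Scala API; the paths and record formats are invented for illustration: clicks are "userId,url" lines, users are "userId,name" lines).

    val clicks = sc.textFile("hdfs://.../clicks").map(l => (l.split(",")(0), 1))
    val users  = sc.textFile("hdfs://.../users").map(l => (l.split(",")(0), l))

    val perUser = clicks.reduceByKey(_ + _)      // transformation: count per key
    val joined  = perUser.join(users)            // transformation: (id, (count, record))
    val result  = joined.sortByKey().collect()   // action: materialize at the driver
    result.take(5).foreach(println)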
27. Spark in Java
Scala: lines.filter(_.contains("error")).count()
Java:

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) {
        return s.contains("error");
      }
    }).count();
28. Spark in Python (Coming Soon!)

    lines = sc.textFile(sys.argv[1])
    counts = (lines.flatMap(lambda x: x.split(" "))
                   .map(lambda x: (x, 1))
                   .reduceByKey(lambda x, y: x + y))
29. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
30. Spark Users
400+ user meetup, 20+ contributors
31. User Applications
Crowdsourced traffic estimation (Mobile Millennium)
Video analytics & anomaly detection (Conviva)
Ad-hoc queries from web app (Quantifind)
Twitter spam classification (Monarch)
DNA sequence analysis (SNAP)
…
32. Mobile Millennium Project
Estimate city traffic from GPS-equipped vehicles (e.g. SF taxis)
33. Sample Data
[Map of GPS observations]
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
34. Challenge
Data is noisy and sparse (1 sample/minute)
Must infer the path taken by each vehicle in addition to the travel time distribution on each link
35. Challenge
[Repeat of slide 34, as an animation build]
36. Solution
EM algorithm to estimate paths and travel time distributions simultaneously
[Pipeline: observations → flatMap → weighted path samples → groupByKey → link parameters → broadcast, feeding back into the next E step]
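A hedged Scala sketch of that loop: inferPaths and fitLinkTimes are hypothetical helpers standing in for the Mobile Millennium code, observations is assumed to be an RDD of parsed GPS points, and samples are assumed to be keyed by road link.

    var linkParams = sc.broadcast(initialLinkParams)
    for (step <- 1 to EM_STEPS) {
      // E step: expand each observation into (linkId, weightedPathSample) pairs
      val samples = observations.flatMap(obs => inferPaths(obs, linkParams.value))
      // M step: regroup samples by road link, refit travel-time distributions
      val newParams = samples.groupByKey().mapValues(fitLinkTimes).collectAsMap()
      linkParams = sc.broadcast(newParams)  // ship updated parameters for the next E step
    }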
37. Results [Hunter et al, SOCC 2011]
3× speedup from caching, 4.5× from broadcast
38. Conviva GeoReport
[Bar chart, time in hours: Hive 20, Spark 0.5]
SQL aggregations on many keys with the same filter
40× gain over Hive from avoiding repeated I/O, deserialization and filtering
39. Other Programming Models
Pregel on Spark (Bagel)
» 200 lines of code
Iterative MapReduce
» 200 lines of code
Hive on Spark (Shark)
» 5000 lines of code
» Compatible with Apache Hive
» Machine learning ops. in Scala
40. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
41. Implementation
Runs on the Apache Mesos cluster manager to coexist with Hadoop, MPI, …
[Diagram: Spark, Hadoop, MPI, … running on Mesos across nodes]
Supports any Hadoop storage system (HDFS, HBase, …)
Easy local mode and EC2 launch scripts
No changes to Scala
42. Task Scheduler
Runs general DAGs
Pipelines functions within a stage
Cache-aware data reuse & locality
Partitioning-aware to avoid shuffles
[Diagram: a job DAG over RDDs A-G, cut into three stages at the groupBy, join and union boundaries, with cached data partitions marked]
43. Language Integration
Scala closures are Serializable Java objects
» Serialize on master, load & run on workers
Not quite enough
» Nested closures may reference the entire outer scope, pulling in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection
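A small illustration of the problem, using ordinary Scala and Java serialization (this is the failure mode, not Spark's actual closure cleaner):

    import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

    // Analysis is not Serializable, so it must never be shipped to workers.
    class Analysis {
      val threshold = 10
      // Reads only `threshold`, but a field access compiles to this.threshold,
      // so the closure captures the whole Analysis object.
      val shipped: Int => Int = x => x + threshold
    }

    object ClosureDemo extends App {
      val f = new Analysis().shipped
      try new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f)
      catch {
        case e: NotSerializableException =>
          println("cannot ship closure, captured: " + e.getMessage) // ...Analysis
      }
      // Spark's remedy: bytecode analysis finds the fields a closure really
      // reads; reflection then nulls out the unused references before shipping.
    }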
44. Interactive Spark
Modified the Scala interpreter to allow Spark to be used interactively from the command line
» Track variables that each line depends on
» Ship generated classes to workers
Enables in-memory exploration of big data
45. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
46. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
47. Motivation
Many "big data" apps need to work in real time
» Site statistics, spam filtering, intrusion detection, …
To scale to 100s of nodes, need:
» Fault-tolerance: for both crashes and stragglers
» Efficiency: don't consume many resources beyond base processing
Challenging in existing streaming systems
48. Traditional Streaming Systems
Continuous processing model
» Each node has long-lived mutable state
» For each input record, update state & push new records downstream
[Diagram: input records pushed through nodes 1-3, each holding mutable state]
49. Traditional Streaming Systems
Fault tolerance via replication or upstream backup
[Diagram: left, replication, with each node mirrored by a synchronized replica (node 1', 2', 3'); right, upstream backup, with inputs buffered and replayed to a single standby]
50. Traditional Streaming Systems
Replication: fast recovery, but 2× hardware cost. Upstream backup: only need one standby, but slow to recover
51. Traditional Streaming Systems
Neither approach can handle stragglers [Borealis, Flux; Hwang et al, 2005]
52. Observation
Batch processing models, such as MapReduce, do provide fault tolerance efficiently
» Divide job into deterministic tasks
» Rerun failed/slow tasks in parallel on other nodes
Idea: run streaming computations as a series of small, deterministic batch jobs
» Same recovery schemes at a much smaller timescale
» To make latency low, store state in RDDs
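A minimal single-machine sketch of the idea, with plain Scala collections standing in for RDDs and nextBatch standing in for one interval's worth of newly arrived records:

    object MicroBatchDemo extends App {
      def nextBatch(): Seq[String] = Seq("spark", "storm", "spark") // stand-in source
      var counts = Map.empty[String, Int]   // running state; in Spark, an RDD

      for (t <- 1 to 3) {                   // three 1-second intervals
        val batch = nextBatch()             // immutable input for interval t
        // A small deterministic batch job: count words in this interval.
        val delta = batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }
        // Merge into state; recomputing a lost piece just reruns this job.
        counts = delta.foldLeft(counts) { case (m, (w, c)) =>
          m.updated(w, m.getOrElse(w, 0) + c)
        }
        println(s"t=$t: $counts")
        Thread.sleep(1000)
      }
    }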
53. Discretized Stream Processing
[Diagram: at t = 1, the interval's input (an immutable dataset, stored reliably) is pulled into a batch operation, producing an immutable output or state dataset stored as an RDD; the same repeats at t = 2, …]
54. Fault Recovery
Checkpoint state RDDs periodically
If a node fails/straggles, rebuild lost RDD partitions in parallel on other nodes
[Diagram: map from input dataset to output dataset, with lost partitions recomputed in parallel]
Faster recovery than upstream backup, without the cost of replication
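Extending the micro-batch sketch above, checkpointing amounts to snapshotting the state every few intervals so that recovery replays only the batches since the last snapshot (illustrative only, not Spark Streaming's API):

    object CheckpointDemo {
      // Last snapshot: (interval it was taken at, state at that interval).
      var checkpoint: (Int, Map[String, Int]) = (0, Map.empty)

      // Rebuild state at interval t by deterministically re-applying the
      // batch jobs since the snapshot; batchFor replays stored input.
      def recoverStateAt(t: Int, batchFor: Int => Seq[String]): Map[String, Int] = {
        val (t0, snap) = checkpoint
        (t0 + 1 to t).foldLeft(snap) { (state, i) =>
          batchFor(i).foldLeft(state)((m, w) => m.updated(w, m.getOrElse(w, 0) + 1))
        }
      }
    }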
55. How Fast Can It Go?
Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency
[Charts: max throughput (millions of records/s) vs cluster size for Grep and Top K Words, under 1 s and 2 s latency targets; both scale near-linearly to 100 nodes]
56. Comparison with Storm
[Charts: MB/s per node vs record size (100-1000 bytes) for Grep and Top K Words; Spark sustains several times Storm's per-node throughput]
Storm limited to 100K records/s/node; also tried S4: 10K records/s/node (both lack Spark's FT guarantees)
Commercial systems: O(500K) records/s total
57. How Fast Can It Recover?
Recovers from faults/stragglers within 1 second
[Chart: interval processing time (s) over a 25 s run of sliding WordCount on 20 nodes with a 10 s checkpoint interval; processing time spikes briefly when the failure happens, then returns to normal within about a second]
58. Programming Interface
Extension to Spark: Spark Streaming
» All Spark operators plus new "stateful" ones

    // Running count of pageviews by URL (a "discretized stream", or D-stream)
    views = readStream("http:...", "1s")
    ones = views.map(ev => (ev.url, 1))
    counts = ones.runningReduce(_ + _)

[Diagram: at each interval t, the views, ones and counts D-streams advance by one RDD (one partition per node), linked by map and reduce]
59. Incremental Operators

    words.reduceByWindow("5s", max)         // associative function
    words.reduceByWindow("5s", _+_, _-_)    // associative & invertible

[Diagram: with only an associative function, each sliding window recomputes the max over all its intervals; with an invertible function, each new window count is the previous count plus the newest interval minus the interval that dropped out]
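A toy illustration of why invertibility matters: with an inverse available, each slide of the window costs O(1) instead of a recompute over the whole window (plain Scala, not the real operator):

    object WindowDemo extends App {
      val perInterval = Seq(3, 1, 4, 1, 5, 9, 2)   // counts for intervals t, t+1, ...
      val window = 5                               // window length in intervals

      var sum = perInterval.take(window).sum       // first full window: O(window)
      println(sum)
      for (t <- window until perInterval.length) {
        // new window = old window + newest interval - oldest departed interval
        sum = sum + perInterval(t) - perInterval(t - window)
        println(sum)
      }
    }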
60. Applications
[Charts: Conviva video dashboard (>50 session-level metrics), active video sessions (millions) scaling to ~4M on 64 nodes; Mobile Millennium traffic estimation (online EM algorithm), GPS observations/sec scaling to ~2000 on 80 nodes]
61. Unifying Streaming and Batch
D-streams and RDDs can seamlessly be combined
» Same execution and fault recovery models
Enables powerful features:
» Combining streams with historical data (used in the Mobile Millennium application):

    pageViews.join(historicCounts).map(...)

» Interactive ad-hoc queries on stream state (used in the Conviva app):

    pageViews.slice("21:00", "21:05").topK(10)
62. Benefits of a Unified Stack
Write each algorithm only once
Reuse data across streaming & batch jobs
Query stream state instead of waiting for import
Some users were doing this manually!
» Conviva anomaly detection, Quantifind dashboard
63. Conclusion
"Big data" is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
RDDs offer fault-tolerant sharing at memory speed
Spark uses them to combine streaming, batch & interactive analytics in one system
www.spark-project.org
64. Related Work
DryadLINQ, FlumeJava
» Similar "distributed collection" API, but cannot reuse datasets efficiently across queries
GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes requiring replication or checkpoints
Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern
Relational databases
» Lineage/provenance, logical logging, materialized views
Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached
65. Behavior with Not Enough RAM
[Chart: iteration time (s) vs % of working set in memory; cache disabled: 68.8, 25%: 58.1, 50%: 40.7, 75%: 29.7, fully cached: 11.5]
66. RDDs for Debugging
Debugging general distributed apps is very hard
However, Spark and other recent frameworks run deterministic tasks for fault tolerance
Leverage this determinism for debugging:
» Log lineage for all RDDs created (small)
» Let the user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions