Making Big Data Analytics Interactive and Real-Time
Transcript

  • 1. Spark: Making Big Data Analytics Interactive and Real-Time
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
spark-project.org | UC Berkeley
  • 2. Overview
Spark is a parallel framework that provides:
» Efficient primitives for in-memory data sharing
» Simple APIs in Scala, Java, SQL
» High generality (applicable to many emerging problems)
This talk will cover:
» What it does
» How people are using it (including some surprises)
» Current research
  • 3. Motivation
MapReduce simplified data analysis on large, unreliable clusters. But as soon as it got popular, users wanted more:
» More complex, multi-pass applications (e.g. machine learning, graph algorithms)
» More interactive ad-hoc queries
» More real-time stream processing
One reaction: specialized models for some of these apps (e.g. Pregel, Storm)
  • 4. Motivation
Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks: efficient primitives for data sharing.
  • 5. Examples
[Diagram: an iterative job does an HDFS read and HDFS write around every iteration; interactive use re-reads the input from HDFS for each of query 1, 2, 3, …]
Slow due to replication and disk I/O, but necessary for fault tolerance.
  • 6. Goal: Sharing at Memory Speed
[Diagram: the same iterative job and queries, sharing data in memory; the input is processed one time and kept resident.]
10-100× faster than network/disk, but how do we get fault tolerance?
  • 7. Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?
  • 8. Existing Systems
Existing in-memory storage systems have interfaces based on fine-grained updates:
» Reads and writes to cells in a table
» E.g. databases, key-value stores, distributed memory
Requires replicating data or logs across nodes for fault tolerance ⇒ expensive!
» 10-100× slower than a memory write…
  • 9. Solution: Resilient Distributed Datasets (RDDs)
Provide an interface based on coarse-grained operations (map, group-by, join, …)
Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
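To make the contrast with fine-grained updates concrete, here is a minimal sketch using the textFile/map/persist API introduced later in this talk (the HDFS path is a placeholder). Spark logs only the short chain of coarse-grained operations, so a lost in-memory partition is rebuilt by re-running them on the corresponding input blocks:

    val pairs  = spark.textFile("hdfs://...")                 // base RDD on stable storage
                      .map(line => (line.split('\t')(0), 1))  // coarse-grained op, logged once
    val counts = pairs.reduceByKey(_ + _).persist()           // keep result partitions in memory

    counts.count()  // if a node fails, its partitions of `counts` are recomputed
                    // from the surviving input blocks via the logged lineage;
                    // nothing is replicated, so there is no cost if nothing fails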
  • 10. RDD Recovery
[Diagram: the memory-sharing workloads from slide 6 again, with lost in-memory partitions recomputed from the input rather than restored from replicas.]
  • 11. Generality of RDDs
RDDs can express surprisingly many parallel algorithms:
» These naturally apply the same operation to many items
Capture many current programming models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …
Allow these models to be composed
  • 12. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
  • 13. Spark Programming Interface
Language-integrated API in Scala*
Provides:
» Resilient distributed datasets (RDDs)
  • Partitioned collections with controllable caching
» Operations on RDDs
  • Transformations (define RDDs), actions (compute results)
» Restricted shared variables (broadcast, accumulators)
*Also Java, SQL, and soon Python
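A minimal sketch of these three pieces, assuming a SparkContext named `sc` and the early Scala API for accumulators:

    val nums  = sc.parallelize(1 to 1000000)   // an RDD: a partitioned collection
    val evens = nums.filter(_ % 2 == 0)        // transformation: defines a new RDD, runs nothing yet
    val total = evens.reduce(_ + _)            // action: triggers computation, returns a result

    val divisor = sc.broadcast(7)              // broadcast: read-only value shipped to workers once
    val hits    = sc.accumulator(0)            // accumulator: tasks can only add to it
    nums.foreach(n => if (n % divisor.value == 0) hits += 1)
    println(hits.value)                        // read the total back on the driver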
  • 14. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")           // base RDD
    errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.persist()

    cachedMsgs.filter(_.contains("foo")).count     // action
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Diagram: the driver ships tasks to workers; each worker builds and caches its partition of the messages from its HDFS blocks and returns results.]
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
  • 15. Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
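Each transformation returns a new RDD object that points at its parent, so the chain above is literally the data structure Spark keeps. As a hedged aside, later Spark releases let you inspect this chain directly:

    val messages = sc.textFile("hdfs://...")
                     .filter(_.contains("error"))
                     .map(_.split('\t')(2))

    // toDebugString (available in later Spark releases) prints the lineage,
    // e.g. MappedRDD <- FilteredRDD <- HadoopRDD; this small graph is all
    // that must be logged to rebuild any lost partition of `messages`.
    println(messages.toDebugString)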
  • 16. Fault Recovery Results
[Chart: iteration time (s) over 10 iterations of a job; most iterations take 57-59 s, while the iteration where a failure happens rises to 81-119 s as lost partitions are rebuilt from lineage, after which times return to normal.]
  • 17. Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Diagram: + and - points in the plane, a random initial line, and the target separating line.]
  • 18. Example: Logistic Regression

    val data = spark.textFile(...).map(readPoint).persist()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)
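The snippet assumes a `readPoint` parser and a dense `Vector` class that the talk does not show. One plausible, purely hypothetical way to fill them in so the loop above runs as written:

    import scala.math.exp
    import scala.util.Random

    // Hypothetical helpers, not from the talk:
    case class Vector(elems: Array[Double]) {
      def dot(o: Vector): Double = elems.zip(o.elems).map { case (a, b) => a * b }.sum
      def *(s: Double): Vector   = Vector(elems.map(_ * s))
      def +(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a + b })
      def -(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a - b })
    }
    object Vector {
      def random(d: Int): Vector = Vector(Array.fill(d)(2 * Random.nextDouble - 1))
    }
    // Allows the slide's scalar-first `... * p.y * p.x` ordering:
    implicit class Scaled(s: Double) { def *(v: Vector): Vector = v * s }

    case class Point(x: Vector, y: Double)  // y is the label, +1 or -1

    // Parse "y x1 x2 ... xD" into a labeled point:
    def readPoint(line: String): Point = {
      val nums = line.split(' ').map(_.toDouble)
      Point(Vector(nums.tail), nums.head)
    }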
  • 19. Logistic Regression Performance
[Chart: running time (min) vs number of iterations, 1-30. Hadoop takes ~110 s per iteration; Spark takes 80 s for the first iteration (loading the data) and only 6 s per further iteration.]
  • 20. Example: Collaborative Filtering
Goal: predict users' movie ratings based on past ratings of other movies.

            movies →
    R  =  [ 1  ?  ?  4  5  ?  3 ]
          [ ?  ?  3  5  ?  ?  3 ]  users
          [ 5  ?  5  ?  ?  ?  1 ]
          [ 4  ?  ?  ?  ?  2  ? ]
  • 21. Model and Algorithm
Model R as the product of user and movie feature matrices A and B of size U×K and M×K: R = A Bᵀ
Alternating Least Squares (ALS):
» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged
  • 22. Serial ALS

    var R = readRatingsMatrix(...)
    var A = // array of U random vectors
    var B = // array of M random vectors

    for (i <- 1 to ITERATIONS) {
      A = (0 until U).map(i => updateUser(i, B, R))   // Range objects
      B = (0 until M).map(i => updateMovie(i, A, R))
    }
  • 23. Naïve Spark ALS

    var R = readRatingsMatrix(...)
    var A = // array of U random vectors
    var B = // array of M random vectors

    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R))
               .collect()
    }

Problem: R is re-sent to all nodes in each iteration.
  • 24. Efficient Spark ALS

    var R = spark.broadcast(readRatingsMatrix(...))   // solution: mark R as a broadcast variable
    var A = // array of U random vectors
    var B = // array of M random vectors

    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R.value))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R.value))
               .collect()
    }

Result: 3× performance improvement.
  • 25. Scaling Up Broadcast
[Charts: iteration time, split into communication and computation, on 10-90 machines. With the initial HDFS-based broadcast, communication time grows sharply with cluster size; with Cornet P2P broadcast it stays small.] [Chowdhury et al, SIGCOMM 2011]
  • 26. Other RDD Operations
Transformations (define a new RDD): map, flatMap, filter, sample, groupByKey, reduceByKey, sortByKey, union, join, cogroup, cross, …
Actions (return a result to the driver program): collect, reduce, count, save, …
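A small sketch exercising a few of the listed operations (hypothetical data, SparkContext `sc`):

    val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                    ("about.html", "3.4.5.6"),
                                    ("index.html", "1.3.3.1")))
    val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                       ("about.html", "About")))

    // join: pairs up values with the same key across two RDDs
    visits.join(pageNames).collect()
    // e.g. Array((about.html,(3.4.5.6,About)), (index.html,(1.2.3.4,Home)), ...)

    // reduceByKey + sortByKey: visit counts per page, sorted by page name
    visits.map { case (page, _) => (page, 1) }
          .reduceByKey(_ + _)
          .sortByKey()
          .collect()   // Array((about.html,1), (index.html,2))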
  • 27. Spark in Java

    // Scala:
    lines.filter(_.contains("error")).count()

    // Java equivalent:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) {
        return s.contains("error");
      }
    }).count();
  • 28. Spark in Python (Coming Soon!)

    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
  • 29. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
  • 30. Spark Users
400+ person user meetup, 20+ contributors
  • 31. User Applications
Crowdsourced traffic estimation (Mobile Millennium)
Video analytics & anomaly detection (Conviva)
Ad-hoc queries from a web app (Quantifind)
Twitter spam classification (Monarch)
DNA sequence analysis (SNAP)
…
  • 32. Mobile Millennium Project
Estimate city traffic from GPS-equipped vehicles (e.g. SF taxis)
  • 33. Sample Data
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
  • 34. Challenge
Data is noisy and sparse (1 sample/minute). Must infer the path taken by each vehicle in addition to the travel-time distribution on each link.
  • 35. Challenge
[Same text as slide 34, repeated as a build slide.]
  • 36. Solution
EM algorithm to estimate paths and travel-time distributions simultaneously.
[Dataflow: observations → flatMap → weighted path samples → groupByKey → link parameters → broadcast back into the next round.]
  • 37. Results
3× speedup from caching, 4.5× from broadcast [Hunter et al, SOCC 2011]
  • 38. Conviva GeoReport
[Chart: query time of 20 hours on Hive vs 0.5 hours on Spark.]
SQL aggregations on many keys with the same filter; 40× gain over Hive from avoiding repeated I/O, deserialization and filtering.
  • 39. Other Programming Models
Pregel on Spark (Bagel)
» 200 lines of code
Iterative MapReduce
» 200 lines of code
Hive on Spark (Shark)
» 5000 lines of code
» Compatible with Apache Hive
» Machine learning ops. in Scala
  • 40. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
  • 41. Implementation
Runs on the Apache Mesos cluster manager to coexist with Hadoop
[Diagram: Spark, Hadoop, MPI, … sharing nodes on top of Mesos.]
Supports any Hadoop storage system (HDFS, HBase, …)
Easy local mode and EC2 launch scripts
No changes to Scala
  • 42. Task Scheduler
Runs general DAGs
Pipelines functions within a stage
Cache-aware data reuse & locality
Partitioning-aware to avoid shuffles (see the sketch below)
[Diagram: a DAG over RDDs A-G cut into three stages at the groupBy, join and union boundaries, with cached data partitions marked.]
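A sketch of what "partitioning-aware" buys, using the HashPartitioner/partitionBy API of later open-source releases; `rawVisits`, `userProfiles` and the `userId` field are hypothetical. If one side of a join is already hash-partitioned and cached, the scheduler can join against it without reshuffling that side:

    import org.apache.spark.HashPartitioner

    // Hash-partition the large RDD once and cache it in that layout:
    val userVisits = rawVisits.map(v => (v.userId, v))
                              .partitionBy(new HashPartitioner(64))
                              .persist()

    // Later joins against userVisits reuse its partitioning, so only the
    // (smaller) other side is shuffled across the network:
    val joined = userVisits.join(userProfiles)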
  • 43. Language Integration
Scala closures are Serializable Java objects:
» Serialize on master, load & run on workers
Not quite enough:
» Nested closures may reference the entire outer scope, pulling in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection
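A hypothetical sketch of the problem the second bullet describes, and the idiomatic workaround (class and names invented for illustration; the RDD import is from later open-source releases):

    import org.apache.spark.rdd.RDD

    class LogSearcher(val pattern: String) {
      val monitor = new Object   // not Serializable

      def search(lines: RDD[String]): RDD[String] = {
        // Writing `lines.filter(_.contains(pattern))` would really capture
        // `this.pattern`, dragging the whole LogSearcher (including the
        // non-Serializable monitor) into the closure. Copying the field to
        // a local val captures only the String; Spark's bytecode analysis
        // nulls out unused outer references to the same effect.
        val p = pattern
        lines.filter(_.contains(p))
      }
    }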
  • 44. Interactive Spark
Modified the Scala interpreter to allow Spark to be used interactively from the command line:
» Track the variables that each line depends on
» Ship generated classes to workers
Enables in-memory exploration of big data
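A hedged sketch of what such a session looks like (prompt and output are illustrative, not captured from a real run):

    scala> val file = sc.textFile("hdfs://.../logs")
    scala> val words = file.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    scala> words.persist()
    scala> words.filter(_._2 > 100).count()
    res3: Long = 42

Each typed line is compiled to a class whose dependencies are tracked, and the generated classes are shipped to the workers; that is what the two bullets above implement.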
  • 45. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
  • 46. Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
  • 47. Motivation
Many "big data" apps need to work in real time:
» Site statistics, spam filtering, intrusion detection, …
To scale to 100s of nodes, need:
» Fault tolerance: for both crashes and stragglers
» Efficiency: don't consume many resources beyond base processing
Challenging in existing streaming systems
  • 48. Traditional Streaming Systems
Continuous processing model:
» Each node has long-lived mutable state
» For each input record, update state & push new records downstream
[Diagram: input records pushed through nodes 1-3, each holding mutable state.]
  • 49. Traditional Streaming Systems
Fault tolerance via replication or upstream backup:
[Diagram, left: replication, where each of nodes 1-3 has a synchronized standby node 1'-3'. Right: upstream backup, where inputs are buffered and replayed to a single standby on failure.]
  • 50. Traditional Streaming Systems
Replication: fast recovery, but 2× hardware cost. Upstream backup: only needs one standby, but slow to recover.
  • 51. Traditional Streaming Systems
Neither approach can handle stragglers. [Borealis, Flux] [Hwang et al, 2005]
  • 52. Observation
Batch processing models, such as MapReduce, do provide fault tolerance efficiently:
» Divide the job into deterministic tasks
» Rerun failed/slow tasks in parallel on other nodes
Idea: run streaming computations as a series of small, deterministic batch jobs:
» Same recovery schemes at a much smaller timescale
» To keep latency low, store state in RDDs
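An illustrative driver-side sketch of that idea, not the real Spark Streaming API (`recordsForInterval` is a hypothetical source): each short interval becomes an ordinary deterministic batch job, and the running state is just another RDD.

    import org.apache.spark.rdd.RDD

    // State after each interval is an immutable RDD, so failed or slow
    // tasks can be re-run exactly as in batch processing.
    var counts: RDD[(String, Long)] = sc.parallelize(Seq.empty[(String, Long)])

    while (true) {
      val batch: Seq[String] = recordsForInterval("1s")    // hypothetical source
      val increment = sc.parallelize(batch)
                        .map(url => (url, 1L))
                        .reduceByKey(_ + _)
      counts = counts.union(increment).reduceByKey(_ + _)  // new immutable state RDD
      counts.persist()
      // (a real system would also checkpoint and unpersist old state)
    }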
  • 53. Discretized Stream Processing
[Diagram: at t = 1, input (stored reliably) is pulled into an immutable dataset; a batch operation produces another immutable dataset (output or state), stored as an RDD. The same repeats at t = 2, …]
  • 54. Fault Recovery
Checkpoint state RDDs periodically. If a node fails or straggles, rebuild the lost RDD partitions in parallel on other nodes.
[Diagram: a map from an input dataset to an output dataset, with lost partitions recomputed in parallel.]
Faster recovery than upstream backup, without the cost of replication.
  • 55. How Fast Can It Go?
Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency.
[Charts: maximum throughput (millions of records/s) vs cluster size for Grep and Top K Words under 1 s and 2 s latency targets; throughput scales near-linearly up to 100 nodes.]
  • 56. Comparison with Storm
[Charts: MB/s per node vs record size (100-1000 bytes) for Grep and Top K Words; Spark sustains several times Storm's per-node throughput, and Storm lacks Spark's fault-tolerance guarantees.]
Storm is limited to 100K records/s/node. Also tried S4: 10K records/s/node. Commercial systems: O(500K) total.
  • 57. How Fast Can It Recover?
Recovers from faults/stragglers within 1 second.
[Chart: interval processing time over a 25 s run of sliding WordCount on 20 nodes with a 10 s checkpoint interval; processing time briefly rises when the failure happens, then returns to normal within about a second.]
  • 58. Programming Interface
Extension to Spark: Spark Streaming
» All Spark operators plus new "stateful" ones

    // Running count of pageviews by URL
    views = readStream("http:...", "1s")   // views is a "discretized stream" (D-stream)
    ones = views.map(ev => (ev.url, 1))
    counts = ones.runningReduce(_ + _)

[Diagram: at each interval t = 1, 2, …, the interval's views RDD is mapped to ones and reduced into counts; each dataset is an RDD made of partitions.]
  • 59. Incremental Operators

    words.reduceByWindow("5s", max)        // associative function
    words.reduceByWindow("5s", _+_, _-_)   // associative & invertible

[Diagram: with an associative-only function, each 5 s sliding max is recomputed from every interval in the window; with an associative & invertible function, each sliding count is updated by adding the interval entering the window and subtracting the one leaving it.]
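The difference between the two forms, shown in plain-Scala arithmetic rather than the Spark Streaming internals: max is associative but not invertible, so each window must be recomputed in full, while a sum can be slid incrementally (values below are hypothetical per-interval counts).

    // Per-interval word counts over 1 s intervals:
    val perInterval = Seq(3L, 1L, 4L, 1L, 5L, 9L)

    // Associative only (like max): recompute each 5-interval window fully.
    val w0 = perInterval.slice(0, 5).sum                // 3+1+4+1+5 = 14

    // Associative & invertible (_+_ with inverse _-_): slide incrementally.
    def slide(old: Long, entering: Long, leaving: Long): Long =
      old + entering - leaving
    val w1 = slide(w0, perInterval(5), perInterval(0))  // 14 + 9 - 3 = 20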
  • 60. Applications
[Charts: Conviva video dashboard, active video sessions handled (up to ~4 million, >50 session-level metrics) scaling with cluster size up to 64 nodes; Mobile Millennium traffic estimation, GPS observations/sec (up to ~2000, online EM algorithm) scaling up to 80 nodes.]
  • 61. Unifying Streaming and Batch
D-streams and RDDs can seamlessly be combined:
» Same execution and fault recovery models
Enables powerful features:
» Combining streams with historical data (used in the Mobile Millennium application):

    pageViews.join(historicCounts).map(...)

» Interactive ad-hoc queries on stream state (used in the Conviva application):

    pageViews.slice("21:00", "21:05").topK(10)
  • 62. Benefits of a Unified Stack
Write each algorithm only once
Reuse data across streaming & batch jobs
Query stream state instead of waiting for import
Some users were doing this manually!
» Conviva anomaly detection, Quantifind dashboard
  • 63. Conclusion
"Big data" is moving beyond one-pass batch jobs, to low-latency apps that need data sharing. RDDs offer fault-tolerant sharing at memory speed. Spark uses them to combine streaming, batch & interactive analytics in one system.
www.spark-project.org
  • 64. Related Work
DryadLINQ, FlumeJava
» Similar "distributed collection" API, but cannot reuse datasets efficiently across queries
GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes requiring replication or checkpoints
Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern
Relational databases
» Lineage/provenance, logical logging, materialized views
Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached
  • 65. Behavior with Not Enough RAM
[Chart: iteration time vs % of the working set in memory. Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s. Performance degrades gracefully as less of the working set fits in memory.]
  • 66. RDDs for Debugging
Debugging general distributed apps is very hard. However, Spark and other recent frameworks run deterministic tasks for fault tolerance. Leverage this determinism for debugging:
» Log the lineage of all RDDs created (small)
» Let the user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions