Making Big Data Analytics Interactive and Real-Time
Presentation Transcript

  • Spark: Making Big Data Analytics Interactive and Real-Time
    Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
    spark-project.org · UC Berkeley
  • Overview
    Spark is a parallel framework that provides:
    » Efficient primitives for in-memory data sharing
    » Simple APIs in Scala, Java, SQL
    » High generality (applicable to many emerging problems)
    This talk covers:
    » What it does
    » How people are using it (including some surprises)
    » Current research
  • Motivation
    MapReduce simplified data analysis on large, unreliable clusters. But as soon as it became popular, users wanted more:
    » More complex, multi-pass applications (e.g. machine learning, graph algorithms)
    » More interactive ad-hoc queries
    » More real-time stream processing
    One reaction: specialized models for some of these apps (e.g. Pregel, Storm)
  • Motivation
    Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks: efficient primitives for data sharing.
  • Examples
    [Diagram: an iterative job writing to and reading from HDFS between each iteration, and ad-hoc queries each re-reading their input from HDFS]
    Slow due to replication and disk I/O, but necessary for fault tolerance
  • Goal: Sharing at Memory Speed
    [Diagram: the same workloads with one-time processing of the input, then iterations and queries sharing data in memory]
    10-100× faster than network/disk, but how do we get fault tolerance?
  • Challenge
    How to design a distributed memory abstraction that is both fault-tolerant and efficient?
  • Existing Systems
    Existing in-memory storage systems have interfaces based on fine-grained updates:
    » Reads and writes to cells in a table
    » E.g. databases, key-value stores, distributed memory
    This requires replicating data or logs across nodes for fault tolerance ⇒ expensive!
    » 10-100× slower than a memory write
  • Solution: Resilient Distributed Datasets (RDDs)
    Provide an interface based on coarse-grained operations (map, group-by, join, …)
    Efficient fault recovery using lineage:
    » Log one operation to apply to many elements
    » Recompute lost partitions on failure
    » No cost if nothing fails
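    As a minimal sketch of what coarse-grained operations and lineage look like in practice (assuming `sc` is a SparkContext; the file path is a placeholder):

      // Each operation applies to the whole dataset, so only the operation
      // itself needs to be logged, not the data it touches.
      val lines  = sc.textFile("hdfs://...")            // base RDD
      val errors = lines.filter(_.startsWith("ERROR"))  // derived via filter
      val codes  = errors.map(_.split("\t")(2))         // derived via map

      // If a partition of `codes` is lost, Spark re-reads the matching input
      // block and re-applies filter and map to rebuild just that partition.
      codes.count()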
  • RDD Recovery
    [Diagram: the in-memory sharing workloads from the "Goal" slide, repeated to show that RDDs recover them without replication]
  • Generality of RDDs
    RDDs can express surprisingly many parallel algorithms:
    » These naturally apply the same operation to many items
    Capture many current programming models:
    » Data flow models: MapReduce, Dryad, SQL, …
    » Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …
    Allow these models to be composed
  • Outline: Programming interface, Examples, User applications, Implementation, Demo, Current research (Spark Streaming)
  • Spark Programming Interface
    Language-integrated API in Scala*
    Provides:
    » Resilient distributed datasets (RDDs): partitioned collections with controllable caching
    » Operations on RDDs: transformations (define RDDs) and actions (compute results)
    » Restricted shared variables (broadcast variables, accumulators)
    *Also Java and SQL, and soon Python
  • Example: Log Mining
    Load error messages from a log into memory, then interactively search for various patterns:

      lines = spark.textFile("hdfs://...")              // base RDD
      errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
      messages = errors.map(_.split('\t')(2))
      cachedMsgs = messages.persist()
      cachedMsgs.filter(_.contains("foo")).count        // action
      cachedMsgs.filter(_.contains("bar")).count

    [Diagram: the driver ships tasks to workers; each worker caches its block of messages and returns results]
    Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
  • Fault Recovery
    RDDs track the graph of transformations that built them (their lineage) and use it to rebuild lost data. E.g.:

      messages = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))

    Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
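    For illustration, released Spark versions let you print this lineage chain directly (a minimal sketch; the path is a placeholder and the output format varies by version):

      val messages = sc.textFile("hdfs://...")          // HadoopRDD
                       .filter(_.contains("error"))     // FilteredRDD
                       .map(_.split("\t")(2))           // MappedRDD
      // Prints the chain of RDDs and the operations that connect them.
      println(messages.toDebugString)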
  • Fault Recovery Results
    [Chart: iteration time (s) over 10 iterations; most iterations take 57-59 s, with spikes of 119 s and 81 s around the iteration where the failure happens, as lost partitions are recomputed]
  • Example: Logistic Regression
    Goal: find the best line separating two sets of points
    [Diagram: scattered + and - points, a random initial line, and the target separating line]
  • Example: Logistic Regression

      val data = spark.textFile(...).map(readPoint).persist()
      var w = Vector.random(D)
      for (i <- 1 to ITERATIONS) {
        val gradient = data.map(p =>
          (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
      }
      println("Final w: " + w)
  • Logistic Regression Performance
    [Chart: running time (min) vs number of iterations (1-30) for Hadoop and Spark]
    Hadoop: 110 s per iteration. Spark: first iteration 80 s, further iterations 6 s.
  • Example: Collaborative Filtering
    Goal: predict users' movie ratings based on past ratings of other movies
    [Matrix R: users × movies, with known ratings (1-5) and ? entries to predict]
  • Model and Algorithm
    Model R as the product of user and movie feature matrices A and B, of sizes U×K and M×K: R = A · Bᵀ
    Alternating Least Squares (ALS):
    » Start with random A & B
    » Optimize user vectors (A) based on movies
    » Optimize movie vectors (B) based on users
    » Repeat until converged
  • Serial ALS

      var R = readRatingsMatrix(...)
      var A = // array of U random vectors
      var B = // array of M random vectors

      for (i <- 1 to ITERATIONS) {
        A = (0 until U).map(i => updateUser(i, B, R))   // (0 until U) is a Range object
        B = (0 until M).map(i => updateMovie(i, A, R))
      }
  • Naïve Spark ALS

      var R = readRatingsMatrix(...)
      var A = // array of U random vectors
      var B = // array of M random vectors

      for (i <- 1 to ITERATIONS) {
        A = spark.parallelize(0 until U, numSlices)
                 .map(i => updateUser(i, B, R))
                 .collect()
        B = spark.parallelize(0 until M, numSlices)
                 .map(i => updateMovie(i, A, R))
                 .collect()
      }

    Problem: R is re-sent to all nodes in each iteration
  • Efficient Spark ALS

      var R = spark.broadcast(readRatingsMatrix(...))   // solution: mark R as a broadcast variable
      var A = // array of U random vectors
      var B = // array of M random vectors

      for (i <- 1 to ITERATIONS) {
        A = spark.parallelize(0 until U, numSlices)
                 .map(i => updateUser(i, B, R.value))
                 .collect()
        B = spark.parallelize(0 until M, numSlices)
                 .map(i => updateMovie(i, A, R.value))
                 .collect()
      }

    Result: 3× performance improvement
  • Scaling Up Broadcast
    [Charts: iteration time (s), split into communication and computation, vs number of machines (10-90), for the initial HDFS-based broadcast and for Cornet P2P broadcast; Cornet keeps communication time low as the cluster grows]
    [Chowdhury et al, SIGCOMM 2011]
  • Other RDD Operations
    Transformations (define a new RDD): map, flatMap, filter, sample, groupByKey, reduceByKey, sortByKey, union, join, cogroup, cross, …
    Actions (return a result to the driver program): collect, reduce, count, save, …
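    To give a feel for how these compose, here is a hypothetical word-count-style pipeline that mixes several of the transformations and actions above (again assuming `sc` is a SparkContext):

      val counts = sc.textFile("hdfs://...")
                     .flatMap(_.split(" "))          // transformation
                     .map(word => (word, 1))         // transformation
                     .reduceByKey(_ + _)             // transformation
                     .sortByKey()                    // transformation
      counts.count()       // action: triggers execution of the whole chain
      counts.collect()     // action: returns the results to the driver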
  • Spark in Java
    The Scala one-liner:

      lines.filter(_.contains("error")).count()

    becomes, in Java:

      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        Boolean call(String s) { return s.contains("error"); }
      }).count();
  • Spark in Python (Coming Soon!)

      lines = sc.textFile(sys.argv[1])
      counts = lines.flatMap(lambda x: x.split(' ')) \
                    .map(lambda x: (x, 1)) \
                    .reduceByKey(lambda x, y: x + y)
  • Outline: Programming interface, Examples, User applications, Implementation, Demo, Current research (Spark Streaming)
  • Spark Users
    400+ user meetup, 20+ contributors
  • User Applications
    » Crowdsourced traffic estimation (Mobile Millennium)
    » Video analytics & anomaly detection (Conviva)
    » Ad-hoc queries from a web app (Quantifind)
    » Twitter spam classification (Monarch)
    » DNA sequence analysis (SNAP)
    » …
  • Mobile Millennium Project
    Estimate city traffic from GPS-equipped vehicles (e.g. SF taxis)
  • Sample Data
    [Map of GPS observations from taxis in San Francisco]
    Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
  • Challenge
    Data is noisy and sparse (1 sample/minute). Must infer the path taken by each vehicle in addition to the travel time distribution on each link.
  • Solution
    EM algorithm to estimate paths and travel time distributions simultaneously
    [Diagram: observations → flatMap → weighted path samples → groupByKey → link parameters → broadcast back into the next iteration]
  • Results
    3× speedup from caching, 4.5× from broadcast [Hunter et al, SOCC 2011]
  • Conviva GeoReport
    [Chart: response time, 20 hours for Hive vs 0.5 hours for Spark]
    SQL aggregations on many keys with the same filter. 40× gain over Hive from avoiding repeated I/O, deserialization, and filtering.
  • Other Programming Models
    » Pregel on Spark (Bagel): 200 lines of code
    » Iterative MapReduce: 200 lines of code
    » Hive on Spark (Shark): 5000 lines of code; compatible with Apache Hive; machine learning ops. in Scala
  • Outline: Programming interface, Examples, User applications, Implementation, Demo, Current research (Spark Streaming)
  • Implementation
    Runs on the Apache Mesos cluster manager to coexist with Hadoop
    Supports any Hadoop storage system (HDFS, HBase, …)
    Easy local mode and EC2 launch scripts
    No changes to Scala
    [Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]
  • Task Scheduler
    » Runs general DAGs
    » Pipelines functions within a stage
    » Cache-aware data reuse & locality
    » Partitioning-aware to avoid shuffles
    [Diagram: RDDs A-G partitioned into three stages around groupBy, map, join, and union boundaries, with cached data partitions marked]
  • Language Integration
    Scala closures are Serializable Java objects:
    » Serialize on the master, load & run on workers
    Not quite enough:
    » Nested closures may reference their entire outer scope, pulling in non-Serializable variables not used inside
    » Solution: bytecode analysis + reflection (see the sketch below)
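    A hedged sketch of the capture problem and the manual workaround (the class and fields here are hypothetical; only the capture behavior is the point):

      import org.apache.spark.rdd.RDD   // package name as in Apache Spark releases

      class LogAnalyzer(lines: RDD[String]) {
        val helper  = new Object        // plain Java object: not Serializable
        val keyword = "error"

        // The closure below mentions only `keyword`, but `keyword` is a field,
        // so naive serialization would capture `this` and drag in `helper`.
        // Spark's closure cleaner uses bytecode analysis to null out outer
        // references that the closure never actually uses.
        def countErrors(): Long = {
          val kw = keyword   // manual fix: copy to a local so only the String ships
          lines.filter(_.contains(kw)).count()
        }
      }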
  • Interactive Spark
    Modified the Scala interpreter to allow Spark to be used interactively from the command line:
    » Track the variables that each line depends on
    » Ship generated classes to workers
    Enables in-memory exploration of big data
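    An illustrative shell session (the launch script name and the path are placeholders that vary by version):

      $ ./spark-shell
      scala> val pages = sc.textFile("hdfs://...")   // sc is provided by the shell
      scala> val recent = pages.filter(_.contains("2012-08")).persist()
      scala> recent.count()   // first count scans the file and caches the data
      scala> recent.count()   // later queries run against the in-memory copy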
  • Outline: Programming interface, Examples, User applications, Implementation, Demo, Current research (Spark Streaming)
  • Motivation
    Many "big data" apps need to work in real time:
    » Site statistics, spam filtering, intrusion detection, …
    To scale to 100s of nodes, we need:
    » Fault tolerance, for both crashes and stragglers
    » Efficiency: don't consume many resources beyond base processing
    This is challenging in existing streaming systems
  • Traditional Streaming Systems
    Continuous processing model:
    » Each node has long-lived mutable state
    » For each input record, update state & send new records
    [Diagram: input records pushed through nodes 1-3, each holding mutable state]
  • Traditional Streaming Systems
    Fault tolerance via replication or upstream backup:
    [Diagram: left, the nodes fully replicated with synchronized copies processing the same input; right, a single standby node for upstream backup]
    » Replication: fast recovery, but 2× hardware cost
    » Upstream backup: only need one standby, but slow to recover
    Neither approach can handle stragglers. [Borealis, Flux] [Hwang et al, 2005]
  • Observation
    Batch processing models, such as MapReduce, do provide fault tolerance efficiently:
    » Divide the job into deterministic tasks
    » Rerun failed/slow tasks in parallel on other nodes
    Idea: run streaming computations as a series of small, deterministic batch jobs (see the sketch below)
    » Same recovery schemes at a much smaller timescale
    » To keep latency low, store state in RDDs
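    A conceptual sketch of the idea (the input-collection helper is hypothetical, and this loop illustrates the model, not the actual Spark Streaming implementation):

      // Run a streaming computation as a series of small, deterministic batch jobs.
      def collectRecordsForInterval(): Seq[String] = ???  // hypothetical: ~1 s of input

      var counts = sc.parallelize(Seq.empty[(String, Int)])      // state, kept in an RDD
      while (true) {
        val batch = sc.parallelize(collectRecordsForInterval())  // immutable micro-batch
        val delta = batch.map(url => (url, 1)).reduceByKey(_ + _)
        counts = counts.union(delta).reduceByKey(_ + _)          // new state RDD each interval
        // A failed or straggling task in this small job is simply rerun in
        // parallel on another node, exactly as in batch processing.
      }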
  • Discretized Stream Processing
    [Diagram: at t = 1, the interval's input (stored reliably) is pulled into a batch operation that produces an immutable dataset (output or state) stored as an RDD; the same repeats at t = 2, …]
  • Fault Recovery
    Checkpoint state RDDs periodically. If a node fails or straggles, rebuild the lost RDD partitions in parallel on other nodes.
    [Diagram: a map from input dataset to output dataset, with lost partitions recomputed in parallel]
    Faster recovery than upstream backup, without the cost of replication
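    In Spark Streaming's API this looks roughly like the sketch below (host, port, and paths are placeholders; package names as in Apache Spark releases, and details vary by version):

      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations

      val ssc = new StreamingContext("local[2]", "CheckpointDemo", Seconds(1))
      ssc.checkpoint("hdfs://...")   // reliable storage for checkpointed state RDDs

      val views  = ssc.socketTextStream("localhost", 9999)
      val counts = views.map(url => (url, 1))
                        .updateStateByKey((vs: Seq[Int], state: Option[Int]) =>
                          Some(vs.sum + state.getOrElse(0)))
      counts.checkpoint(Seconds(10)) // write the state RDDs out every 10 s

      ssc.start()
      ssc.awaitTermination()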
  • How Fast Can It Go?
    Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency
    [Charts: maximum throughput (millions of records/s) vs nodes in cluster for Grep and Top K Words, under 1 s and 2 s latency targets]
  • Comparison with Storm
    [Charts: MB/s per node vs record size (100-1000 bytes) for Spark and Storm, on Grep and Top K Words]
    » Storm limited to 100K records/s/node
    » Also tried S4: 10K records/s/node
    » Commercial systems: O(500K) records/s total
    (These systems also lack Spark's fault tolerance guarantees)
  • How Fast Can It Recover?
    Recovers from faults/stragglers within 1 second
    [Chart: interval processing time (s) over a run of sliding WordCount on 20 nodes with a 10 s checkpoint interval; processing time spikes briefly when the failure happens, then returns to normal within a second]
  • Programming Interface
    Extension to Spark: Spark Streaming
    » All Spark operators, plus new "stateful" operators
    Running count of pageviews by URL:

      views = readStream("http:...", "1s")   // "discretized stream" (D-stream)
      ones = views.map(ev => (ev.url, 1))
      counts = ones.runningReduce(_ + _)

    [Diagram: at t = 1, t = 2, …, the views, ones, and counts D-streams are materialized as partitioned RDDs linked by map and reduce]
  • Incremental Operators

      words.reduceByWindow("5s", max)          // associative function
      words.reduceByWindow("5s", _+_, _-_)     // associative & invertible

    [Diagram: per-interval word counts feeding sliding counts from t-1 to t+4; with only an associative function each window is recomputed from its intervals, while an invertible function adds the incoming interval and subtracts the outgoing one]
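    The invertible case reduces to simple arithmetic over per-interval counts; a self-contained sketch with made-up numbers:

      // Incremental sliding-window reduce with an invertible function (_+_, _-_):
      // new window = old window + incoming interval - outgoing interval.
      val intervals = Seq(3, 1, 4, 1, 5, 9, 2)   // per-interval word counts (made up)
      val w = 5                                  // window length, in intervals
      var window = intervals.take(w).sum         // first window computed from scratch
      for (t <- w until intervals.length) {
        window = window + intervals(t) - intervals(t - w)
        println(s"window ending at interval $t: $window")
      }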
  • Applications
    [Charts: Conviva video dashboard, active video sessions (millions) scaling with nodes in the cluster (>50 session-level metrics); Mobile Millennium traffic estimation, GPS observations/sec scaling with nodes in the cluster (online EM algorithm)]
  • Unifying Streaming and Batch
    D-streams and RDDs can seamlessly be combined:
    » Same execution and fault recovery models
    Enables powerful features:
    » Combining streams with historical data (used in the Mobile Millennium application):
        pageViews.join(historicCounts).map(...)
    » Interactive ad-hoc queries on stream state (used in the Conviva application):
        pageViews.slice("21:00", "21:05").topK(10)
  • Benefits of a Unified Stack
    » Write each algorithm only once
    » Reuse data across streaming & batch jobs
    » Query stream state instead of waiting for an import
    Some users were doing this manually! (Conviva anomaly detection, Quantifind dashboard)
  • Conclusion
    "Big data" is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
    RDDs offer fault-tolerant sharing at memory speed
    Spark uses them to combine streaming, batch & interactive analytics in one system
    www.spark-project.org
  • Related Work
    » DryadLINQ, FlumeJava: similar "distributed collection" API, but cannot reuse datasets efficiently across queries
    » GraphLab, Piccolo, BigTable, RAMCloud: fine-grained writes requiring replication or checkpoints
    » Iterative MapReduce (e.g. Twister, HaLoop): implicit data sharing for a fixed computation pattern
    » Relational databases: lineage/provenance, logical logging, materialized views
    » Caching systems (e.g. Nectar): store data in files, no explicit control over what is cached
  • Behavior with Not Enough RAM
    [Chart: iteration time (s) vs % of working set in memory: 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]
  • RDDs for Debugging
    Debugging general distributed apps is very hard
    However, Spark and other recent frameworks run deterministic tasks for fault tolerance
    Leverage this determinism for debugging:
    » Log the lineage for all RDDs created (small)
    » Let the user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions