Spark	  and	  Shark	  High-­‐Speed	  In-­‐Memory	  Analytics	  over	  Hadoop	  and	  Hive	  Data	  Matei	  Zaharia,	  in	 ...
What	  is	  Spark?	  Not	  a	  modified	  version	  of	  Hadoop	  Separate,	  fast,	  MapReduce-­‐like	  engine	   » In-­‐m...
What	  is	  Shark?	  Port	  of	  Apache	  Hive	  to	  run	  on	  Spark	  Compatible	  with	  existing	  Hive	  data,	  met...
Project	  History	  Spark	  project	  started	  in	  2009,	  open	  sourced	  2010	  Shark	  started	  summer	  2011,	  al...
This	  Talk	  Spark	  programming	  model	  User	  applications	  Shark	  overview	  Demo	  Next	  major	  addition:	  Str...
Why	  a	  New	  Programming	  Model?	  MapReduce	  greatly	  simplified	  big	  data	  analysis	  But	  as	  soon	  as	  it...
Data	  Sharing	  in	  MapReduce	                 HDFS	                      HDFS	                                HDFS	    ...
Data	  Sharing	  in	  Spark	                         iter.	  1	           iter.	  2	                .	  	  .	  	  .	     I...
Spark	  Programming	  Model	  Key	  idea:	  resilient	  distributed	  datasets	  (RDDs)	    » Distributed	  collections	  ...
Example:	  Log	  Mining	   Load	  error	  messages	  from	  a	  log	  into	  memory,	  then	   interactively	  search	  fo...
Fault	  Tolerance	  RDDs	  track	  the	  series	  of	  transformations	  used	  to	  build	  them	  (their	  lineage)	  to...
Example:	  Logistic	  Regression	  val data = spark.textFile(...).map(readPoint).cache()var w = Vector.random(D)          ...
Logistic	  Regression	  Performance	                               4500	                               4000	              ...
Supported	  Operators	  map              reduce        samplefilter           count         cogroupgroupBy          reduce...
Other	  Engine	  Features	  General	  graphs	  of	  operators	  (e.g.	  map-­‐reduce-­‐reduce)	  Hash-­‐based	  reduces	  ...
Spark	  Users	  
User	  Applications	  In-­‐memory	  analytics	  &	  anomaly	  detection	  (Conviva)	  Interactive	  queries	  on	  data	  ...
Conviva	  GeoReport	    Hive	                                                   20	  Spark	              0.5	             ...
Mobile	  Millennium	  Project	  Estimate	  city	  traffic	  from	  crowdsourced	  GPS	  data	                               ...
Shark:	  Hive	  on	  Spark	  
Motivation	  Hive	  is	  great,	  but	  Hadoop’s	  execution	  engine	  makes	  even	  the	  smallest	  queries	  take	  m...
Hive	  Architecture	                	                	  	  Client	                   CLI	             JDBC	               ...
Shark	  Architecture	                	                	  	  Client	                   CLI	             JDBC	              ...
Efficient	  In-­‐Memory	  Storage	  Simply	  caching	  Hive	  records	  as	  Java	  objects	  is	  inefficient	  due	  to	  hi...
Efficient	  In-­‐Memory	  Storage	  Simply	  caching	  Hive	  records	  as	  Java	  objects	  is	  inefficient	  due	  to	  hi...
Using	  Shark	  CREATE TABLE mydata_cached AS SELECT …	  Run	  standard	  HiveQL	  on	  it,	  including	  UDFs	         » ...
Benchmark	  Query	  1	  SELECT * FROM grep WHERE field LIKE ‘%XYZ%’;  Shark (cached) 12s          Shark                   ...
Benchmark	  Query	  2	  SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earningsFROM rankings AS R, userVisits AS V ON R...
Demo	  
What’s	  Next?	  Recall	  that	  Spark’s	  model	  was	  motivated	  by	  two	  emerging	  uses	  (interactive	  and	  mul...
Streaming	  Spark	        Extends	  Spark	  to	  perform	  streaming	  computations	        Runs	  as	  a	  series	  of	  ...
Streaming	  Spark	        Extends	  Spark	  to	  perform	  streaming	  computations	        Runs	  as	  a	  series	  of	  ...
Streaming	  Spark	        Extends	  Spark	  to	  perform	  streaming	  computations	        Runs	  as	  a	  series	  of	  ...
Conclusion	  Spark	  and	  Shark	  speed	  up	  your	  interactive	  and	  complex	  analytics	  on	  Hadoop	  data	  Down...
Behavior	  with	  Not	  Enough	  RAM	                                 100	                                                ...
Software	  Stack	            Shark	                      Bagel	                Streaming	      (Hive	  on	  Spark)	     (P...
Thank You!             Page 37
Upcoming SlideShare
Loading in...5
×

Spark and shark

35,915

Published on

Published in: Technology, Education
1 Comment
103 Likes
Statistics
Notes
  • how can i work on mobile data processing done by Hadoop? please suggest
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
35,915
On Slideshare
0
From Embeds
0
Number of Embeds
43
Actions
Shares
0
Downloads
0
Comments
1
Likes
103
Embeds 0
No embeds

No notes for slide

Spark and shark

  1. 1. Spark  and  Shark  High-­‐Speed  In-­‐Memory  Analytics  over  Hadoop  and  Hive  Data  Matei  Zaharia,  in  collaboration  with  Mosharaf  Chowdhury,  Tathagata  Das,  Ankur  Dave,  Cliff  Engle,  Michael  Franklin,  Haoyuan  Li,  Antonio  Lupher,  Justin  Ma,  Murphy  McCauley,  Scott  Shenker,  Ion  Stoica,  Reynold  Xin    UC  Berkeley  spark-­‐project.org   UC  BERKELEY  
  2. 2. What  is  Spark?  Not  a  modified  version  of  Hadoop  Separate,  fast,  MapReduce-­‐like  engine   » In-­‐memory  data  storage  for  very  fast  iterative  queries   » General  execution  graphs  and  powerful  optimizations   » Up  to  40x  faster  than  Hadoop  Compatible  with  Hadoop’s  storage  APIs   » Can  read/write  to  any  Hadoop-­‐supported  system,   including  HDFS,  HBase,  SequenceFiles,  etc  
  3. 3. What  is  Shark?  Port  of  Apache  Hive  to  run  on  Spark  Compatible  with  existing  Hive  data,  metastores,  and  queries  (HiveQL,  UDFs,  etc)  Similar  speedups  of  up  to  40x  
  4. 4. Project  History  Spark  project  started  in  2009,  open  sourced  2010  Shark  started  summer  2011,  alpha  April  2012  In  use  at  Berkeley,  Princeton,  Klout,  Foursquare,  Conviva,  Quantifind,  Yahoo!  Research  &  others  200+  member  meetup,  500+  watchers  on  GitHub  
  5. 5. This  Talk  Spark  programming  model  User  applications  Shark  overview  Demo  Next  major  addition:  Streaming  Spark  
  6. 6. Why  a  New  Programming  Model?  MapReduce  greatly  simplified  big  data  analysis  But  as  soon  as  it  got  popular,  users  wanted  more:   » More  complex,  multi-­‐stage  applications  (e.g.   iterative  graph  algorithms  and  machine  learning)   » More  interactive  ad-­‐hoc  queries  Both  multi-­‐stage  and  interactive  apps  require  faster  data  sharing  across  parallel  jobs  
  7. 7. Data  Sharing  in  MapReduce   HDFS   HDFS   HDFS   HDFS   read   write   read   write   iter.  1   iter.  2   .    .    .   Input   HDFS   query  1   result  1   read   query  2   result  2   query  3   result  3   Input   .    .    .  Slow  due  to  replication,  serialization,  and  disk  IO  
  8. 8. Data  Sharing  in  Spark   iter.  1   iter.  2   .    .    .   Input   query  1   one-­‐time   processing   query  2   query  3   Input   Distributed   memory   .    .    .   10-­‐100×  faster  than  network  and  disk  
  9. 9. Spark  Programming  Model  Key  idea:  resilient  distributed  datasets  (RDDs)   » Distributed  collections  of  objects  that  can  be  cached   in  memory  across  cluster  nodes   » Manipulated  through  various  parallel  operators   » Automatically  rebuilt  on  failure  Interface   » Clean  language-­‐integrated  API  in  Scala   » Can  be  used  interactively  from  Scala  console  
  10. 10. Example:  Log  Mining   Load  error  messages  from  a  log  into  memory,  then   interactively  search  for  various  patterns   Base  RDD   Cache  1  lines = spark.textFile(“hdfs://...”) Transformed  RDD   Worker   results  errors = lines.filter(_.startsWith(“ERROR”))messages = errors.map(_.split(‘t’)(2)) tasks   Block  1   Driver  cachedMsgs = messages.cache() Action  cachedMsgs.filter(_.contains(“foo”)).countcachedMsgs.filter(_.contains(“bar”)).count Cache  2   Worker  . . . Cache  3   Worker   Block  2   Result:  sull-­‐text  s1  TB  data  in  5-­‐7  sec   fcaled  to   earch  of  Wikipedia   in  <1  sec  (vs  ec  for  on-­‐disk  data)  ata)   (vs  170  s 20  sec  for  on-­‐disk  d Block  3  
  11. 11. Fault  Tolerance  RDDs  track  the  series  of  transformations  used  to  build  them  (their  lineage)  to  recompute  lost  data  E.g:   messages = textFile(...).filter(_.contains(“error”)) .map(_.split(‘t’)(2))     HadoopRDD   FilteredRDD   MappedRDD   path  =  hdfs://…   func  =  _.contains(...)   func  =  _.split(…)  
  12. 12. Example:  Logistic  Regression  val data = spark.textFile(...).map(readPoint).cache()var w = Vector.random(D) Load  data  in  memory  once   Initial  parameter  vector  for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient Repeated  MapReduce  steps  } to  do  gradient  descent  println("Final w: " + w)
  13. 13. Logistic  Regression  Performance   4500   4000   127  s  /  iteration   3500  Running  Time  (s)   3000   2500   Hadoop   2000   Spark   1500   1000   500   first  iteration  174  s   0   further  iterations  6  s   1   5   10   20   30   Number  of  Iterations  
  14. 14. Supported  Operators  map reduce samplefilter count cogroupgroupBy reduceByKey takesort groupByKey partitionByjoin first pipeleftOuterJoin union saverightOuterJoin cross ...
  15. 15. Other  Engine  Features  General  graphs  of  operators  (e.g.  map-­‐reduce-­‐reduce)  Hash-­‐based  reduces  (faster  than  Hadoop’s  sort)  Controlled  data  partitioning  to  lower  communication   PageRank  Performance   200   171   Iteration  time  (s)   Hadoop   150   Basic  Spark   100   72   50   23   Spark  +  Controlled   Partitioning   0  
  16. 16. Spark  Users  
  17. 17. User  Applications  In-­‐memory  analytics  &  anomaly  detection  (Conviva)  Interactive  queries  on  data  streams  (Quantifind)  Exploratory  log  analysis  (Foursquare)  Traffic  estimation  w/  GPS  data  (Mobile  Millennium)  Twitter  spam  classification  (Monarch)  .  .  .  
  18. 18. Conviva  GeoReport   Hive   20  Spark   0.5   Time  (hours)   0   5   10   15   20  Group  aggregations  on  many  keys  w/  same  filter  40×  gain  over  Hive  from  avoiding  repeated  reading,  deserialization  and  filtering  
  19. 19. Mobile  Millennium  Project  Estimate  city  traffic  from  crowdsourced  GPS  data   Iterative  EM  algorithm   scaling  to  160  nodes   Credit:  Tim  Hunter,  with  support  of  the  Mobile  Millennium  team;  P.I.  Alex  Bayen;  traffic.berkeley.edu  
  20. 20. Shark:  Hive  on  Spark  
  21. 21. Motivation  Hive  is  great,  but  Hadoop’s  execution  engine  makes  even  the  smallest  queries  take  minutes  Scala  is  good  for  programmers,  but  many  data  users  only  know  SQL  Can  we  extend  Hive  to  run  on  Spark?  
  22. 22. Hive  Architecture        Client   CLI   JDBC   Driver   Meta   Physical  Plan   store   SQL   Query   Parser   Optimizer   Execution   MapReduce   HDFS  
  23. 23. Shark  Architecture        Client   CLI   JDBC   Driver   Cache  Mgr.   Meta   Physical  Plan   store   SQL   Query   Parser   Optimizer   Execution   Spark   HDFS   [Engle  et  al,  SIGMOD  2012]  
  24. 24. Efficient  In-­‐Memory  Storage  Simply  caching  Hive  records  as  Java  objects  is  inefficient  due  to  high  per-­‐object  overhead  Instead,  Shark  employs  column-­‐oriented  storage  using  arrays  of  primitive  types   Row  Storage   Column  Storage   1   john   4.1   1   2   3   2   mike   3.5   john   mike   sally   3   sally   6.4   4.1   3.5   6.4  
  25. 25. Efficient  In-­‐Memory  Storage  Simply  caching  Hive  records  as  Java  objects  is  inefficient  due  to  high  per-­‐object  overhead  Instead,  Shark  employs  column-­‐oriented  storage  using  arrays  of  primitive  types   Row  Storage   Column  Storage   1   john   4.1   1   2   3  Benefit:  similarly  compact  size  to  serialized  data,   2   mike   3.5   faster  to  access   sally   but  >5x   john   mike   3   sally   6.4   4.1   3.5   6.4  
  26. 26. Using  Shark  CREATE TABLE mydata_cached AS SELECT …  Run  standard  HiveQL  on  it,  including  UDFs   » A  few  esoteric  features  are  not  yet  supported  Can  also  call  from  Scala  to  mix  with  Spark     Early  alpha  release  at  shark.cs.berkeley.edu  
  27. 27. Benchmark  Query  1  SELECT * FROM grep WHERE field LIKE ‘%XYZ%’; Shark (cached) 12s Shark 182s Hive 207s 0 50 100 150 200 250 Execution Time (secs)
  28. 28. Benchmark  Query  2  SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earningsFROM rankings AS R, userVisits AS V ON R.pageURL = V.destURLWHERE V.visitDate BETWEEN ‘1999-01-01’ AND ‘2000-01-01’GROUP BY V.sourceIPORDER BY earnings DESCLIMIT 1; Shark (cached) 126s Shark 270s Hive 447s 0 100 200 300 400 500 Execution Time (secs)
  29. 29. Demo  
  30. 30. What’s  Next?  Recall  that  Spark’s  model  was  motivated  by  two  emerging  uses  (interactive  and  multi-­‐stage  apps)  Another  emerging  use  case  that  needs  fast  data  sharing  is  stream  processing   » Track  and  update  state  in  memory  as  events  arrive   » Large-­‐scale  reporting,  click  analysis,  spam  filtering,  etc  
  31. 31. Streaming  Spark   Extends  Spark  to  perform  streaming  computations   Runs  as  a  series  of  small  (~1  s)  batch  jobs,  keeping   state  in  memory  as  fault-­‐tolerant  RDDs   Intermix  seamlessly  with  batch  and  ad-­‐hoc  queries   map   reduceByWindow   tweetStream T=1   .flatMap(_.toLower.split) .map(word => (word, 1)) .reduceByWindow(5, _ + _) T=2   …  [Zaharia  et  al,  HotCloud  2012]  
  32. 32. Streaming  Spark   Extends  Spark  to  perform  streaming  computations   Runs  as  a  series  of  small  (~1  s)  batch  jobs,  keeping   state  in  memory  as  fault-­‐tolerant  RDDs   Intermix  seamlessly  with  batch  and  ad-­‐hoc  queries   map   reduceByWindow   tweetStream T=1   .flatMap(_.toLower.split) Result:  can  process  42  million  records/second   .map(word => (word, 1)) (4  GB/s)  on  100  nodes  at  sub-­‐second  latency   .reduceByWindow(5, _ + _) T=2   …  [Zaharia  et  al,  HotCloud  2012]  
  33. 33. Streaming  Spark   Extends  Spark  to  perform  streaming  computations   Runs  as  a  series  of  small  (~1  s)  batch  jobs,  keeping   state  in  memory  as  fault-­‐tolerant  RDDs   Intermix  seamlessly  with  batch  and  ad-­‐hoc  queries   map   reduceByWindow   tweetStream T=1   .flatMap(_.toLower.split) .map(word => (word, 1))Alpha  coming  this  summer   .reduceByWindow(5, _ + _) T=2   …  [Zaharia  et  al,  HotCloud  2012]  
  34. 34. Conclusion  Spark  and  Shark  speed  up  your  interactive  and  complex  analytics  on  Hadoop  data  Download  and  docs:  www.spark-­‐project.org     » Easy  to  run  locally,  on  EC2,  or  on  Mesos  and  soon  YARN  User  meetup:  meetup.com/spark-­‐users    Training  camp  at  Berkeley  in  August!   matei@berkeley.edu  /  @matei_zaharia  
  35. 35. Behavior  with  Not  Enough  RAM   100   68.8  Iteration  time  (s)   58.1   80   40.7   60   29.7   40   11.5   20   0   Cache   25%   50%   75%   Fully   disabled   cached   %  of  working  set  in  memory  
  36. 36. Software  Stack   Shark   Bagel   Streaming   (Hive  on  Spark)   (Pregel  on  Spark)   Spark   …   Spark   Local   Apache   EC2   YARN   mode   Mesos  
  37. 37. Thank You! Page 37

×