Spark 2013-04-17


  1. The Spark Ecosystem, Michael Malak, technicaltidbit.com
  2. Agenda
      • What Hadoop gives us
      • What everyone is complaining about in 2013
      • Spark
        – Berkeley Team
        – BDAS (Berkeley Data Analytics Stack)
        – RDDs (Resilient Distributed Datasets)
        – Shark
        – Spark Streaming
        – Other Spark subsystems
      Global Big Data, Apr 23, 2013, technicaltidbit.com
  3. What Hadoop Gives Us
      • HDFS
      • Map/Reduce
  4. Hadoop: HDFS (image from mark.chmarny.com)
  5. Hadoop: Map/Reduce (images from blog.octo.com and people.apache.org/~rdonkin)
  6. Map/Reduce Tools (stack diagram: Pig Script, HiveQL, and HBase App on top of Pig and Hive, on top of Hadoop, on top of Linux)
  7. Hadoop Distribution Dogs in the Race (table with columns "Hadoop Distribution" and "Query Tool"; surviving entries: Apache Drill, Stinger)
  8. Other Open Source Solutions
      • Druid
      • Spark
  9. Not just caching, but streaming
      • 1st generation: HDFS
      • 2nd generation: Caching & "Push" Map/Reduce
      • 3rd generation: Streaming
  10. Berkeley Team
      • 40 students
      • 8 faculty
      • 3 staff software engineers
      • Silicon Valley style skunkworks office space
      • 2 years into 6 year program
      (image from Ion Stoica's slides from his Strata 2013 presentation)
  11. BDAS (Berkeley Data Analytics Stack) (stack diagram, top to bottom: Bagel App, Shark App, Spark Streaming App, and Spark App; Bagel, Shark, and Spark Streaming; Spark; Hadoop/HDFS and Mesos; Linux)
  12. RDDs (Resilient Distributed Datasets) (image from Matei Zaharia's paper)
  13. RDDs: Laziness

      val lines = spark.textFile("hdfs://...")
      val errors = lines.filter(_.startsWith("ERROR"))   // same as x => x.startsWith("ERROR")
                        .map(_.split('\t')(2))            // all lazy up to here
                        .filter(_.contains("foo"))
      val cnt = errors.count                              // action!
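The laziness on this slide can be imitated without a cluster. The sketch below is only a stand-in under stated assumptions: a plain Scala `Iterator` plays the role of the RDD, and the `LazyDemo` object, the sample log lines, and the `examined` counter are made up for illustration. It shows that building the filter/map/filter pipeline touches no data, and that work happens only when the final "action" asks for a result.

```scala
object LazyDemo {
  // Hypothetical tab-separated log lines: level, time, message.
  val sample: Seq[String] = Seq(
    "ERROR\t10:00\tfoo failed",
    "INFO\t10:01\tfoo ok",
    "ERROR\t10:02\tbar failed",
    "ERROR\t10:03\tfoo timed out"
  )

  // Counts how many input lines the pipeline has actually looked at.
  var examined = 0

  // Builds the pipeline only; Iterator.filter and Iterator.map are lazy.
  def pipeline(lines: Seq[String]): Iterator[String] =
    lines.iterator
      .filter { l => examined += 1; l.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))

  def main(args: Array[String]): Unit = {
    val errors = pipeline(sample)
    println(s"examined before the action: $examined")
    val cnt = errors.size // the "action": forces the whole pipeline to run
    println(s"count = $cnt, examined after the action: $examined")
  }
}
```

An `Iterator` can be consumed only once, whereas an RDD can be recomputed or cached, so this mirrors only the deferred-execution aspect of the slide.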
  14. RDDs: Transformations vs. Actions
      Transformations:
        • map(func)
        • filter(func)
        • flatMap(func)
        • sample(withReplacement, frac, seed)
        • union(otherDataset)
        • groupByKey[K,V](func)
        • reduceByKey[K,V](func)
        • join[K,V,W](otherDataset)
        • cogroup[K,V,W1,W2](other1, other2)
        • cartesian[U](otherDataset)
        • sortByKey[K,V]
      Actions:
        • reduce(func)
        • collect()
        • count()
        • take(n)
        • first()
        • saveAsTextFile(path)
        • saveAsSequenceFile(path)
        • foreach(func)
      [K,V] in Scala same as <K,V> templates in C++, Java
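To make the keyed transformations in the list above more concrete, here is a minimal local sketch of what grouping and reducing by key compute. It is not Spark's implementation: `PairOpsDemo` and its two helpers are hypothetical stand-ins that work on an in-memory `Seq` of pairs, eagerly and on one machine, and the helper signatures are chosen for this sketch rather than copied from the slide.

```scala
object PairOpsDemo {
  // Local stand-in for grouping by key: collect all values that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Local stand-in for reducing by key: combine each key's values with func.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => k -> vs.reduce(func) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3)
    println(groupByKey(pairs))         // values grouped per key
    println(reduceByKey(pairs)(_ + _)) // values summed per key
  }
}
```

On a real RDD these operations stay lazy until an action such as `collect()` or `count()` runs; the local versions return their result immediately.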
  15. Hive vs. Shark (diagram: Hive runs HiveQL over HDFS files; Shark runs HiveQL over HDFS files + RDDs)
  16. Shark: Copy from HDFS to RDD

      CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
      CREATE TABLE wiki_cached AS SELECT * FROM wiki;

      Creates a table that is stored in a cluster's memory using RDD.cache().
  17. Shark: Just a Shim (images from Reynold Xin's presentation)
  18. What about "Big Data"? (chart: data-size scale from KB through MB, GB, and TB to PB, annotated with "Shark Effectiveness")
  19. Median Hadoop job input size (image from Reynold Xin's presentation)
  20. Spark Streaming: Motivation (diagram: x1,000,000 clients, HDFS)
  21. Spark Streaming: DStream
      • "A series of small batches"
      (diagram: a DStream drawn as a sequence of RDDs, one per 2 sec batch, each holding JSON events such as {"id": "shewolf", "eventType": "error"}, {"id": "hercman", "eventType": "buyGoods"}, and {"id": "catlover", "eventType": "logOff"})
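The "series of small batches" idea can be sketched locally. The code below is an illustration under assumptions, not Spark Streaming itself: `MicroBatchDemo`, the `Event` case class, and its `timeMs` field are hypothetical, and an inner `Seq` stands in for each 2 sec RDD.

```scala
object MicroBatchDemo {
  // Hypothetical event record; id and eventType follow the JSON fields on the slide,
  // timeMs (arrival time in milliseconds) is an assumption added for this sketch.
  case class Event(timeMs: Long, id: String, eventType: String)

  // Cut a sequence of events into consecutive batches of batchMs milliseconds.
  // Each inner Seq plays the role of one RDD; intervals with no events are skipped here.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.timeMs / batchMs) // batch index of each event
      .toSeq
      .sortBy(_._1)                // put the batches back in time order
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100, "hercman", "buyGoods"),
      Event(900, "shewolf", "error"),
      Event(2500, "catlover", "logOff"),
      Event(4100, "hercman", "buyGoods")
    )
    // Per-batch error counts, one line per 2 sec batch.
    toBatches(events, 2000).foreach(b => println(b.count(_.eventType == "error")))
  }
}
```

Once the stream is cut this way, every batch can be processed with the same operations as a static dataset, which is the point of the DStream abstraction.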
  22. Spark Streaming: DAG
      Kafka -> DStream[String] (JSON) -> .transform -> DStream[EvObj], which feeds two branches:
        – .filter(_.eventType == "error") -> .foreach(println)
        – .filter(_.eventType == "buyGoods") -> .map((_.id,1)) -> .groupByKey -> .foreach(println)
  23. Spark Streaming: Example Code

      // Initialize
      val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
      val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

      // DAG
      val events: DStream[EvObj] = msgs.transform(rdd => rdd.map(new EvObj(_)))
      val errorCounts = events.filter(_.eventType == "error")
      errorCounts.foreach(rdd => println(rdd.count))
      val usersBuying = events.filter(_.eventType == "buyGoods")
                              .map(e => (e.id, 1))   // key each purchase by user id
                              .groupByKey
      usersBuying.foreach(rdd => println(rdd.count))

      // Go
      ssc.start
  24. Stateful Spark Streaming

      class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

      // Called per user and batch with that user's new events and previous state.
      val updateFunc = (values: Seq[EvObj], state: Option[ErrorsPerUser]) => {
        if (values.exists(_.eventType == "logOff")) None   // user logged off: drop the state
        else {
          val s = state.getOrElse(new ErrorsPerUser)
          values.foreach(e => if (e.eventType == "error") s.numErrors += 1)
          Option(s)
        }
      }

      // DAG
      val events: DStream[EvObj] = msgs.transform(rdd => rdd.map(new EvObj(_)))
      val states = events.map(e => (e.id, e))   // key by user, keep the whole event
                         .updateStateByKey[ErrorsPerUser](updateFunc)

      // Off-DAG
      states.foreach(rdd => println("Num users experiencing errors: " +
                                    rdd.filter(_._2.numErrors > 0).count))
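The update-function contract used on this slide (new values plus optional previous state in, optional new state out) can be tried locally. This is a sketch under assumptions, not the library's code: `StateDemo`, `step`, and `countErrors` are hypothetical names, a `Map` stands in for the state kept between batches, and each `Seq` of pairs stands in for one batch.

```scala
object StateDemo {
  // One batch step of a keyed state update. For every key that has new values
  // or existing state, call update(newValues, oldState); None drops the key.
  def step[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = (state.keySet ++ newValues.keySet).toSeq
    keys.flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty[V]), state.get(k)).map(s => (k, s)).toList
    }.toMap
  }

  // Example update function: running count per user.
  val countErrors: (Seq[Int], Option[Int]) => Option[Int] =
    (values, old) => Some(old.getOrElse(0) + values.sum)

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("hercman" -> 1, "shewolf" -> 1, "hercman" -> 1)
    val batch2 = Seq("shewolf" -> 1)
    val s1 = step(Map.empty[String, Int], batch1)(countErrors)
    val s2 = step(s1, batch2)(countErrors)
    println(s2) // running totals after two batches
  }
}
```

Returning `None` from the update function is how a key's state is discarded, which is the mechanism the slide relies on to forget users who have logged off.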
  25. Other Spark Subsystems
      • Bagel (similar to Google Pregel)
      • Sparkler (Matrix decomposition)
      • (Machine Learning)
  26. Teaser
      • Future Meetup: Machine learning from real-time data streams
