The Spark Ecosystem

Michael Malak
technicaltidbit.com

Global Big Data, Apr 23, 2013
Agenda
•    What Hadoop gives us
•    What everyone is complaining about in 2013
•    Spark
       – Berkeley Team
       – BDAS (Berkeley Data Analytics Stack)
       – RDDs (Resilient Distributed Datasets)
       – Shark
       – Spark Streaming
       – Other Spark subsystems
What Hadoop Gives Us
• HDFS
• Map/Reduce




Hadoop: HDFS




                                 Image from mark.chmarny.com




Hadoop: Map/Reduce




Image from blog.octo.com




                                                        Image from people.apache.org/~rdonkin




Map/Reduce Tools

[Stack diagram, top to bottom:]

   Pig Script    HiveQL    HBase App
   Pig           Hive
   Hadoop
   Linux
Hadoop Distribution Dogs in the Race

[Table of Hadoop distributions (shown as logos) and their query tools; the recoverable
query tools are Apache Drill and Stinger.]
Other Open Source Solutions
• Druid
• Spark




Not just caching, but streaming
•    1st generation: HDFS
•    2nd generation: Caching & “Push” Map/Reduce
•    3rd generation: Streaming




Berkeley Team
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into a 6-year program

[Image from Ion Stoica's slides from Strata 2013 presentation]
BDAS (Berkeley Data Analytics Stack)

[Stack diagram, top to bottom:]

   Bagel App    Shark App    Spark Streaming App    Spark App
   Bagel        Shark        Spark Streaming
   Spark        Hadoop/HDFS
   Mesos
   Linux
RDDs
         (Resilient Distributed Dataset)




                               Image from Matei Zaharia’s paper




RDDs: Laziness

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))   // i.e. x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))
                  .filter(_.contains("foo"))       // all of the above are lazy
val cnt = errors.count                             // count is an action: execution happens here
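Because the whole chain stays lazy until an action runs, an RDD can also be marked for
caching up front. A minimal sketch building on the example above (the cache call and
the second action are additions for illustration, not from the slide):

val errors = lines.filter(_.startsWith("ERROR"))
errors.cache()                            // ask Spark to keep this RDD in cluster memory once computed
val total = errors.count                  // first action: computes the RDD and caches it
val fooTotal = errors.filter(_.contains("foo")).count   // later actions reuse the cached data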
RDDs: Transformations vs. Actions

Transformations:
• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, frac, seed)
• union(otherDataset)
• groupByKey[K,V](func)
• reduceByKey[K,V](func)
• join[K,V,W](otherDataset)
• cogroup[K,V,W1,W2](other1, other2)
• cartesian[U](otherDataset)
• sortByKey[K,V]

Actions:
• reduce(func)
• collect()
• count()
• take(n)
• first()
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• foreach(func)

[K,V] in Scala is the same as <K,V> templates in C++, Java.
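As a rough illustration of how these compose (a sketch only, reusing the spark context
variable from the laziness example; the input path and word-count logic are hypothetical):

val counts = spark.textFile("hdfs://.../words.txt")   // hypothetical input
                  .flatMap(_.split(" "))              // transformation
                  .map(word => (word, 1))             // transformation
                  .reduceByKey(_ + _)                 // transformation
val top = counts.take(5)                              // action: this line triggers the job
top.foreach(println)                                  // local println of the returned array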
Hive vs. Shark

[Diagram: both accept HiveQL; Hive runs over HDFS files, while Shark runs over HDFS
files plus in-memory RDDs.]
Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES
  ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;


Either statement creates a table held in the cluster's memory via RDD.cache(); the
second form relies on Shark's "_cached" table-name suffix convention.


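At the RDD level this amounts to roughly the following sketch (the path is hypothetical,
and Shark's actual in-memory store is a more specialized columnar format):

val wiki = spark.textFile("hdfs://.../wiki")   // hypothetical source of the wiki table's rows
wiki.cache()                                   // keep the rows in cluster memory
println(wiki.count)                            // the first action materializes and caches the RDD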
Shark: Just a Shim

[Architecture images from Reynold Xin's presentation, showing Shark as a thin shim on
top of the existing Hive stack.]
What about “Big Data”?

[Chart: a data-size scale from KB up to PB, with a "Shark Effectiveness" band
highlighted around the GB-TB range.]
Median Hadoop job input size




                               Image from Reynold Xin’s presentation


Spark Streaming: Motivation

[Diagram: on the order of 1,000,000 clients sending events that land in HDFS.]
Spark Streaming: DStream
• “A series of small batches”

[Diagram: incoming JSON events are grouped into one RDD per 2-second batch; the
sequence of RDDs is the DStream.]

   {"id": "hercman", "eventType": "buyGoods"},
   {"id": "hercman", "eventType": "buyGoods"},
   {"id": "shewolf", "eventType": "error"}        ->  RDD (2 sec)

   {"id": "shewolf", "eventType": "error"}        ->  RDD (2 sec)

   {"id": "catlover", "eventType": "buyGoods"},
   {"id": "hercman", "eventType": "logOff"}       ->  RDD (2 sec)

   ...
Spark Streaming: DAG

[The DAG (matching the example code on the next slide):]

   Kafka (JSON) -> DStream[String] -> DStream.transform -> DStream[evObj]

   DStream[evObj] -> .filter(_.eventType == "error")    -> .foreach(println)

   DStream[evObj] -> .filter(_.eventType == "buyGoods") -> .map((_.id, 1))
                                                        -> .groupByKey
                                                        -> .foreach(println)
Spark Streaming: Example Code

// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

val usersBuying = events.filter(_.eventType == "buyGoods").map((_.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start
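The streaming examples rely on an event class, evObj, that the slides never show. A
hypothetical sketch of what it might look like, using Scala's built-in JSON parser
(field names come from the DStream slide; everything else is an assumption):

import scala.util.parsing.json.JSON

// Hypothetical event class assumed by the examples; not from the original slides.
// Parses a JSON string such as {"id": "hercman", "eventType": "buyGoods"}.
class evObj(json: String) extends Serializable {
  private val fields = JSON.parseFull(json)
                           .getOrElse(Map.empty[String, Any])
                           .asInstanceOf[Map[String, Any]]
  val id: String        = fields.getOrElse("id", "").toString
  val eventType: String = fields.getOrElse("eventType", "").toString
}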
Stateful Spark Streaming

class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff"))
    None                                               // user logged off: discard the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    s.numErrors += values.count(_.eventType == "error")
    Some(s)
  }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
val states = events.map(e => (e.id, e))                // key every event by user id
                   .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd =>
  println("Num users experiencing errors: " + rdd.filter(_._2.numErrors > 0).count))
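One detail the slide omits: stateful operations such as updateStateByKey require a
checkpoint directory on the StreamingContext so per-key state can be saved periodically.
A minimal sketch (the path is hypothetical):

// Required before using updateStateByKey: periodic state checkpointing.
ssc.checkpoint("hdfs://.../checkpoints")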
Other Spark Subsystems
• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• MLbase (Machine Learning)
Teaser
• Future Meetup: Machine learning from real-time data streams
