The Spark Ecosystem

       Michael Malak


   technicaltidbit.com
Agenda
•    What Hadoop gives us
•    What everyone is complaining about in 2013
•    Spark
       – Berkeley Team
       – BDAS (Berkeley Data Analytics Stack)
       – RDDs (Resilient Distributed Datasets)
       – Shark
       – Spark Streaming
       – Other Spark subsystems
Global Big Data Apr 23, 2013   technicaltidbit.com   2
What Hadoop Gives Us
• HDFS
• Map/Reduce




Hadoop: HDFS




                                 Image from mark.chmarny.com




Hadoop: Map/Reduce




Image from blog.octo.com




                                                        Image from people.apache.org/~rdonkin
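
The map, shuffle, and reduce phases pictured above can be sketched with plain Scala collections. This is only an illustration under stated assumptions: no Hadoop is involved, a local Seq stands in for the input splits, and the names WordCountSketch, mapPhase, shuffle, and reducePhase are made up for this example.

```scala
// Plain Scala collections stand in for a cluster here; no Hadoop is involved.
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: bring all values for the same key together.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce phase: fold each key's values into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.sum) }

  // The whole job: map every line, shuffle by word, reduce per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))
}
```

For example, WordCountSketch.wordCount(Seq("a b a", "b c")) counts "a" and "b" twice and "c" once.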




Map/Reduce Tools


          Pig Script                     HiveQL          HBase App

              Pig                         Hive

                                        Hadoop

                                          Linux




Hadoop Distribution Dogs in the Race
                Hadoop Distribution             Query Tool

                                                Apache Drill

                                                Stinger

Other Open Source Solutions
• Druid
• Spark




Not just caching, but streaming
•    1st generation: HDFS
•    2nd generation: Caching & “Push” Map/Reduce
•    3rd generation: Streaming




Berkeley Team
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from Strata 2013 presentation
BDAS (Berkeley Data Analytics Stack)

      Bagel App          Shark App          Spark Streaming App

        Bagel              Shark              Spark Streaming        Spark App

  Hadoop/HDFS                          Spark

                                       Mesos

                                       Linux

RDDs
         (Resilient Distributed Dataset)




                               Image from Matei Zaharia’s paper




RDDs: Laziness
val lines = spark.textFile("hdfs://...")
// _.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")
val errors = lines.filter(_.startsWith("ERROR"))   // lazy
                  .map(_.split('\t')(2))           // lazy
                  .filter(_.contains("foo"))       // lazy
val cnt = errors.count                             // action: triggers the computation
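
The same laziness can be observed without a cluster. The sketch below is an analogy only: a Scala Iterator stands in for an RDD, size stands in for count, and the names LazinessSketch and run are hypothetical. A counter records how many lines the first filter has inspected, both before and after the terminal call.

```scala
// An Iterator stands in for an RDD; this is an analogy, not Spark code.
object LazinessSketch {
  // Returns (lines inspected before the terminal call,
  //          lines inspected after it,
  //          number of matching lines).
  def run(lines: Seq[String]): (Int, Int, Int) = {
    var inspected = 0
    // Building the pipeline: Iterator.filter and Iterator.map are lazy.
    val errors = lines.iterator
      .filter { l => inspected += 1; l.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))
    val before = inspected   // no line has been read yet
    val cnt = errors.size    // terminal call, plays the role of count
    (before, inspected, cnt)
  }
}
```

The first element of the result stays at zero because nothing is read until size is called, which mirrors the way the filter and map calls on the slide do no work until count runs.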




RDDs: Transformations vs. Actions
  Transformations                              Actions
  map(func)                                    reduce(func)
  filter(func)                                 collect()
  flatMap(func)                                count()
  sample(withReplacement, frac, seed)          take(n)
  union(otherDataset)                          first()
  groupByKey[K,V]                              saveAsTextFile(path)
  reduceByKey[K,V](func)                       saveAsSequenceFile(path)
  join[K,V,W](otherDataset)                    foreach(func)
  cogroup[K,V,W1,W2](other1, other2)
  cartesian[U](otherDataset)
  sortByKey[K,V]

  [K,V] in Scala plays the same role as <K,V> in C++ templates and Java generics.
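
To make the key-value transformations in the table concrete, here is a minimal sketch of what groupByKey, reduceByKey, and join compute, written against local Scala sequences rather than RDDs. The object PairOpsSketch and its method signatures are assumptions made for illustration; they are not Spark's API, and they evaluate eagerly where the real transformations are lazy.

```scala
// Local, eager stand-ins for three pair transformations; not Spark's API.
object PairOpsSketch {
  // groupByKey: collect all values that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // reduceByKey: combine the values of each key with an associative function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => (k, vs.reduce(f)) }

  // join: pair up values from both sides for every key present in both.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))
}
```

The type parameters [K, V] and [K, V, W] here are the same bracket notation the slide uses for the pair operations.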

Hive vs. Shark

  Hive:   HiveQL queries run against HDFS files
  Shark:  HiveQL queries run against HDFS files + RDDs

Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES
  ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;


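Each statement creates a table that is stored in the cluster’s
  memory using RDD.cache(): the first through the shark.cache table
  property, the second through the _cached suffix on the table name.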


Shark: Just a Shim

                                                                     Shark




                                 Images from Reynold Xin’s presentation




What about “Big Data”?

[Figure: data-size scale running KB, MB, GB, TB, PB, with “Shark Effectiveness” marked between the GB and TB levels]

Median Hadoop job input size




                               Image from Reynold Xin’s presentation


Spark Streaming: Motivation




x1,000,000 clients
                                              HDFS




Spark Streaming: DStream
• “A series of small batches”
• Each 2-second batch of events is one RDD; the sequence of RDDs is the DStream

  RDD (2 sec):  {"id": "hercman", "eventType": "buyGoods"}
                {"id": "hercman", "eventType": "buyGoods"}
                {"id": "shewolf", "eventType": "error"}

  RDD (2 sec):  {"id": "shewolf", "eventType": "error"}
                ...

  RDD (2 sec):  {"id": "catlover", "eventType": "buyGoods"}
                {"id": "hercman", "eventType": "logOff"}

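
The "series of small batches" idea can be sketched by cutting a timestamped event list into fixed 2-second windows. This is a local illustration, not Spark Streaming: the Event case class, its timeMs field, and the toBatches helper are hypothetical names introduced here, and windows with no events are simply skipped in this sketch.

```scala
// Local illustration of micro-batching; not Spark Streaming code.
object MicroBatchSketch {
  // Hypothetical event type: arrival time in milliseconds plus the two JSON fields.
  case class Event(timeMs: Long, id: String, eventType: String)

  // Assign each event to a fixed-width window; each window's events form one batch.
  // Batches are returned in time order.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)
}
```

With batchMs set to 2000, events arriving at 100 ms and 1900 ms land in the first batch, and an event at 2500 ms lands in the second.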

Spark Streaming: DAG

The DAG:
  Kafka → DStream[String] (JSON) → DStream.transform → DStream[evObj], which feeds two branches:
    1. DStream.filter(_.eventType == "error") → DStream.foreach(println)
    2. DStream.filter(_.eventType == "buyGoods") → DStream.map(e => (e.id, 1)) → DStream.groupByKey → DStream.foreach(println)


Spark Streaming: Example Code
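// Initialize: 2-second batches, reading JSON strings from Kafka
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON string into an evObj
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: number of error events per batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: number of distinct users buying per batch
// (e => (e.id, 1) is spelled out because (_.id, 1) would be a tuple holding a function)
val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start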
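
What the two branches compute for a single batch can be checked locally. The sketch below assumes a hypothetical Ev case class in place of the slide's evObj and uses a plain Seq in place of an RDD; groupBy stands in for groupByKey.

```scala
// One batch processed with local collections; not Spark Streaming code.
object BatchDagSketch {
  // Hypothetical stand-in for the slide's evObj.
  case class Ev(id: String, eventType: String)

  // One batch in, two numbers out:
  // the count of error events, and the count of distinct users who bought something.
  def process(batch: Seq[Ev]): (Int, Int) = {
    val errorCount = batch.count(_.eventType == "error")
    val usersBuying = batch.filter(_.eventType == "buyGoods")
      .map(e => (e.id, 1))
      .groupBy(_._1)   // stands in for groupByKey: one entry per user
      .size
    (errorCount, usersBuying)
  }
}
```

Grouping by user before counting is why two buyGoods events from the same user in one batch count as a single buying user.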




Stateful Spark Streaming
class ErrorsPerUser(var numErrors: Int = 0) extends Serializable
val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
    if (values.exists(_.eventType == "logOff"))
        None                                    // user logged off: drop the state
    else {
        val s = state.getOrElse(new ErrorsPerUser)
        values.foreach(e => {
            e.eventType match {
                case "error" => s.numErrors += 1
                case _       =>                 // ignore other event types
            }
        })
        Option(s)
    }
}

// DAG: keep error and logOff events, keyed by user id, so updateFunc sees both
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = errorsAndLogOffs.map(e => (e.id, e))
                             .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
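
The per-key state update can be traced by hand with a fold over a few batches. This sketch assumes the intent is that a logOff clears a user's state and that errors otherwise accumulate across batches. The names StateSketch, Ev, update, step, and run are hypothetical, the state is a plain Int rather than ErrorsPerUser, and a Map stands in for the state stream.

```scala
// Local trace of a per-key state update across batches; not Spark Streaming code.
object StateSketch {
  // Hypothetical stand-in for the slide's evObj.
  case class Ev(id: String, eventType: String)

  // Per-user update: drop the state on logOff, otherwise add this batch's errors.
  def update(values: Seq[Ev], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Apply the update to every user seen in the batch or already holding state.
  def step(state: Map[String, Int], batch: Seq[Ev]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    (state.keySet ++ byUser.keySet).toSeq.flatMap { id =>
      update(byUser.getOrElse(id, Seq.empty), state.get(id)).map(n => (id, n))
    }.toMap
  }

  // Fold the batches in arrival order, starting from empty state.
  def run(batches: Seq[Seq[Ev]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(step)
}
```

A user who appears in no events for a batch keeps the previous count, because update is still called for that user with an empty value list.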



Other Spark Subsystems
• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
•              (Machine Learning)




Teaser
                                  • Future Meetup: Machine
                                    learning from real-time
                                    data streams





More Related Content

What's hot

Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Robert Grossman
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?cneudecker
 
Big Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTUREBig Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTUREJazz Yao-Tsung Wang
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project GuidanceVarad Meru
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data scienceDeepak Singh
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...Geoffrey Fox
 
Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work WebinarNGDATA
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoopguest27e6764
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASAIan Foster
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Cloudera, Inc.
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 

What's hot (20)

Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Hadoop
HadoopHadoop
Hadoop
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
Big Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTUREBig Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTURE
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data science
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
 
Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work Webinar
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoop
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 

Similar to Spark 2013-04-17

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Big Data with Modern R & Spark
Big Data with Modern R & SparkBig Data with Modern R & Spark
Big Data with Modern R & SparkXavier de Pedro
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopStefano Paluello
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"Nicola Ferraro
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueShay Sofer
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchDirk Petersen
 
Druid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druidDruid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druidYousun Jeong
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 

Similar to Spark 2013-04-17 (20)

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Bids talk 9.18
Bids talk 9.18Bids talk 9.18
Bids talk 9.18
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Big Data with Modern R & Spark
Big Data with Modern R & SparkBig Data with Modern R & Spark
Big Data with Modern R & Spark
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Tese phd
Tese phdTese phd
Tese phd
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
Druid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druidDruid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druid
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Spark 2013-04-17

  • 1. The Spark Ecosystem Michael Malak technicaltidbit.com
  • 2. Agenda • What Hadoop gives us • What everyone is complaining about in 2013 • Spark – Berkeley Team – BDAS (Berkeley Data Analytics Stack) – RDDs (Resilient Distributed Datasets) – Shark – Spark Streaming – Other Spark subsystems Global Big Data Apr 23, 2013 technicaltidbit.com 2
  • 3. What Hadoop Gives Us • HDFS • Map/Reduce
  • 4. Hadoop: HDFS (Image from mark.chmarny.com)
  • 5. Hadoop: Map/Reduce (Images from blog.octo.com and people.apache.org/~rdonkin)
  • 6. Map/Reduce Tools (stack diagram, top to bottom): Pig Script, HiveQL, HBase App; Pig, Hive; Hadoop; Linux
  • 7. Hadoop Distribution Dogs in the Race (table of Hadoop Distribution vs. Query Tool; the distribution logos were not captured; query tools listed: Apache Drill, Stinger)
  • 8. Other Open Source Solutions • Druid • Spark
  • 9. Not just caching, but streaming • 1st generation: HDFS • 2nd generation: Caching & “Push” Map/Reduce • 3rd generation: Streaming
  • 10. Berkeley Team • 40 students • 8 faculty • 3 staff software engineers • Silicon Valley style skunkworks office space • 2 years into 6 year program (Image from Ion Stoica’s slides from Strata 2013 presentation)
  • 11. BDAS (Berkeley Data Analytics Stack) (stack diagram, top to bottom): Spark Streaming App, Bagel App, Shark App, Spark App; Spark Streaming, Bagel, Shark; Spark; Hadoop/HDFS; Mesos; Linux
  • 12. RDDs (Resilient Distributed Datasets) (Image from Matei Zaharia’s paper)
  • 13. RDDs: Laziness
      val lines = spark.textFile("hdfs://...")
      val errors = lines.filter(_.startsWith("ERROR"))   // _.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")
                        .map(_.split('\t')(2))
                        .filter(_.contains("foo"))       // all lazy up to here
      val cnt = errors.count                             // action!
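The lazy-until-action behavior on slide 13 can be imitated without a cluster. The sketch below uses a plain Scala collection view as a stand-in for an RDD; that substitution is an assumption made for illustration and is not the Spark API. The sample log lines and the name `LazinessSketch` are hypothetical. A counter shows that no line is examined until `size`, playing the role of `count`, forces the pipeline.

```scala
object LazinessSketch {
  // Hypothetical tab-separated log lines: level, time, message.
  val lines: Seq[String] = Seq(
    "ERROR\t10:01\tfoo failed",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar failed",
    "ERROR\t10:04\tfoo timed out"
  )

  // Counts how many lines the first filter has actually examined.
  var examined = 0

  // .view makes the chain lazy, much like RDD transformations:
  // building it only records a recipe.
  val errors = lines.view
    .filter { l => examined += 1; l.startsWith("ERROR") }
    .map(_.split('\t')(2))
    .filter(_.contains("foo"))

  // Snapshot taken before any "action" has run.
  val examinedBeforeAction = examined

  // Forcing the view stands in for an action such as count.
  val cnt = errors.size

  // Snapshot taken after the pipeline has been forced.
  val examinedAfterAction = examined
}
```

Comparing `examinedBeforeAction` with `examinedAfterAction` makes the deferred work visible; in Spark the same deferral is what lets the scheduler plan a whole chain of transformations before touching any data.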
  • 14. RDDs: Transformations vs. Actions
      Transformations: map(func), filter(func), flatMap(func), sample(withReplacement, frac, seed), union(otherDataset), groupByKey[K,V](func), reduceByKey[K,V](func), join[K,V,W](otherDataset), cogroup[K,V,W1,W2](other1, other2), cartesian[U](otherDataset), sortByKey[K,V]
      Actions: reduce(func), collect(), count(), take(n), first(), saveAsTextFile(path), saveAsSequenceFile(path), foreach(func)
      [K,V] in Scala same as <K,V> templates in C++, Java
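The [K,V] notation on slide 14 denotes Scala type parameters. As a rough local sketch, two of the keyed transformations can be modelled as generic functions over ordinary collections. This is plain Scala, not Spark, and `groupByKeyLocal`, `reduceByKeyLocal`, and the sample data are hypothetical names chosen for illustration.

```scala
object KeyedOpsSketch {
  // [K, V] are type parameters, like <K, V> in Java generics or C++ templates.
  // Local model of groupByKey: collect all values that share a key.
  def groupByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Local model of reduceByKey: combine the values of each key with func.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKeyLocal(pairs).map { case (k, vs) => (k, vs.reduce(func)) }

  // Hypothetical (user, 1) pairs, one per purchase event.
  val purchases = Seq(("hercman", 1), ("shewolf", 1), ("hercman", 1))

  val grouped = groupByKeyLocal(purchases)          // values per user
  val counts  = reduceByKeyLocal(purchases)(_ + _)  // purchases per user
}
```

The compiler infers K = String and V = Int from `purchases`, so neither call needs explicit type arguments.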
  • 15. Hive vs. Shark (diagram): Hive runs HiveQL over HDFS files; Shark runs HiveQL over HDFS files + RDDs
  • 16. Shark: Copy from HDFS to RDD
      CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
      CREATE TABLE wiki_cached AS SELECT * FROM wiki;
      Creates a table that is stored in a cluster’s memory using RDD.cache().
  • 17. Shark: Just a Shim (Images from Reynold Xin’s presentation)
  • 18. What about “Big Data”? (Diagram: data-size scale of KB, MB, GB, TB, PB, annotated with Shark effectiveness)
  • 19. Median Hadoop job input size (Image from Reynold Xin’s presentation)
  • 20. Spark Streaming: Motivation (Diagram: x1,000,000 clients and HDFS)
  • 21. Spark Streaming: DStream • “A series of small batches”
      RDD (2 sec): {"id": "hercman", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "buyGoods"}, {"id": "shewolf", "eventType": "error"}
      RDD (2 sec): {"id": "shewolf", "eventType": "error"}
      ...
      RDD (2 sec): {"id": "catlover", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "logOff"}
      The sequence of RDDs makes up the DStream.
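The "series of small batches" idea on slide 21 can be sketched locally by bucketing timestamped events into 2-second intervals, with each bucket standing in for one RDD of the DStream. This is plain Scala rather than Spark Streaming; the `Event` class, its `tsMillis` field, and the timestamps are assumptions added for illustration.

```scala
object MicroBatchSketch {
  // Hypothetical event type; tsMillis is an assumed arrival timestamp.
  final case class Event(id: String, eventType: String, tsMillis: Long)

  val batchMillis = 2000L // 2-second batches, as on the slide

  // Assign each event to the interval containing its timestamp and return
  // the batches in time order. Intervals without events are simply absent
  // in this simplified model.
  def toBatches(events: Seq[Event]): Seq[Seq[Event]] =
    events
      .groupBy(_.tsMillis / batchMillis)
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  val stream = Seq(
    Event("hercman", "buyGoods", 100L),
    Event("hercman", "buyGoods", 900L),
    Event("shewolf", "error", 1500L),
    Event("shewolf", "error", 2500L),
    Event("catlover", "buyGoods", 4100L),
    Event("hercman", "logOff", 5900L)
  )

  val batches = toBatches(stream)
}
```

Each element of `batches` can then be processed with ordinary batch operations, which is the property that lets a streaming job reuse the RDD transformations and actions.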
  • 22. Spark Streaming: DAG (diagram)
      Kafka (JSON) -> DStream[String] -> .transform -> DStream[EvObj], which feeds two branches:
      .filter(_.eventType == "error") -> .foreach(println)
      .filter(_.eventType == "buyGoods") -> .map(e => (e.id, 1)) -> .groupByKey -> .foreach(println)
  • 23. Spark Streaming: Example Code
      // Initialize
      val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
      val messages = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)
      // DAG: parse each message, then branch on event type
      val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
      val errorCounts = events.filter(_.eventType == "error")
      errorCounts.foreach(rdd => println(rdd.count))
      val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1)).groupByKey
      usersBuying.foreach(rdd => println(rdd.count))
      // Go
      ssc.start
  • 24. Stateful Spark Streaming
      class ErrorsPerUser(var numErrors: Int = 0) extends Serializable
      val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
        if (values.exists(_.eventType == "logOff")) None   // user logged off: drop the state
        else {
          val s = state.getOrElse(new ErrorsPerUser)
          s.numErrors += values.count(_.eventType == "error")
          Some(s)
        }
      }
      // DAG: keep error and logOff events, keyed by user id
      val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
      val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
      val states = errorsAndLogOffs.map(e => (e.id, e)).updateStateByKey[ErrorsPerUser](updateFunc)
      // Off-DAG
      states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
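The contract behind `updateStateByKey` on slide 24 is that, for each key, an update function receives the batch's new values plus the previous state and returns the new state, with `None` dropping the key. A minimal local model of that contract is sketched below in plain Scala; `updateStateLocal`, the simplified String-valued events, and the sample batches are hypothetical and only illustrate the semantics, not Spark's implementation.

```scala
object StateSketch {
  // Local model of updateStateByKey: for every key seen in the batch or in
  // the previous state, call update(newValues, oldState); None drops the key.
  def updateStateLocal[K, V, S](
      state: Map[K, S],
      batch: Seq[(K, V)]
  )(update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val valuesByKey = batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val keys = state.keySet ++ valuesByKey.keySet
    keys.flatMap { k =>
      update(valuesByKey.getOrElse(k, Seq.empty[V]), state.get(k)).map(s => (k, s))
    }.toMap
  }

  // Values are event types; state is the running error count per user.
  val update = (events: Seq[String], errors: Option[Int]) =>
    if (events.contains("logOff")) None
    else Some(errors.getOrElse(0) + events.count(_ == "error"))

  // Two hypothetical 2-second batches of (user id, event type) pairs.
  val batch1 = Seq(("shewolf", "error"), ("hercman", "error"), ("shewolf", "error"))
  val batch2 = Seq(("hercman", "logOff"), ("shewolf", "error"))

  val after1 = updateStateLocal(Map.empty[String, Int], batch1)(update)
  val after2 = updateStateLocal(after1, batch2)(update)
}
```

Threading `after1` into the second call mirrors how state is carried from one batch interval to the next; returning an immutable count instead of mutating an object is a design choice of this sketch.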
  • 25. Other Spark Subsystems • Bagel (similar to Google Pregel) • Sparkler (Matrix decomposition) • (Machine Learning)
  • 26. Teaser • Future Meetup: Machine learning from real-time data streams