Spark and Shark
High-Speed In-Memory Analytics
over Hadoop and Hive Data
Matei Zaharia, in collaboration with
Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff
Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin
Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin

UC Berkeley
spark-project.org
What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
 » In-memory data storage for very fast iterative queries
 » General execution graphs and powerful optimizations
 » Up to 40x faster than Hadoop

Compatible with Hadoop’s storage APIs
 » Can read/write to any Hadoop-supported system,
   including HDFS, HBase, SequenceFiles, etc
What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores,
and queries (HiveQL, UDFs, etc)
Similar speedups of up to 40x
Project History
Spark project started in 2009, open sourced 2010
Shark started summer 2011, alpha April 2012
In use at Berkeley, Princeton, Klout, Foursquare,
Conviva, Quantifind, Yahoo! Research & others
250+ member meetup, 500+ watchers on GitHub
This Talk
Spark programming model
User applications
Shark overview
Demo
Next major addition: Streaming Spark
Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
 » More complex, multi-stage applications (graph
   algorithms, machine learning)
 » More interactive ad-hoc queries
 » More real-time online processing

All three of these apps require fast data sharing
across parallel jobs
Data Sharing in MapReduce
[Diagram: iterative jobs read the input from HDFS and write back to HDFS between every iteration (iter. 1, iter. 2, ...); each ad-hoc query (query 1, query 2, query 3, ...) re-reads the input from HDFS to produce its result]

Slow due to replication, serialization, and disk IO
Data Sharing in Spark

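[Diagram: iterations (iter. 1, iter. 2, ...) pass data to each other directly; for queries, the input goes through one-time processing into distributed memory, and query 1, query 2, query 3, ... all read from that memory]

10-100× faster than network and disk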
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
 » Distributed collections of objects that can be cached
   in memory across cluster nodes
 » Manipulated through various parallel operators
 » Automatically rebuilt on failure

Interface
 » Clean language-integrated API in Scala
 » Can be used interactively from Scala console
Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns

lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers and collects results; each worker reads one block (Block 1-3) and keeps its own cache (Cache 1-3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance
RDDs track the series of transformations used to
build them (their lineage) to recompute lost data
E.g.: messages = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))

[Diagram: lineage chain HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))]
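The lineage idea can be sketched in plain Scala without Spark. Everything named below (`Lineage`, `Source`, `Filtered`, `Mapped`, `recoverAfterLoss`, and the sample log lines) is hypothetical and invented for illustration; these are not Spark classes. Each dataset stores only its parent and the function applied to it, so a lost result can be recomputed from the source rather than restored from a replica.

```scala
// Plain-Scala sketch of lineage-based recovery (hypothetical types, not Spark's API).
object LineageSketch {
  // A dataset is described by how to build it, not by its contents.
  sealed trait Lineage[A] {
    def compute(): Seq[A]
  }

  // Leaf of the lineage graph: stands in for stable storage such as an HDFS file.
  final case class Source[A](load: () => Seq[A]) extends Lineage[A] {
    def compute(): Seq[A] = load()
  }

  // Remembers its parent and the predicate, like FilteredRDD on the slide.
  final case class Filtered[A](parent: Lineage[A], pred: A => Boolean) extends Lineage[A] {
    def compute(): Seq[A] = parent.compute().filter(pred)
  }

  // Remembers its parent and the mapping function, like MappedRDD on the slide.
  final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  // Made-up log lines with tab-separated fields; field 2 is the message.
  val logLines: Seq[String] = Seq(
    "2012-06-01\terror\tdisk full",
    "2012-06-01\tinfo\tjob started",
    "2012-06-02\terror\ttimeout"
  )

  // Same shape as the slide: source -> filter -> map.
  val messages: Lineage[String] =
    Mapped(
      Filtered(Source(() => logLines), (l: String) => l.contains("error")),
      (l: String) => l.split('\t')(2)
    )

  // Simulate losing a cached copy, then rebuilding it from lineage.
  def recoverAfterLoss(): Seq[String] = {
    var cached: Option[Seq[String]] = Some(messages.compute())
    cached = None // the node holding the cached data fails
    cached.getOrElse(messages.compute())
  }
}
```

Calling `LineageSketch.recoverAfterLoss()` rebuilds the messages by replaying the filter and map over the source, which is the recovery path the slide describes.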
Example: Logistic Regression
// Load data in memory once
val data = spark.textFile(...).map(readPoint).cache()

// Initial parameter vector
var w = Vector.random(D)

// Repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
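For readers without a Spark installation, the loop above can be approximated in plain Scala, with a local collection standing in for the cached RDD. This is only a sketch under stated assumptions: `Point`, `dot`, `add`, `train`, `predict`, and the toy data are illustrative names and values that do not appear on the slide, and the weights start at zero rather than at `Vector.random(D)` so that the result is deterministic. The per-point gradient term is the same formula as on the slide.

```scala
// Plain-Scala sketch of the same gradient-descent loop (no Spark involved).
object LogisticRegressionSketch {
  // A labelled point; y is expected to be +1.0 or -1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (u, v) => u + v }

  def train(data: Seq[Point], dims: Int, iterations: Int): Array[Double] = {
    require(data.nonEmpty, "need at least one point")
    // Assumption: zeros instead of a random start, to keep the sketch deterministic.
    var w = Array.fill(dims)(0.0)
    for (_ <- 1 to iterations) {
      // "Map" step: per-point gradient contribution, same formula as the slide.
      val contributions = data.map { p =>
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * scale)
      }
      // "Reduce" step: element-wise sum of all contributions.
      val gradient = contributions.reduce((a, b) => add(a, b))
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    w
  }

  // Classify by the sign of the linear score.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0

  // Tiny linearly separable data set, made up for illustration.
  val toyData: Seq[Point] = Seq(
    Point(Array(2.0, 1.0), 1.0),
    Point(Array(1.5, 2.0), 1.0),
    Point(Array(-1.0, -1.5), -1.0),
    Point(Array(-2.0, -0.5), -1.0)
  )
}
```

In the Spark version the `map` and `reduce` run in parallel over cached partitions, so every iteration after the first avoids re-reading and re-parsing the input.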
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30)]
Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s
Supported Operators
map              reduce        sample

filter           count         cogroup

groupBy          reduceByKey   take

sort             groupByKey    partitionBy

join             first         pipe

leftOuterJoin    union         save

rightOuterJoin   cross         ...
Other Engine Features
General graphs of operators (for efficiency)

[Diagram: operator graph over RDDs A-G built from groupBy, map, union, and join, split into Stage 1, Stage 2, and Stage 3; the legend distinguishes RDDs from cached data]
Other Engine Features
Controllable data partitioning to minimize
communication
PageRank Performance
[Chart: iteration time (s)]
Hadoop: 171
Basic Spark: 72
Spark + Controlled Partitioning: 23
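One plausible reading of "controllable partitioning" is that two keyed datasets are partitioned by the same function, so records with the same key already sit together when they are joined. The plain-Scala sketch below illustrates that idea only; `partitionOf`, `partitionBy`, and `localJoin` are hypothetical helpers written for this example, not Spark's API, and the link and rank data in the usage note are made up.

```scala
// Plain-Scala sketch of co-partitioning (hypothetical helper names, not Spark's API).
object PartitioningSketch {
  // Assign a key to one of numPartitions buckets; floorMod keeps the
  // result non-negative even when hashCode is negative.
  def partitionOf(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Group key-value pairs by the partition their key maps to.
  def partitionBy[V](data: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    data.groupBy { case (k, _) => partitionOf(k, numPartitions) }

  // With both sides partitioned by the same function, matching keys always
  // share a partition, so each partition can be joined without moving data.
  def localJoin[A, B](
      left: Map[Int, Seq[(String, A)]],
      right: Map[Int, Seq[(String, B)]]
  ): Seq[(String, (A, B))] =
    left.toSeq.flatMap { case (part, ls) =>
      val rs = right.getOrElse(part, Seq.empty)
      for ((k, a) <- ls; (k2, b) <- rs if k == k2) yield (k, (a, b))
    }
}
```

For example, partitioning a `links` dataset and a `ranks` dataset by URL with the same partition count lets `localJoin` pair each page's links with its rank partition by partition, which is the kind of communication saving the chart attributes to controlled partitioning.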
User Applications
Spark Users
Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
...
Conviva GeoReport
[Chart: time (hours)]
Hive: 20
Spark: 0.5

Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated
reading, deserialization and filtering
Quantifind Feed Analysis

[Diagram: Data Feeds -> Parsed Documents -> Extracted Entities -> In-Memory Time Series; a Web App issues Spark queries against the in-memory time series]

Load data feeds, extract entities, and compute
in-memory tables every few minutes
Let users drill down interactively from AJAX app
Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm scaling to 160 nodes

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
Shark: Hive on Spark
Motivation
Hive is great, but Hadoop’s execution engine
makes even the smallest queries take minutes
Scala is good for programmers, but many data
users only know SQL
Can we extend Hive to run on Spark?
Hive Architecture
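[Diagram: clients (CLI, JDBC) talk to the Driver, which contains the SQL Parser, Query Optimizer, Physical Plan, and Execution components and consults the Metastore; execution runs on MapReduce over HDFS]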
Shark Architecture
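[Diagram: clients (CLI, JDBC) talk to the Driver, which contains the SQL Parser, Query Optimizer, Physical Plan, and Execution components plus a Cache Mgr., and consults the Metastore; execution runs on Spark over HDFS]
[Engle et al, SIGMOD 2012]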
Efficient In-Memory Storage
Simply caching Hive records as Java objects is
inefficient due to high per-object overhead
Instead, Shark employs column-oriented
storage using arrays of primitive types
        Row Storage        Column Storage
        1   john    4.1      1      2     3

        2   mike    3.5     john   mike sally

        3   sally   6.4     4.1    3.5   6.4
Benefit: similarly compact size to serialized data, but >5x faster to access
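The difference between the two layouts can be sketched in plain Scala. This is an illustration only, not Shark's actual storage classes: `Row`, `Columns`, `toColumns`, and `averageScore` are hypothetical names, and the field names `id`, `name`, and `score` are assumptions, since the slide gives only the values.

```scala
// Plain-Scala sketch of row vs. column layout (illustrative; not Shark's classes).
object ColumnarSketch {
  // Row storage: one JVM object per record, each carrying its own object overhead.
  final case class Row(id: Int, name: String, score: Double)

  // Column storage: one array per column; numeric columns are primitive arrays.
  final case class Columns(ids: Array[Int], names: Array[String], scores: Array[Double])

  // Pivot a sequence of rows into per-column arrays.
  def toColumns(rows: Seq[Row]): Columns =
    Columns(
      rows.map(_.id).toArray,
      rows.map(_.name).toArray,
      rows.map(_.score).toArray
    )

  // A scan over one column touches a single primitive array and nothing else.
  def averageScore(cols: Columns): Double =
    if (cols.scores.isEmpty) 0.0 else cols.scores.sum / cols.scores.length

  // The three records shown on the slide.
  val rows: Seq[Row] = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))
}
```

With three records the saving is negligible; the point is that the number of objects in the columnar layout grows with the number of columns rather than the number of rows.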
Using Shark
CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
 » A few esoteric features are not yet supported

Can also call from Scala to mix with Spark


  Early alpha release at shark.cs.berkeley.edu
Benchmark Query 1
SELECT * FROM grep WHERE field LIKE '%XYZ%';

[Chart: execution time (secs)]
Shark (cached): 12s
Shark: 182s
Hive: 207s
Benchmark Query 2
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

[Chart: execution time (secs)]
Shark (cached): 126s
Shark: 270s
Hive: 447s
Demo
What’s Next
Streaming Spark
Many key big data apps must run in real time
 » Live event reporting, click analysis, spam filtering, …

Event-passing systems (e.g. Storm) are low-level
 » Users must worry about FT, state, consistency
 » Programming model is different from batch, so must
   write each app twice

Can we give streaming a Spark-like interface?
Our Idea
Run streaming computations as a series of very
short (<1 second) batch jobs
 » “Discretized stream processing”

Keep state in memory as RDDs (automatically
recover from any failure)
Provide a functional API similar to Spark
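The idea of running a stream as a series of short batch jobs can be sketched in plain Scala. The names below (`Event`, `discretize`, `countBatch`, `runningCounts`) are hypothetical and are not the Spark Streaming API. The sketch assumes events carry a millisecond timestamp, and it simply skips intervals that contain no events.

```scala
// Plain-Scala sketch of discretized stream processing (hypothetical names).
object DStreamSketch {
  final case class Event(timeMs: Long, url: String)

  // Cut the stream into fixed-length intervals; each non-empty interval
  // becomes one small batch, ordered by time.
  def discretize(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  // One short, deterministic batch job: count page views per URL.
  def countBatch(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.url).map { case (url, evs) => url -> evs.size }

  // Running state is an immutable value updated once per batch, so a lost
  // state can be rebuilt by re-running the batches that produced it.
  def runningCounts(events: Seq[Event], batchMs: Long): Seq[Map[String, Int]] =
    discretize(events, batchMs)
      .scanLeft(Map.empty[String, Int]) { (state, batch) =>
        countBatch(batch).foldLeft(state) { case (acc, (url, n)) =>
          acc.updated(url, acc.getOrElse(url, 0) + n)
        }
      }
      .tail // drop the empty initial state
}
```

Because each batch job is deterministic and the state after batch n depends only on the state after batch n-1 and the batch itself, recovery reduces to recomputation, which is the property the slide relies on.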
Spark Streaming API
Functional operators on discretized streams (D-streams)
New "stateful" operators for windowing

pageViews = readStream("...", "1s")         // D-stream
ones = pageViews.map(ev => (ev.url, 1))     // transformation
counts = ones.runningReduce(_ + _)

// Sliding window reduce with "add" and "subtract" functions
sliding = ones.reduceByWindow("5s", _ + _, _ - _)

[Diagram: at each time step (t = 1, t = 2, ...) pageViews is mapped to ones and reduced into counts; each box is an RDD made up of partitions]
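A sliding-window reduce with "add" and "subtract" functions can update the previous window's result instead of re-reducing the whole window. The plain-Scala sketch below shows that technique under stated assumptions: the function shares its name with the slide's `reduceByWindow` for readability only, but its signature is invented here (the window is a count of batches rather than a duration string, and a `zero` value is passed explicitly), and it is not the Spark Streaming API.

```scala
// Plain-Scala sketch of an incremental sliding-window reduce (illustrative only).
object SlidingWindowSketch {
  // batchValues(i) is the already-reduced value of batch i, for example the
  // number of page views in second i. For each batch, the result holds the
  // reduction over the most recent `window` batches.
  def reduceByWindow[A](batchValues: Seq[A], window: Int, zero: A)(
      add: (A, A) => A,
      subtract: (A, A) => A
  ): Seq[A] = {
    require(window > 0, "window must be positive")
    val indexed = batchValues.toIndexedSeq
    indexed.indices
      .scanLeft(zero) { (acc, i) =>
        // Add the batch entering the window...
        val grown = add(acc, indexed(i))
        // ...and subtract the batch leaving it, rather than re-reducing everything.
        if (i >= window) subtract(grown, indexed(i - window)) else grown
      }
      .tail // drop the initial zero
  }
}
```

For per-batch counts 1, 2, 3, 4, 5 and a window of three batches, `SlidingWindowSketch.reduceByWindow(Seq(1, 2, 3, 4, 5), 3, 0)(_ + _, _ - _)` yields 1, 3, 6, 9, 12. Each step costs one add and at most one subtract regardless of window length, which is why an inverse function is worth supplying when one exists.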
Streaming + Batch + Ad-Hoc
Combining D-streams with historical data:
   pageViews.join(historicCounts).map(...)

Interactive ad-hoc queries on stream state:
   counts.slice("21:00", "21:05").topK(10)
How Fast Can It Go?
Can process 4 GB/s (42M records/s) of data on 100
nodes at sub-second latency
Recovers from failures within 1 sec
[Charts: cluster throughput (GB/s) for Grep and TopKWords, x-axis from 0 to 100, with one series each for 1 sec and 2 sec latency]
Maximum throughput possible with 1s or 2s latency
Performance vs Storm
[Charts: Grep throughput and TopK throughput (MB/s/node), Spark vs. Storm, at record sizes of 10000 and 100 bytes]

Storm limited to 10,000 records/s/node
Also tried S4: 7000 records/s/node
Streaming Roadmap
Alpha release expected in August
Spark engine changes already in “dev” branch
Conclusion
Spark & Shark speed up your interactive, complex,
and (soon) streaming analytics on Hadoop data
Download and docs: www.spark-project.org
 » Easy local mode and deploy scripts for EC2

User meetup: meetup.com/spark-users
Training camp at Berkeley in August!

                      matei@berkeley.edu / @matei_zaharia
Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory]
Cache disabled: 68.8
25%: 58.1
50%: 40.7
75%: 29.7
Fully cached: 11.5
Software Stack
Built on Spark: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, …
Engine: Spark
Runs on: Local mode, EC2, Apache Mesos, YARN

Advanced spark training advanced spark internals and tuning reynold xin
 
Scala+data
Scala+dataScala+data
Scala+data
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter Experience
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 

Recently uploaded

Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 

Recently uploaded (20)

Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 

Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data

  • 1. Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data. Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin. UC Berkeley, spark-project.org
  • 2. What is Spark? Not a modified version of Hadoop Separate, fast, MapReduce-like engine » In-memory data storage for very fast iterative queries » General execution graphs and powerful optimizations » Up to 40x faster than Hadoop Compatible with Hadoop’s storage APIs » Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc
  • 3. What is Shark? Port of Apache Hive to run on Spark Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc) Similar speedups of up to 40x
  • 4. Project History Spark project started in 2009, open sourced 2010 Shark started summer 2011, alpha April 2012 In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others 250+ member meetup, 500+ watchers on GitHub
  • 5. This Talk Spark programming model User applications Shark overview Demo Next major addition: Streaming Spark
  • 6. Why a New Programming Model? MapReduce greatly simplified big data analysis But as soon as it got popular, users wanted more: » More complex, multi-stage applications (graph algorithms, machine learning) » More interactive ad-hoc queries » More real-time online processing All three of these apps require fast data sharing across parallel jobs
  • 7. Data Sharing in MapReduce [Diagram: an iterative job reads its input from HDFS and writes back to HDFS between iter. 1, iter. 2, and so on; interactive queries 1 to 3 each re-read the input from HDFS to produce results 1 to 3.] Slow due to replication, serialization, and disk IO
  • 8. Data Sharing in Spark [Diagram: the input goes through one-time processing into distributed memory, which is then shared across iter. 1, iter. 2, and so on, and across queries 1 to 3.] 10-100× faster than network and disk
  • 9. Spark Programming Model Key idea: resilient distributed datasets (RDDs) » Distributed collections of objects that can be cached in memory across cluster nodes » Manipulated through various parallel operators » Automatically rebuilt on failure Interface » Clean language-integrated API in Scala » Can be used interactively from Scala console
  • 10. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
      val lines = spark.textFile("hdfs://...")             // base RDD
      val errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
      val messages = errors.map(_.split('\t')(2))
      val cachedMsgs = messages.cache()
      cachedMsgs.filter(_.contains("foo")).count           // action
      cachedMsgs.filter(_.contains("bar")).count
      [Diagram: the driver sends tasks to three workers and collects results; each worker reads one input block (Block 1 to 3) and keeps its part of cachedMsgs in memory (Cache 1 to 3).]
      Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
      Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
  • 11. Fault Tolerance RDDs track the series of transformations used to build them (their lineage) to recompute lost data. E.g.:
      val messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
      Lineage: HadoopRDD (path = hdfs://…), then FilteredRDD (func = _.contains(...)), then MappedRDD (func = _.split(…))
  • 12. Example: Logistic Regression
      val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
      var w = Vector.random(D)                                // initial parameter vector
      for (i <- 1 to ITERATIONS) {
        // repeated MapReduce steps to do gradient descent
        val gradient = data.map(p =>
          (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
      }
      println("Final w: " + w)
  • 13. Logistic Regression Performance [Chart: running time (s) against number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark.] Hadoop: 127 s / iteration. Spark: first iteration 174 s, further iterations 6 s.
  • 14. Supported Operators map reduce sample filter count cogroup groupBy reduceByKey take sort groupByKey partitionBy join first pipe leftOuterJoin union save rightOuterJoin cross ...
  • 15. Other Engine Features General graphs of operators (for efficiency) [Diagram: a graph of RDDs A to G connected by groupBy, map, union, and join, split into Stages 1 to 3; the legend distinguishes RDDs from cached data.]
  • 16. Other Engine Features Controllable data partitioning to minimize communication. PageRank Performance [Bar chart, iteration time (s)]: Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23
  • 19. Applications In-memory analytics & anomaly detection (Conviva) Interactive queries on data streams (Quantifind) Exploratory log analysis (Foursquare) Traffic estimation w/ GPS data (Mobile Millennium) Twitter spam classification (Monarch) ...
  • 20. Conviva GeoReport [Bar chart, time (hours)]: Hive 20, Spark 0.5. Group aggregations on many keys w/ same filter. 40× gain over Hive from avoiding repeated reading, deserialization and filtering
  • 21. Quantifind Feed Analysis [Diagram: a Spark app turns data feeds into parsed documents, extracted entities, and in-memory time series; a web app sends queries against the in-memory data.] Load data feeds, extract entities, and compute in-memory tables every few minutes. Let users drill down interactively from AJAX app
  • 22. Mobile Millennium Project Estimate city traffic from crowdsourced GPS data Iterative EM algorithm scaling to 160 nodes Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
  • 23. Shark: Hive on Spark
  • 24. Motivation Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes Scala is good for programmers, but many data users only know SQL Can we extend Hive to run on Spark?
  • 25. Hive Architecture [Diagram: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Metastore; MapReduce; HDFS]
  • 26. Shark Architecture [Diagram: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.); Metastore; Spark; HDFS] [Engle et al, SIGMOD 2012]
  • 27. Efficient In-Memory Storage Simply caching Hive records as Java objects is inefficient due to high per-object overhead. Instead, Shark employs column-oriented storage using arrays of primitive types
      Row storage:    (1, john, 4.1)   (2, mike, 3.5)   (3, sally, 6.4)
      Column storage: [1, 2, 3]   [john, mike, sally]   [4.1, 3.5, 6.4]
  • 28. Efficient In-Memory Storage (continued) Benefit: similarly compact size to serialized data, but >5x faster to access
  • 29. Using Shark CREATE TABLE mydata_cached AS SELECT … Run standard HiveQL on it, including UDFs » A few esoteric features are not yet supported Can also call from Scala to mix with Spark Early alpha release at shark.cs.berkeley.edu
  • 30. Benchmark Query 1
      SELECT * FROM grep WHERE field LIKE '%XYZ%';
      Execution time (secs): Shark (cached) 12s, Shark 182s, Hive 207s
  • 31. Benchmark Query 2
      SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
      FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
      WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
      GROUP BY V.sourceIP ORDER BY earnings DESC LIMIT 1;
      Execution time (secs): Shark (cached) 126s, Shark 270s, Hive 447s
  • 32. Demo
  • 34. Streaming Spark Many key big data apps must run in real time » Live event reporting, click analysis, spam filtering, … Event-passing systems (e.g. Storm) are low-level » Users must worry about FT, state, consistency » Programming model is different from batch, so must write each app twice Can we give streaming a Spark-like interface?
  • 35. Our Idea Run streaming computations as a series of very short (<1 second) batch jobs » “Discretized stream processing” Keep state in memory as RDDs (automatically recover from any failure) Provide a functional API similar to Spark
  • 36. Spark Streaming API Functional operators on discretized streams. New "stateful" operators for windowing
      val pageViews = readStream("...", "1s")
      val ones = pageViews.map(ev => (ev.url, 1))
      val counts = ones.runningReduce(_ + _)
      val sliding = ones.reduceByWindow("5s", _ + _, _ - _)   // sliding window reduce with "add" and "subtract" functions
      [Diagram: the D-streams pageViews, ones, and counts at t = 1, t = 2, and so on; map and reduce are the transformations between them; each D-stream is a series of RDDs, each made of partitions.]
  • 37. Streaming + Batch + Ad-Hoc Combining D-streams with historical data: pageViews.join(historicCounts).map(...) Interactive ad-hoc queries on stream state: counts.slice("21:00", "21:05").topK(10)
  • 38. How Fast Can It Go? Can process 4 GB/s (42M records/s) of data on 100 nodes at sub-second latency. Recovers from failures within 1 sec [Charts: cluster throughput (GB/s) for Grep and TopKWords, with 1 sec and 2 sec latency series.] Maximum throughput possible with 1s or 2s latency
  • 39. Performance vs Storm [Charts: Grep throughput and TopK throughput (MB/s/node) against record size (100 and 10000 bytes), Spark vs. Storm.] Storm limited to 10,000 records/s/node. Also tried S4: 7000 records/s/node
  • 40. Streaming Roadmap Alpha release expected in August Spark engine changes already in “dev” branch
  • 41. Conclusion Spark & Shark speed up your interactive, complex, and (soon) streaming analytics on Hadoop data Download and docs: www.spark-project.org » Easy local mode and deploy scripts for EC2 User meetup: meetup.com/spark-users Training camp at Berkeley in August! matei@berkeley.edu / @matei_zaharia
  • 42. Behavior with Not Enough RAM [Bar chart, iteration time (s) by % of working set in memory]: cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, fully cached 11.5
  • 43. Software Stack [Diagram: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, and others run on top of Spark; Spark runs in local mode, on Apache Mesos, on EC2, and on YARN.]
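
The row versus column layout from the "Efficient In-Memory Storage" slides can be sketched in plain Scala. This is only an illustration of the idea, not Shark's actual code: the names `Row`, `ColumnTable`, `fromRows`, and `averageScore` are hypothetical, and the three sample records are the ones shown on the slide.

```scala
// Hypothetical row type matching the three-column example on the slide.
final case class Row(id: Int, name: String, score: Double)

// Column-oriented layout: one array per column instead of one object per row.
// The Int and Double columns are primitive arrays, so they carry no per-row objects.
final class ColumnTable(val ids: Array[Int],
                        val names: Array[String],
                        val scores: Array[Double]) {
  def size: Int = ids.length

  // Reassemble a single row on demand.
  def row(i: Int): Row = Row(ids(i), names(i), scores(i))

  // A scan over one column touches only that column's array.
  def averageScore: Double = if (size == 0) 0.0 else scores.sum / size
}

object ColumnTable {
  // Convert a row-oriented collection into the column-oriented layout.
  def fromRows(rows: Seq[Row]): ColumnTable =
    new ColumnTable(rows.map(_.id).toArray,
                    rows.map(_.name).toArray,
                    rows.map(_.score).toArray)
}

object ColumnDemo {
  val rows: Seq[Row] = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))
  val table: ColumnTable = ColumnTable.fromRows(rows)
}
```

In this sketch an aggregate such as `ColumnDemo.table.averageScore` reads a single `Array[Double]`, which is the kind of access pattern the slides credit for the faster scans.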

Editor's Notes

  1. Each iteration is, for example, a MapReduce job
  2. Key idea: add “variables” to the “functions” in functional programming
  3. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  4. This will show interactive search on 50 GB of Wikipedia data
  5. Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join
  6. Streaming Spark offers similar speed while providing FT and consistency guarantees that these systems lack
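
The sliding window reduce with "add" and "subtract" functions from the Spark Streaming API slide can be illustrated without Spark. The sketch below is a plain-Scala assumption of how such an incremental window might work, not the Streaming Spark implementation: `SlidingWindow.reduce` is a hypothetical helper, each element of `batches` stands for one batch interval's value, and `window` is the window length measured in batches (5 for a "5s" window over "1s" batches).

```scala
object SlidingWindow {
  // For each batch, return the reduced value of the last `window` batches.
  // `add` folds in the newest batch; `subtract` removes the batch that just
  // left the window, so each step costs two operations instead of `window`.
  def reduce[A](batches: Seq[A], window: Int)
               (add: (A, A) => A, subtract: (A, A) => A): List[A] = {
    require(window > 0, "window must be positive")
    if (batches.isEmpty) Nil
    else {
      val out = scala.collection.mutable.ArrayBuffer[A](batches.head)
      for (i <- 1 until batches.length) {
        val grown = add(out.last, batches(i))
        // Once more than `window` batches have arrived, drop the oldest one.
        val next = if (i >= window) subtract(grown, batches(i - window)) else grown
        out += next
      }
      out.toList
    }
  }
}
```

For example, `SlidingWindow.reduce(Seq(1, 2, 3, 4, 5), 3)(_ + _, _ - _)` gives the running three-batch sums. The approach only applies when the reduce function has an inverse, which is why the slide's operator takes both `_ + _` and `_ - _`.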