SlideShare a Scribd company logo
Spark and Shark: High-speed
Analytics over Hadoop Data
Sep 12, 2013 @ Yahoo! Tech Conference
Reynold Xin (辛湜), AMPLab, UC Berkeley
The Spark Ecosystem
Spark
Shark
SQL
HDFS / Hadoop Storage
Mesos/YARN Resource Manager
Spark
Streaming
GraphX MLBase
Today’s Talk
Spark
Shark
SQL
HDFS / Hadoop Storage
Mesos/YARN Resource Manager
Spark
Streaming
GraphX MLBase
Apache Spark
Fast and expressive cluster computing system
interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster
(2-10× on disk)
Often 5× less code
Shark
An analytic query engine built on top of Spark
» Support both SQL and complex analytics
» Can be100X faster than Apache Hive
Compatible with Hive data, metastore, queries
» HiveQL
» UDF / UDAF
» SerDes
» Scripts
Community
3000 people attended online
training
1100+ meetup members
80+ GitHub contributors
Today’s Talk
Spark
Shark
SQL
HDFS / Hadoop Storage
Mesos Resource Manager
Spark
Streaming
GraphX MLBase
Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
Hadoop MapReduce
iter. 1 iter. 2 …
Input
filesystem
read
filesystem
write
filesystem
read
filesystem
write
Iterative:
Interactive:
Input
query 1
query 2
…
select
Subset
Slow due to replication and disk I/O,
but necessary for fault tolerance
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can optionally be
cached in memory across cluster
» Manipulated through parallel operators
» Automatically recomputed on failure
Programming interface
» Functional APIs in Scala, Java, Python
» Interactive use from Scala shell
Example: Log Mining
Exposes RDDs through a functional API in Scala
Usable interactively from Scala shell
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.persist() Block 1
Block 2
Block 3
Worker
errors.filter(_.contains(“foo”)).count()
errors.filter(_.contains(“bar”)).count()
tasks
results
Errors 2
Base RDD
Transformed RDD
Action
Result: full-text search of Wikipedia in
<1 sec (vs 20 sec for on-disk data)
Result: 1 TB data in 5 sec
(vs 170 sec for on-disk data)
Worker
Errors 3
Worker
Errors 1
Master
Data Sharing in Spark
iter. 1 iter. 2 …
Input
Iterative:
Interactive:
Input
query 1
query 2select
…
10-100x faster than network/disk
Fault Tolerance
RDDs track the series of transformations used to
build them (their lineage) to recompute lost data
E.g: messages = textFile(...).filter(_.contains(“error”))
.map(_.split(„t‟)(2))
HadoopRDD
path = hdfs://…
FilteredRDD
func = _.contains(...)
MappedRDD
func = _.split(…)
Spark: Expressive API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
15
and broadcast variables, accumulators…
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WorkdCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Word Count
val docs = sc.textFiles(“hdfs://…”)
docs.flatMap(line => line.split(“s”))
.map(word => (word, 1))
.reduceByKey((a, b) => a + b)
Word Count
val docs = sc.textFiles(“hdfs://…”)
docs.flatMap(line => line.split(“s”))
.map(word => (word, 1))
.reduceByKey((a, b) => a + b)
docs.flatMap(_.split(“s”))
.map((_, 1))
.reduceByKey(_ + _)
Spark in Java and Python
Python API
lines = spark.textFile(…)
errors = lines.filter(
lambda s: "ERROR" in s)
errors.count()
Java API
JavaRDD<String> lines =
spark.textFile(…);
errors = lines.filter(
new Function<String, Boolean>() {
public Boolean call(String s) {
return s.contains("ERROR");
}
});
errors.count()
Scheduler
General task DAGs
Pipelines functions
within a stage
Cache-aware data
locality & reuse
Partitioning-aware
to avoid shuffles
Launch jobs in ms
join
union
groupBy
map
Stage 3
Stage 1
Stage 2
A: B:
C: D:
E:
F:
G:
= previously computed partition
Example: Logistic
Regression
Goal: find best line separating two sets of points
target
random initial line
22
Logistic Regression: Prep the Data
val sc = new SparkContext(“spark://master”, “LRExp”)
def loadDict(): Map[String, Double] = { … }
val dict = sc.broadcast( loadDict() )
def parser(s: String): (Array[Double], Double) = {
val parts = s.split(„t‟)
val y = parts(0).toDouble
val x = parts.drop(1).map(dict.getOrElse(_, 0.0))
(x, y)
}
val data = sc.textFile(“hdfs://d”).map(parser).cache()
Load data in memory once
Start a new Spark Session
Build a dictionary for file parsing
Broadcast the Dictionary to the Cluster
Line Parser which uses the Dictionary
23
Logistic Regression: Grad. Desc.
val sc = new SparkContext(“spark://master”, “LRExp”)
// Parsing …
val data: Spark.RDD[(Array[Double],Double)] = …
var w = Vector.random(D)
val lambda = 1.0/data.count
for (i <- 1 to ITERATIONS) {
val gradient = data.map {
case (x, y) => (1.0/(1+exp(-y*(w dot x)))–1)*y*x
}.reduce( (g1, g2) => g1 + g2 )
w -= (lambda / sqrt(i) ) * gradient
}
Initial parameter vector
Iterate
Set Learning Rate
Apply Gradient on Master
Compute Gradient Using
Map & Reduce
24
Logistic Regression
Performance
0
500
1000
1500
2000
2500
3000
3500
4000
1 5 10 20 30
RunningTime(s)
Number of Iterations
Hadoop
Spark
110 s / iteration
first iteration 80 s
further iterations 1 s
Spark ML Lib (Spark 0.8)
Classification Logistic Regression, Linear SVM
(+L1, L2)
Regression Linear Regression
(+Lasso, Ridge)
Collaborative Filtering Alternating Least Squares
Clustering K-Means
Optimization SGD, Parallel Gradient
Spark
Spark
Shark
SQL
HDFS / Hadoop Storage
Mesos Resource Manager
Spark
Streaming
GraphX MLBase
Shark
Spark
Shark
SQL
HDFS / Hadoop Storage
Mesos Resource Manager
Spark
Streaming
GraphX MLBase
Shark
A data analytics system that
» builds on Spark,
» scales out and tolerate worker failures,
» supports low-latency, interactive queries through
(optional) in-memory computation,
» supports both SQL and complex analytics,
» is compatible with Hive (storage, serdes, UDFs, types,
metadata).
Hive Architecture
Meta
store
HDFS
Client
Driver
SQL
Parse
r
Query
Optimizer
Physical Plan
Execution
CLI JDBC
MapReduce
Shark Architecture
Meta
store
HDFS
Client
Driver
SQL
Parse
r
Physical Plan
Execution
CLI JDBC
Spark
Cache Mgr.
Query
Optimizer
Shark Query Language
Hive QL and a few additional DDL add-ons:
Creating an in-memory table:
» CREATE TABLE src_cached AS SELECT * FROM …
Engine Features
Columnar Memory Store
Data Co-partitioning & Co-location
Partition Pruning based on Statistics (range, bloom
filter and others)
Spark Integration
…
Columnar Memory Store
Simply caching Hive records as JVM objects is
inefficient.
Shark employs column-oriented storage.
1
Column Storage
2 3
john mike sally
4.1 3.5 6.4
Row Storage
1 john 4.1
2 mike 3.5
3 sally 6.4Benefit: compact representation, CPU efficient
compression, cache locality.
Spark Integration
Unified system for query
processing and Spark
analytics
Query processing and ML
share the same set of
workers and caches
def logRegress(points: RDD[Point]): Vector {
var w = Vector(D, _ => 2 * rand.nextDouble - 1)
for (i <- 1 to ITERATIONS) {
val gradient = points.map { p =>
val denom = 1 + exp(-p.y * (w dot p.x))
(1 / denom - 1) * p.y * p.x
}.reduce(_ + _)
w -= gradient
}
w
}
val users = sql2rdd("SELECT * FROM user u
JOIN comment c ON c.uid=u.uid")
val features = users.mapRows { row =>
new Vector(extractFeature1(row.getInt("age")),
extractFeature2(row.getStr("country")),
...)}
val trainedVector = logRegress(features.cache())
Performance
0
25
50
75
100
Q1 Q2 Q3 Q4
Runtime(seconds)
Shark Shark (disk) Hive
1.1 0.8 0.7 1.0
1.7 TB Real Warehouse Data on 100 EC2 nodes
More Information
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos/YARN
Email: rxin@apache.org
Twitter: @rxin
Behavior with Insufficient RAM
68.8
58.1
40.7
29.7
11.5
0
20
40
60
80
100
0% 25% 50% 75% 100%
Iterationtime(s)
Percent of working set in memory
Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to
Σi∈neighbors ranki / |neighborsi|
links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
ranks = links.join(ranks).flatMap {
(url, (links, rank)) =>
links.map(dest => (dest, rank/links.size))
}.reduceByKey(_ + _)
}

More Related Content

What's hot

Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
Tudor Lapusan
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
spinningmatt
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
Apache Spark
Apache SparkApache Spark
Apache Spark
Uwe Printz
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
Fernando Rodriguez
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
Rahul Jain
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
Cepoi Eugen
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
Mathias Herberts
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 

What's hot (20)

Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Spark overview
Spark overviewSpark overview
Spark overview
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Spark core
Spark coreSpark core
Spark core
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 

Viewers also liked

C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
DataStax Academy
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
DataStax Academy
 
Big Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend StoryBig Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend Story
Amazon Web Services
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
Uri Laserson
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
Nick Dimiduk
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
Carl Steinbach
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
Hortonworks
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Kevin Weil
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 

Viewers also liked (11)

C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
 
Big Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend StoryBig Data and the Cloud a Best Friend Story
Big Data and the Cloud a Best Friend Story
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to 20130912 YTC_Reynold Xin_Spark and Shark

Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
WalmirCouto3
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
Radu Chilom
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
Taewook Eom
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
Snehal Nagmote
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
Kelly Technologies
 

Similar to 20130912 YTC_Reynold Xin_Spark and Shark (20)

Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 

Recently uploaded

Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Jen Stirrup
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 

Recently uploaded (20)

Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 

20130912 YTC_Reynold Xin_Spark and Shark

  • 1. Spark and Shark: High-speed Analytics over Hadoop Data Sep 12, 2013 @ Yahoo! Tech Conference Reynold Xin (辛湜), AMPLab, UC Berkeley
  • 2. The Spark Ecosystem Spark Shark SQL HDFS / Hadoop Storage Mesos/YARN Resource Manager Spark Streaming GraphX MLBase
  • 3. Today’s Talk Spark Shark SQL HDFS / Hadoop Storage Mesos/YARN Resource Manager Spark Streaming GraphX MLBase
  • 4. Apache Spark Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: » In-memory computing primitives » General computation graphs Improves usability through: » Rich APIs in Scala, Java, Python » Interactive shell Up to 100× faster (2-10× on disk) Often 5× less code
  • 5. Shark An analytic query engine built on top of Spark » Support both SQL and complex analytics » Can be100X faster than Apache Hive Compatible with Hive data, metastore, queries » HiveQL » UDF / UDAF » SerDes » Scripts
  • 6. Community 3000 people attended online training 1100+ meetup members 80+ GitHub contributors
  • 7.
  • 8. Today’s Talk Spark Shark SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
  • 9. Why a New Framework? MapReduce greatly simplified big data analysis But as soon as it got popular, users wanted more: » More complex, multi-pass analytics (e.g. ML, graph) » More interactive ad-hoc queries » More real-time stream processing
  • 10. Hadoop MapReduce iter. 1 iter. 2 … Input filesystem read filesystem write filesystem read filesystem write Iterative: Interactive: Input query 1 query 2 … select Subset Slow due to replication and disk I/O, but necessary for fault tolerance
  • 11. Spark Programming Model Key idea: resilient distributed datasets (RDDs) » Distributed collections of objects that can optionally be cached in memory across cluster » Manipulated through parallel operators » Automatically recomputed on failure Programming interface » Functional APIs in Scala, Java, Python » Interactive use from Scala shell
  • 12. Example: Log Mining Exposes RDDs through a functional API in Scala Usable interactively from Scala shell lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) errors.persist() Block 1 Block 2 Block 3 Worker errors.filter(_.contains(“foo”)).count() errors.filter(_.contains(“bar”)).count() tasks results Errors 2 Base RDD Transformed RDD Action Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: 1 TB data in 5 sec (vs 170 sec for on-disk data) Worker Errors 3 Worker Errors 1 Master
  • 13. Data Sharing in Spark iter. 1 iter. 2 … Input Iterative: Interactive: Input query 1 query 2select … 10-100x faster than network/disk
  • 14. Fault Tolerance RDDs track the series of transformations used to build them (their lineage) to recompute lost data E.g: messages = textFile(...).filter(_.contains(“error”)) .map(_.split(„t‟)(2)) HadoopRDD path = hdfs://… FilteredRDD func = _.contains(...) MappedRDD func = _.split(…)
  • 16. public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  • 17. Word Count val docs = sc.textFiles(“hdfs://…”) docs.flatMap(line => line.split(“s”)) .map(word => (word, 1)) .reduceByKey((a, b) => a + b)
  • 18. Word Count val docs = sc.textFiles(“hdfs://…”) docs.flatMap(line => line.split(“s”)) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) docs.flatMap(_.split(“s”)) .map((_, 1)) .reduceByKey(_ + _)
  • 19. Spark in Java and Python Python API lines = spark.textFile(…) errors = lines.filter( lambda s: "ERROR" in s) errors.count() Java API JavaRDD<String> lines = spark.textFile(…); errors = lines.filter( new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("ERROR"); } }); errors.count()
  • 20. Scheduler General task DAGs Pipelines functions within a stage Cache-aware data locality & reuse Partitioning-aware to avoid shuffles Launch jobs in ms join union groupBy map Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: G: = previously computed partition
  • 21. Example: Logistic Regression Goal: find best line separating two sets of points target random initial line 22
  • 22. Logistic Regression: Prep the Data val sc = new SparkContext(“spark://master”, “LRExp”) def loadDict(): Map[String, Double] = { … } val dict = sc.broadcast( loadDict() ) def parser(s: String): (Array[Double], Double) = { val parts = s.split(„t‟) val y = parts(0).toDouble val x = parts.drop(1).map(dict.getOrElse(_, 0.0)) (x, y) } val data = sc.textFile(“hdfs://d”).map(parser).cache() Load data in memory once Start a new Spark Session Build a dictionary for file parsing Broadcast the Dictionary to the Cluster Line Parser which uses the Dictionary 23
  • 23. Logistic Regression: Grad. Desc. val sc = new SparkContext(“spark://master”, “LRExp”) // Parsing … val data: Spark.RDD[(Array[Double],Double)] = … var w = Vector.random(D) val lambda = 1.0/data.count for (i <- 1 to ITERATIONS) { val gradient = data.map { case (x, y) => (1.0/(1+exp(-y*(w dot x)))–1)*y*x }.reduce( (g1, g2) => g1 + g2 ) w -= (lambda / sqrt(i) ) * gradient } Initial parameter vector Iterate Set Learning Rate Apply Gradient on Master Compute Gradient Using Map & Reduce 24
  • 24. Logistic Regression Performance 0 500 1000 1500 2000 2500 3000 3500 4000 1 5 10 20 30 RunningTime(s) Number of Iterations Hadoop Spark 110 s / iteration first iteration 80 s further iterations 1 s
  • 25. Spark ML Lib (Spark 0.8) Classification Logistic Regression, Linear SVM (+L1, L2) Regression Linear Regression (+Lasso, Ridge) Collaborative Filtering Alternating Least Squares Clustering K-Means Optimization SGD, Parallel Gradient
  • 26. Spark Spark Shark SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
  • 27. Shark Spark Shark SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
  • 28. Shark A data analytics system that » builds on Spark, » scales out and tolerate worker failures, » supports low-latency, interactive queries through (optional) in-memory computation, » supports both SQL and complex analytics, » is compatible with Hive (storage, serdes, UDFs, types, metadata).
  • 31. Shark Query Language Hive QL and a few additional DDL add-ons: Creating an in-memory table: » CREATE TABLE src_cached AS SELECT * FROM …
  • 32. Engine Features Columnar Memory Store Data Co-partitioning & Co-location Partition Pruning based on Statistics (range, bloom filter and others) Spark Integration …
  • 33. Columnar Memory Store Simply caching Hive records as JVM objects is inefficient. Shark employs column-oriented storage. 1 Column Storage 2 3 john mike sally 4.1 3.5 6.4 Row Storage 1 john 4.1 2 mike 3.5 3 sally 6.4Benefit: compact representation, CPU efficient compression, cache locality.
  • 34. Spark Integration Unified system for query processing and Spark analytics Query processing and ML share the same set of workers and caches def logRegress(points: RDD[Point]): Vector { var w = Vector(D, _ => 2 * rand.nextDouble - 1) for (i <- 1 to ITERATIONS) { val gradient = points.map { p => val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w } val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid") val features = users.mapRows { row => new Vector(extractFeature1(row.getInt("age")), extractFeature2(row.getStr("country")), ...)} val trainedVector = logRegress(features.cache())
  • 35. Performance 0 25 50 75 100 Q1 Q2 Q3 Q4 Runtime(seconds) Shark Shark (disk) Hive 1.1 0.8 0.7 1.0 1.7 TB Real Warehouse Data on 100 EC2 nodes
  • 36. More Information Download and docs: www.spark-project.org » Easy to run locally, on EC2, or on Mesos/YARN Email: rxin@apache.org Twitter: @rxin
  • 37. Behavior with Insufficient RAM 68.8 58.1 40.7 29.7 11.5 0 20 40 60 80 100 0% 25% 50% 75% 100% Iterationtime(s) Percent of working set in memory
  • 38. Example: PageRank 1. Start each page with a rank of 1 2. On each iteration, update each page’s rank to Σi∈neighbors ranki / |neighborsi| links = // RDD of (url, neighbors) pairs ranks = // RDD of (url, rank) pairs for (i <- 1 to ITERATIONS) { ranks = links.join(ranks).flatMap { (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) }.reduceByKey(_ + _) }

Editor's Notes

  1. FT = Fourier Transform
  2. NOT a modified versionof Hadoop
  3. Note that dataset is reused on each gradient computation
  4. Key idea: add “variables” to the “functions” in functional programming
  5. Key idea: add “variables” to the “functions” in functional programming
  6. 100 GB of data on 50 m1.xlarge EC2 machines
  7. Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join