Apache Spark: the next big thing? - StampedeCon 2014
Steven Borrelli
It’s been called the leading candidate to replace Hadoop MapReduce. Apache Spark uses fast in-memory processing and a simpler programming model to speed up analytics and has become one of the hottest technologies in Big Data.
In this talk we’ll discuss:
What is Apache Spark and what is it good for?
Spark’s Resilient Distributed Datasets
Spark integration with Hadoop, Hive and other tools
Real-time processing using Spark Streaming
The Spark shell and API
Machine Learning and Graph processing on Spark
1. APACHE SPARK
STAMPEDECON 2014
STEVEN BORRELLI
@stevendborrelli
ASTERIS
2. ABOUT ME
FOUNDER, ASTERIS (JAN 2014)
ORGANIZER OF STL MACHINE LEARNING AND DOCKER STL
SYSTEMS ENGINEERING, HPC, BIG DATA, & CLOUD
NEXT GENERATION INFRASTRUCTURE FOR DEVELOPERS
3. SPARK IN FIVE SECONDS
Spark is a replacement for Hadoop MapReduce.
5. MAPREDUCE IS AWESOME!
Allows us to process enormous amounts of data in parallel
6. MAPREDUCE
MapReduce: Simplified Data Processing on Large Clusters (2004)
Jeffrey Dean and Sanjay Ghemawat
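The Dean & Ghemawat paper uses word count as its canonical example. As a hedged, single-machine sketch of the two phases (plain Scala collections standing in for a cluster; `mapPhase` and `reducePhase` are illustrative names, not Hadoop APIs):

```scala
// Word count expressed as the two MapReduce phases, using plain
// Scala collections in place of a distributed cluster.

// Map phase: emit a (word, 1) pair for every word in every line.
def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
  lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => (w.toLowerCase, 1))

// Shuffle + reduce phase: group the pairs by key, then sum each group.
def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

val counts = reducePhase(mapPhase(Seq("to be or not to be")))
```

In a real MapReduce job the group-by step is the distributed shuffle, and each reduce group runs on a separate node.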
8. THE PROBLEMS WITH MAPREDUCE
API: Low-Level & Complex
9. MAPREDUCE ISSUES
• Latency
• Execution time impacted by “stragglers”
• Lack of in-memory caching
• Intermediate steps persisted to disk
• No shared state
10. THE PROBLEMS WITH MAPREDUCE
Not optimal for:
MACHINE LEARNING, GRAPHS, STREAM PROCESSING
11. IMPROVING MAPREDUCE
APACHE TEZ
12. NEXT MAPREDUCE: GOALS
• Generalize to different workloads
• Sub-second latency
• Scalable and fault tolerant
• Easy-to-use API
13. TOP SPARK FEATURES
• Fast, fault-tolerant in-memory data structures (RDD)
• Compatibility with Hadoop ecosystem
• Rich, easy-to-use API supports Machine Learning,
Graphs and Streaming
• Interactive Shell
15. SPARK STACK
Integrated platform for disparate workloads
16. RESILIENT DISTRIBUTED DATASET
• Immutable in-memory collections
• Fast recovery on failure
• Control caching and persistence to memory/disk
• Can partition to avoid shuffles
17. RDD LINEAGE
lines = spark.textFile("hdfs://errors/...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
18. LANGUAGE SUPPORT
• Spark is written in Scala
• Uses Scala collections & Akka Actors
• Java, Python native support (Python support can lag),
lambda support in Java8/Spark 1.0
• R Bindings through SparkR
• Functional programming paradigm
19. RDD TRANSFORMATIONS
Transformations create a new RDD
map
filter
flatMap
sample
union
distinct
groupByKey
reduceByKey
sortByKey
join
cogroup
cartesian
Transformations are evaluated lazily.
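The lazy behavior above can be mimicked on a single machine with Scala's collection views; a sketch of the concept, not Spark itself:

```scala
// Lazy evaluation sketch: like RDD transformations, a Scala view
// only records the map/filter steps; no element is computed yet.
var evaluations = 0
val pipeline = (1 to 10).view
  .map { n => evaluations += 1; n * 2 }  // "transformation": not run yet
  .filter(_ > 10)                        // also lazy

val before = evaluations   // still 0: nothing has been computed
val result = pipeline.toList  // forcing the view runs the whole pipeline
val after  = evaluations   // now 10: each element was mapped exactly once
```

In Spark the same deferral is what lets the scheduler fuse several transformations into a single pass over the data.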
20. RDD ACTIONS
Actions return a value
reduce
collect
count
countByKey
countByValue
countApprox
foreach
saveAsSequenceFile
saveAsTextFile
first
take(n)
takeSample
toArray
Invoking an Action will cause all previous Transformations to
be evaluated.
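Several of the actions listed above have direct analogues on local Scala collections; this sketch (plain Scala, not the Spark API) shows the kind of value each one returns:

```scala
// Local-collection analogues of a few Spark actions.
val nums = List(3, 1, 4, 1, 5, 9)

val total    = nums.reduce(_ + _)                              // reduce: fold all elements into one value
val n        = nums.size                                       // count: number of elements
val byValue  = nums.groupBy(identity).map { case (v, vs) => (v, vs.size) } // countByValue
val firstTwo = nums.take(2)                                    // take(2): first n elements
```

Unlike the local versions, each Spark action ships the accumulated transformation graph to the cluster and only then materializes a result on the driver.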
21. TASK SCHEDULER
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-1-amp-camp-2012-spark-intro.pdf
• Runs general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
23. SPARK STREAMING
• Micro-Batch: Discretized Stream (DStream)
• ~1 sec latency
• Fault tolerant
• Shares much of the same code as batch
24. TOP 10 HASHTAGS IN LAST 10 MIN
// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)
// Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(status => getTags(status))
  .countByValueAndWindow(Minutes(10), Seconds(1))
// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
  .transform(_.sortByKey(false))
// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
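The windowed counting above can be sketched without Spark by modelling each micro-batch as a plain Scala list (`getTags` and `topTagsPerWindow` are illustrative helpers for this sketch, not the Spark Streaming API):

```scala
// Micro-batch sketch: each inner Seq is one batch of tweets; a
// sliding window of `windowSize` batches is re-counted for every
// batch, mimicking countByValueAndWindow + sort.
def getTags(tweet: String): Seq[String] =
  tweet.split("\\s+").filter(_.startsWith("#")).toSeq

def topTagsPerWindow(batches: Seq[Seq[String]], windowSize: Int): Seq[Seq[(String, Int)]] =
  batches.indices.map { i =>
    val window = batches.slice(math.max(0, i - windowSize + 1), i + 1)
    window.flatten.flatMap(getTags)
      .groupBy(identity).map { case (tag, ts) => (tag, ts.size) }
      .toSeq.sortBy(-_._2)          // most frequent tag first
  }

val out = topTagsPerWindow(Seq(Seq("hi #spark"), Seq("#spark and #scala")), 2)
```

DStreams do essentially this, except the window is defined in time rather than batch count and the counts are updated incrementally instead of recomputed.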
26. • 10x+ speedup after data is cached
• In-memory materialized views
• Supports HiveQL, UDFs, etc.
• New Catalyst SQL engine coming in 1.0 includes SchemaRDD to mix & match RDD/SQL in code.
27. • Implementation of PowerGraph, Pregel on Spark
• 0.5x the speed of GraphLab, but more fault-tolerant
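The Pregel model mentioned above processes a graph in synchronized supersteps of message passing between vertices. A minimal plain-Scala sketch of that idea (not the GraphX API), computing hop distances from a source vertex:

```scala
// Pregel-style supersteps: in each round, every "active" vertex
// sends (neighbor, distance + 1) messages; vertices that receive a
// message update their value and become active for the next round.
def bfsDistances(edges: Map[Int, Seq[Int]], source: Int): Map[Int, Int] = {
  var dist = Map(source -> 0)
  var frontier = Set(source)                 // vertices updated last superstep
  while (frontier.nonEmpty) {
    val messages = for {
      v <- frontier.toSeq
      n <- edges.getOrElse(v, Seq.empty)
      if !dist.contains(n)                   // only message unvisited vertices
    } yield (n, dist(v) + 1)
    // Each vertex keeps the smallest distance it was sent.
    val updates = messages.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2).min) }
    dist = dist ++ updates
    frontier = updates.keySet
  }
  dist
}

val d = bfsDistances(Map(1 -> Seq(2, 3), 2 -> Seq(4)), 1)
```

The computation halts when no messages are produced, which is exactly Pregel's termination condition.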
28. MLLIB
• Machine Learning library, part of Spark core.
• Uses jblas & gfortran. Python supports NumPy.
• Growing number of algorithms: SVM, ALS, Naive Bayes, K-Means, Linear & Logistic Regression. (SVD/PCA, CART, L-BFGS coming in 1.x)
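To make one of the listed algorithms concrete, here is a minimal K-Means sketch on 1-D points in plain Scala (an illustration of the algorithm only, not the MLlib API):

```scala
// K-Means in one dimension: alternate between assigning each point
// to its nearest center and moving each center to its cluster mean.
def kmeans(points: Seq[Double], centers: Seq[Double], iters: Int): Seq[Double] =
  if (iters == 0) centers
  else {
    // Assignment step: attach every point to the closest center.
    val clusters = points.groupBy(p => centers.minBy(c => math.abs(c - p)))
    // Update step: each center moves to the mean of its cluster
    // (a center with no points keeps its position).
    val updated = centers.map(c =>
      clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    kmeans(points, updated, iters - 1)
  }

val centers = kmeans(Seq(1.0, 1.2, 0.8, 9.0, 9.5, 8.5), Seq(0.0, 10.0), 10)
```

MLlib's version distributes the assignment step across partitions and aggregates the per-cluster sums on the driver, but the two alternating steps are the same.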
29. MLLIB +
• MLI: Higher-level library to support Tables (dataframes), Linear Algebra, Optimizers.
• MLI: alpha software, limited activity
• Can use Scikit-Learn or SparkR to run models on Spark.
31. COMMUNITY
[Bar charts comparing patches (0–250), lines added (0–50,000), and lines removed (0–17,500) for MapReduce, Storm, Yarn, and Spark.]
32. SPARK MOMENTUM
• 1.0 is imminent (in RC testing right now)
• Databricks raised a $14M investment from Andreessen Horowitz
• Partnerships with DataStax, Cloudera, MapR, PivotalHD