Spark 2013-04-17

The Spark Ecosystem

Michael Malak

technicaltidbit.com

Agenda
• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
– Berkeley Team
– BDAS (Berkeley Data Analytics Stack)
– RDDs (Resilient Distributed Datasets)
– Shark
– Spark Streaming
– Other Spark subsystems
Global Big Data Apr 23, 2013 technicaltidbit.com

What Hadoop Gives Us
• HDFS
• Map/Reduce


Hadoop: HDFS

Image from mark.chmarny.com


Hadoop: Map/Reduce

Image from blog.octo.com

Image from people.apache.org/~rdonkin


Map/Reduce Tools

Pig Script HiveQL Hbase App

Pig Hive

Hadoop

Linux


Hadoop Distribution Dogs in the Race
Hadoop Distribution Query Tool

Apache Drill

Stinger


Other Open Source Solutions
• Druid
• Spark


Not just caching, but streaming
• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming


Berkeley Team
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from Strata 2013 presentation

BDAS
(Berkeley Data Analytics Stack)
Bagel App   Shark App   Spark Streaming App

Bagel   Shark   Spark Streaming   Spark App

Spark

Hadoop/HDFS

Mesos

Linux


RDDs
(Resilient Distributed Dataset)

Image from Matei Zaharia’s paper


RDDs: Laziness
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))   // same as x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))           // third tab-separated field
                  .filter(_.contains("foo"))       // all lazy up to here
val cnt = errors.count                             // Action! Triggers the computation
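The same lazy-then-action behavior can be imitated without a cluster. The sketch below is not Spark: it assumes a plain Scala Iterator is a fair stand-in for an RDD pipeline, and the log lines, the `inspected` counter and the `pipeline` helper are all made up for illustration.

```scala
object LazinessSketch {
  // Made-up stand-in for the HDFS file: tab-separated log lines.
  val lines: Seq[String] = Seq(
    "ERROR\t2013-04-17\tfoo failed",
    "INFO\t2013-04-17\tall good",
    "ERROR\t2013-04-17\tbar failed",
    "ERROR\t2013-04-18\tfoo timed out"
  )

  // Counts how many lines the first filter has actually looked at.
  var inspected = 0

  // Builds the pipeline only. Iterator.filter and Iterator.map are lazy,
  // so no line is inspected here.
  def pipeline(): Iterator[String] =
    lines.iterator
      .filter { l => inspected += 1; l.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))

  def main(args: Array[String]): Unit = {
    val errors = pipeline()
    println(s"inspected before the action: $inspected")
    val cnt = errors.size // the "action": consumes the iterator
    println(s"count = $cnt, inspected after the action: $inspected")
  }
}
```

Unlike an RDD, an Iterator can be consumed only once, so the analogy covers laziness but not recomputation or caching.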


RDDs: Transformations vs. Actions
Transformations                          Actions
map(func)                                reduce(func)
filter(func)                             collect()
flatMap(func)                            count()
sample(withReplacement, frac, seed)      take(n)
union(otherDataset)                      first()
groupByKey[K,V]()                        saveAsTextFile(path)
reduceByKey[K,V](func)                   saveAsSequenceFile(path)
join[K,V,W](otherDataset)                foreach(func)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

[K,V] in Scala is the same as <K,V> in C++ templates or Java generics
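To make the by-key transformations concrete, here is a rough sketch of what groupByKey and reduceByKey compute, written against ordinary Scala collections rather than RDDs. The helper names mirror the list above, but their signatures and the sample pairs are assumptions for illustration, not the Spark API.

```scala
object PairOpsSketch {
  // Made-up (user id, count) pairs.
  val pairs: Seq[(String, Int)] = Seq(
    ("hercman", 1), ("shewolf", 1), ("hercman", 1), ("catlover", 1)
  )

  // groupByKey analogue: collect all values that share a key.
  def groupByKey[K, V](kvs: Seq[(K, V)]): Map[K, Seq[V]] =
    kvs.groupBy(_._1).map { case (k, group) => k -> group.map(_._2) }

  // reduceByKey analogue: combine the values of each key with f.
  def reduceByKey[K, V](kvs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    groupByKey(kvs).map { case (k, vs) => k -> vs.reduce(f) }

  def main(args: Array[String]): Unit = {
    println(groupByKey(pairs))
    println(reduceByKey(pairs)(_ + _))
  }
}
```

These helpers evaluate eagerly on one machine. In Spark the corresponding calls are transformations, so nothing runs until an action such as count() or collect() is invoked.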


Hive vs. Shark

Hive: HiveQL queries over HDFS files
Shark: HiveQL queries over HDFS files + RDDs


Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES
("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Either statement creates a table that is stored in the cluster’s
memory using RDD.cache().


Shark: Just a Shim

Shark

Images from Reynold Xin’s presentation


What about “Big Data”?

[Figure: Shark effectiveness versus data size (KB, MB, GB, TB, PB), with the median Hadoop job input size marked]

Image from Reynold Xin’s presentation


Spark Streaming: Motivation

x1,000,000 clients
HDFS


Spark Streaming: DStream
• “A series of small batches”

DStream:
RDD (2 sec): {"id": "hercman", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "buyGoods"}, {"id": "shewolf", "eventType": "error"}
RDD (2 sec): {"id": "shewolf", "eventType": "error"}
...
RDD (2 sec): {"id": "catlover", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "logOff"}
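The "series of small batches" idea can be sketched in plain Scala by cutting a timestamped event list into fixed 2-second windows. This is only an illustration of the batching concept: the `Event` class, the `toBatches` helper and the arrival times are hypothetical, and no Spark Streaming API is used.

```scala
object MicroBatchSketch {
  // Hypothetical event record; timeMs is the arrival time in milliseconds.
  case class Event(timeMs: Long, id: String, eventType: String)

  // Made-up arrival times for the events shown on the slide.
  val events: Seq[Event] = Seq(
    Event(100, "hercman", "buyGoods"),
    Event(900, "hercman", "buyGoods"),
    Event(1500, "shewolf", "error"),
    Event(2500, "shewolf", "error"),
    Event(4100, "catlover", "buyGoods"),
    Event(5900, "hercman", "logOff")
  )

  // Cut the stream into consecutive windows of batchMs milliseconds.
  // Each window plays the role of one RDD in the DStream.
  // Windows that contain no events are skipped in this sketch.
  def toBatches(in: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    in.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  def main(args: Array[String]): Unit =
    toBatches(events, 2000L).zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.map(e => e.id + "/" + e.eventType).mkString(", ")}")
    }
}
```

With a 2000 ms window, the six sample events fall into three batches of sizes 3, 1 and 2, matching the three RDDs in the diagram.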


Spark Streaming: DAG
The DAG:

Kafka -> DStream[String] (JSON) -> .transform -> DStream[EvObj]

DStream[EvObj] -> .filter(_.eventType == "error") -> .foreach(println)

DStream[EvObj] -> .filter(_.eventType == "buyGoods") -> .map(e => (e.id, 1)) -> .groupByKey -> .foreach(println)

Spark Streaming: Example Code
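// Initialize: 2-second batches, reading strings from Kafka
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON message into an evObj
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: print the number of error events in each batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: print the number of distinct users buying in each batch
// (e => (e.id, 1) is needed here; (_.id, 1) would build a tuple holding a function)
val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start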


Stateful Spark Streaming
class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

// Called once per user id and batch: values holds that user's new events,
// state holds the running total (None if the user has no state yet)
val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff"))
    None                                   // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    s.numErrors += values.count(_.eventType == "error")
    Option(s)
  }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
// Key the unfiltered events by user id so updateFunc sees both "error" and "logOff"
val states = events.map(e => (e.id, e))
                   .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" +
                              rdd.filter(_._2.numErrors > 0).count))
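The per-key state update can be traced by hand with a small plain-Scala simulation. It assumes the intent is that a user's error count accumulates across batches and is dropped when a logOff event arrives. The `Event` class and the `update`, `step` and `run` helpers are hypothetical, the state is simplified to an Int, and no Spark API is involved.

```scala
object StateSketch {
  // Hypothetical event record.
  case class Event(id: String, eventType: String)

  // New events for one user plus the previous state give the next state.
  // Returning None removes the user from the state map.
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Apply one batch. In this sketch, update is called for every user that
  // either already has state or has events in the batch.
  def step(state: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    val keys = state.keySet ++ byUser.keySet
    keys.flatMap { k =>
      update(byUser.getOrElse(k, Seq.empty), state.get(k)).map(n => k -> n)
    }.toMap
  }

  // Fold all batches, starting from an empty state.
  def run(batches: Seq[Seq[Event]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(step)

  // The three batches from the DStream slide.
  val batches: Seq[Seq[Event]] = Seq(
    Seq(Event("hercman", "buyGoods"), Event("hercman", "buyGoods"), Event("shewolf", "error")),
    Seq(Event("shewolf", "error")),
    Seq(Event("catlover", "buyGoods"), Event("hercman", "logOff"))
  )

  def main(args: Array[String]): Unit = println(run(batches))
}
```

After the third batch hercman has logged off and is gone from the map, shewolf carries 2 errors, and catlover is present with 0.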


Other Spark Subsystems
• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)


Teaser
• Future Meetup: Machine learning from real-time data streams

