2. Agenda
• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
– Berkeley Team
– BDAS (Berkeley Data Analytics Stack)
– RDDs (Resilient Distributed Datasets)
– Shark
– Spark Streaming
– Other Spark subsystems
Global Big Data Apr 23, 2013 technicaltidbit.com 2
3. What Hadoop Gives Us
• HDFS
• Map/Reduce
4. Hadoop: HDFS
Image from mark.chmarny.com
5. Hadoop: Map/Reduce
Image from blog.octo.com
Image from people.apache.org/~rdonkin
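As a rough illustration of the programming model in the diagrams, here is a word count expressed as a map phase, a shuffle, and a reduce phase over an in-memory Scala collection. It is a sketch of the idea only, not Hadoop's API: `MapReduceSketch`, its method names, and the sample input are invented for this example.

```scala
object MapReduceSketch {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle: bring all values for the same key together.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce phase: collapse each key's values to a single result.
  def reducePhase(key: String, values: Seq[Int]): (String, Int) =
    (key, values.sum)

  // Full job: map every line, shuffle by word, reduce each word's counts.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (k, vs) => reducePhase(k, vs) }
}
```

Calling `MapReduceSketch.wordCount(Seq("to be or", "not to be"))` counts each word across both lines. In Hadoop the three phases run on different machines, with HDFS holding the input and output.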
6. Map/Reduce Tools
Pig Script HiveQL Hbase App
Pig Hive
Hadoop
Linux
7. Hadoop Distribution Dogs in the Race
Hadoop Distribution Query Tool
Apache Drill
Stinger
8. Other Open Source Solutions
• Druid
• Spark
9. Not just caching, but streaming
• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming
10. Berkeley Team
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program
Image from Ion Stoica’s slides from Strata 2013 presentation
11. BDAS
(Berkeley Data Analytics Stack)
Bagel App Shark App Spark Streaming App
Bagel Shark Spark Streaming Spark App
Spark
Hadoop/HDFS
Mesos
Linux
12. RDDs
(Resilient Distributed Dataset)
Image from Matei Zaharia’s paper
13. RDDs: Laziness
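val lines = spark.textFile("hdfs://...")
// _.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")
val errors = lines.filter(_.startsWith("ERROR"))  // lazy
  .map(_.split('\t')(2))                          // lazy
  .filter(_.contains("foo"))                      // lazy
val cnt = errors.count                            // Action! This triggers the computation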
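The same deferral can be seen without a cluster. The sketch below uses a plain Scala `Iterator` as a stand-in for an RDD (an assumption for illustration; Spark itself is not involved, and `LazinessSketch` and the sample log lines are made up). The filter and map steps only describe work; nothing is evaluated until the final `size` call, which plays the role of `count`.

```scala
object LazinessSketch {
  // Hypothetical log lines: tab-separated, message in the third field.
  val logLines = Seq(
    "ERROR\t10:01\tfoo failed",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar failed",
    "ERROR\t10:04\tfoo timed out"
  )

  // Returns (lines examined before the action, count, lines examined after it).
  def run(): (Int, Int, Int) = {
    var examined = 0
    val errors = logLines.iterator
      .filter { line => examined += 1; line.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))
    val before = examined  // the pipeline is defined, but no line has been read
    val cnt = errors.size  // the "action": forces the whole pipeline to run
    (before, cnt, examined)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The first element of the result shows that no line was examined while the pipeline was being built; the work happens only when the action asks for a result.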
14. RDDs: Transformations vs. Actions
Transformations
• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, frac, seed)
• union(otherDataset)
• groupByKey[K,V]
• reduceByKey[K,V](func)
• join[K,V,W](otherDataset)
• cogroup[K,V,W1,W2](other1, other2)
• cartesian[U](otherDataset)
• sortByKey[K,V]
Actions
• reduce(func)
• collect()
• count()
• take(n)
• first()
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• foreach(func)
[K,V] in Scala denotes type parameters, the same as <K,V> templates in C++ or generics in Java
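For readers new to the key/value transformations and the `[K,V]` notation, the sketch below mimics what `groupByKey` and `reduceByKey` compute, using ordinary Scala collections rather than RDDs. The helper names mirror the Spark operations but are local stand-ins written for this example, and the sample pairs are made up.

```scala
object KeyValueSketch {
  // Stand-in for groupByKey: collect every value seen for each key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Stand-in for reduceByKey: combine each key's values pairwise with func.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => (k, vs.reduce(func)) }

  // Hypothetical (user, items bought) pairs.
  val purchases = Seq(("hercman", 3), ("shewolf", 1), ("hercman", 4))
}
```

For example, `KeyValueSketch.reduceByKey(KeyValueSketch.purchases)(_ + _)` sums the values per user. Unlike these eager helpers, the Spark versions are transformations, so they stay lazy until an action runs.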
15. Hive vs. Shark
Hive: HiveQL queries run against HDFS files
Shark: HiveQL queries run against HDFS files + RDDs
16. Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES
("shark.cache" = "true") AS SELECT * FROM wiki;
CREATE TABLE wiki_cached AS SELECT * FROM wiki;
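Either statement creates a table that is stored in the cluster’s memory using RDD.cache(): the first through the "shark.cache" table property, the second through the _cached suffix on the table name.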
17. Shark: Just a Shim
Shark
Images from Reynold Xin’s presentation
18. What about “Big Data”?
[Chart: Shark effectiveness plotted against data size, on a scale of KB, MB, GB, TB, PB]
19. Median Hadoop job input size
Image from Reynold Xin’s presentation
21. Spark Streaming: DStream
• “A series of small batches”
DStream:
RDD (2 sec): {"id": "hercman", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "buyGoods"}, {"id": "shewolf", "eventType": "error"}
RDD (2 sec): {"id": "shewolf", "eventType": "error"}
...
RDD (2 sec): {"id": "catlover", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "logOff"}
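To make "a series of small batches" concrete, the sketch below slices a timestamped event list into 2-second batches, each standing in for one RDD of the DStream. It is plain Scala, not Spark Streaming; the `Event` class, the timestamps, and `toBatches` are assumptions made for illustration.

```scala
object MicroBatchSketch {
  case class Event(timeMs: Long, id: String, eventType: String)

  // Assign each event to the batch interval its timestamp falls in and
  // return the batches in time order. Intervals with no events are skipped
  // here; this sketch does not model empty batches.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  // Hypothetical event feed mirroring the slide's example users.
  val sample = Seq(
    Event(100, "hercman", "buyGoods"),
    Event(900, "hercman", "buyGoods"),
    Event(1500, "shewolf", "error"),
    Event(2500, "shewolf", "error"),
    Event(4200, "catlover", "buyGoods"),
    Event(5100, "hercman", "logOff")
  )
}
```

With a batch interval of 2000 ms, `MicroBatchSketch.toBatches(MicroBatchSketch.sample, 2000L)` yields three batches, and every downstream operation is applied to one batch at a time.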
22. Spark Streaming: DAG
The DAG
Kafka (JSON) → DStream[String] → DStream.transform → DStream[EvObj], which feeds two branches:
– DStream.filter(_.eventType == "error") → DStream.foreach(println)
– DStream.filter(_.eventType == "buyGoods") → DStream.map((_.id,1)) → DStream.groupByKey → DStream.foreach(println)
23. Spark Streaming: Example Code
// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val messages = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)
// DAG
val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))
// An explicit lambda is needed here: (_.id, 1) would be read as a tuple containing a function
val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1))
  .groupByKey
usersBuying.foreach(rdd => println(rdd.count))
// Go
ssc.start
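What each branch of this DAG computes for a single batch can be checked on an ordinary collection. The sketch below is plain Scala, not Spark Streaming: `EvObj` is a hypothetical case class standing in for the slide's `evObj`, and the batch contents are made up.

```scala
object BatchDagSketch {
  case class EvObj(id: String, eventType: String)

  // Branch 1: how many error events are in the batch.
  def errorCount(batch: Seq[EvObj]): Int =
    batch.count(_.eventType == "error")

  // Branch 2: how many distinct users bought goods
  // (filter, key by id, group by key, count the groups).
  def usersBuying(batch: Seq[EvObj]): Int =
    batch.filter(_.eventType == "buyGoods")
      .map(e => (e.id, 1))
      .groupBy(_._1)
      .size

  // One hypothetical 2-second batch.
  val batch = Seq(
    EvObj("hercman", "buyGoods"),
    EvObj("hercman", "buyGoods"),
    EvObj("shewolf", "error")
  )
}
```

Because grouping by key collapses repeated purchases by the same user, the second branch counts users rather than purchase events.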
24. Stateful Spark Streaming
class ErrorsPerUser(var numErrors: Int = 0) extends Serializable
val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff"))
    None  // user logged off: drop this user's state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    values.foreach(e => e.eventType match {
      case "error" => s.numErrors += 1
      case _ =>
    })
    Option(s)
  }
}
// DAG
val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
// Keep logOff events alongside errors so that updateFunc can see them
val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = errorsAndLogOffs.map(e => (e.id, e))
  .updateStateByKey[ErrorsPerUser](updateFunc)
// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
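The contract of `updateStateByKey` is easier to see in a small simulation: for every key, the batch's new values and the previous state go in, and the new state (or `None` to forget the key) comes out. The sketch below is plain Scala under stated assumptions: `step` is a hypothetical stand-in for what Spark Streaming does per batch, the state is a simple error count rather than `ErrorsPerUser`, and a `logOff` event is assumed to clear a user's state.

```scala
object StatefulSketch {
  case class EvObj(id: String, eventType: String)

  // Update function: None drops the key, Some(n) keeps n errors so far.
  def update(values: Seq[EvObj], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else {
      val total = state.getOrElse(0) + values.count(_.eventType == "error")
      if (total > 0) Some(total) else None
    }

  // Apply the update function to every key that has old state or new values.
  def step(state: Map[String, Int], batch: Seq[EvObj]): Map[String, Int] = {
    val byKey = batch.groupBy(_.id)
    (state.keySet ++ byKey.keySet).toSeq.flatMap { id =>
      update(byKey.getOrElse(id, Seq.empty), state.get(id)).map(n => id -> n)
    }.toMap
  }

  // Thread the state through a sequence of batches, oldest first.
  def run(batches: Seq[Seq[EvObj]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(step)
}
```

Feeding `run` a few batches shows a user's error count accumulating across batches and disappearing once that user logs off, which is the behavior the slide's `updateFunc` is aiming for.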
25. Other Spark Subsystems
• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)
26. Teaser
• Future Meetup: Machine learning from real-time data streams