slides_nyc_2015

From Hadoop to Spark
Nathan Halko @nhalko
Data Scientist
Hadoop day @ Data Summit
NYC May 2015

From Hadoop
Trusty old friend, but slobbers over everything
To Spark
Cute and shiny, but would sell you out for another bottle of wine

At SpotRight
We turn social noise into customer revenue
Ingest
Process
Deliver

From Hadoop MR job to Spark clone of job
…we can do better
Joins
Caching
Streaming
Counters
Laziness

class MyMapper {
  override def map(key, value, context) { ... }
}
class MyReducer {
  override def reduce(key, values, context) { ... }
}
class MyJob {
  // bunch of setup
}
object MyJob {
  def main(args) {
    // run the job
  }
}
def spark(): Unit = {
  sc.textFile("input")
    .map { line =>
      // … (textFile yields plain lines, so this step must emit (key, value) pairs)
    }
    .groupByKey()
    .map {
      case (key, values) =>
        // …
    }
}
Serializability
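For comparison, a runnable version of the Spark side might look like the sketch below. It assumes Spark is on the classpath, a local master, and tab-separated "key<TAB>number" input lines; the names SparkCloneSketch, parse, and run are hypothetical and not from the talk. Keeping the logic in a top-level object, rather than in fields of a non-serializable class, is one common way to avoid the serializability errors noted above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkCloneSketch {
  // Assumed input format: "key<TAB>integer". Pure function, so it can be tested alone.
  def parse(line: String): (String, Int) = {
    val Array(key, value) = line.split("\t", 2)
    (key, value.toInt)
  }

  // Mapper, shuffle, and reducer collapsed into one chain of RDD operations.
  def run(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .map(parse)                                      // the "mapper"
      .groupByKey()                                    // the shuffle
      .map { case (key, values) => (key, values.sum) } // the "reducer"
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("spark-clone-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc, Seq("a\t1", "b\t2", "a\t3")))
    finally sc.stop()
  }
}
```

For a plain sum, reduceByKey would avoid materializing every value per key; groupByKey is kept here only to mirror the slide.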

Joins
[Diagram: two join pipelines. Labels: Map, Shuffle, Reduce, Process; Table 1, Table 2; Data 1, Data 2; Cassandra; HDFS]
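As a rough illustration of the Spark side of a join: pair RDDs expose a join method, so two keyed datasets can be combined without hand-written tagging and matching code. The sketch below assumes Spark on the classpath and a local master, and uses small in-memory collections in place of the Cassandra and HDFS sources in the diagram; JoinSketch and joinById are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  // Inner join on the key: only ids present on both sides survive.
  def joinById(sc: SparkContext,
               names: Seq[(Int, String)],
               cities: Seq[(Int, String)]): Map[Int, (String, String)] = {
    val left = sc.parallelize(names)
    val right = sc.parallelize(cities)
    left.join(right).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in data; ids 2 and 3 have no partner and are dropped by the inner join.
      val joined = joinById(sc, Seq(1 -> "alice", 2 -> "bob"), Seq(1 -> "nyc", 3 -> "sf"))
      println(joined)
    } finally sc.stop()
  }
}
```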

Ca$$hing
[Diagram labels: Map, Transform, “data3”, res1, res2, Cassandra, HDFS]
Broadcast variables
Named RDDs
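A minimal sketch of caching and broadcast variables, assuming Spark on the classpath and a local master. The data, the weights table, and the names CachingSketch and run are invented for illustration. Named RDDs are not shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def run(sc: SparkContext): (Long, Int) = {
    // Small read-only lookup table, shipped to executors as a broadcast variable.
    val weights = sc.broadcast(Map("a" -> 10, "b" -> 20))

    // Intermediate result shared by two downstream computations.
    val scored = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(k => (k, weights.value.getOrElse(k, 0)))
      .cache() // marks the RDD for in-memory reuse; nothing is computed yet

    val known = scored.filter(_._2 > 0).count() // first action computes the RDD
    val total = scored.map(_._2).reduce(_ + _)  // second action can reuse cached partitions
    (known, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```

Without the cache() call the result is the same, but each action would recompute the map step from the source.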

Streaming
bin/hadoop jar hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc
On the fly job creation
[Diagram: Kafka queue with head and offset markers]
Listening, always on processing

Laziness
[Diagram: graph of Map and Reduce stages]
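One way to see laziness in practice: transformations such as map only describe work, and nothing runs until an action such as reduce asks for a result. The sketch below assumes Spark 2.0 or later on the classpath (for longAccumulator) and a local master; LazinessSketch and run are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazinessSketch {
  // Returns (map calls before the action, the sum, map calls after the action).
  def run(sc: SparkContext): (Long, Int, Long) = {
    val calls = sc.longAccumulator("map calls")

    // Transformation only: this builds a plan, it does not touch the data.
    val doubled = sc.parallelize(1 to 5).map { x => calls.add(1); x * 2 }
    val before = calls.value.longValue()

    // Action: this is the point where the map function actually runs.
    val sum = doubled.reduce(_ + _)
    val after = calls.value.longValue()

    (before, sum, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("laziness-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```

Accumulators updated inside transformations can be applied more than once if a task is retried or an RDD is recomputed, so the counter here is a demonstration aid rather than a reliable metric.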

Spark:
1. Set number of cores and heap per job
2. Internal map tasks reuse JVM
3. RDDs feel like Scala collections
4. I can read the source!! (Storm)
5. GraphX
Hadoop:
1. Static cluster-wide settings
2. JVM startup time costly
3. Reducer values iterator… yuck
4. Time for a beer, or six
5. Giraph
Spark:
Poor setup docs, mostly high-level dev
Serializability, runtime
Hadoop:
We at least know how to do this
Most any code can fit in a mapper
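To illustrate the first Spark point, per-job cores and heap can be set on the job's own SparkConf. This is a minimal sketch assuming Spark on the classpath; PerJobSettings and its parameters are hypothetical, and spark.cores.max applies to standalone and Mesos coarse-grained deployments rather than to every cluster manager.

```scala
import org.apache.spark.SparkConf

object PerJobSettings {
  // Builds a configuration whose resource limits belong to this job only.
  def conf(appName: String, executorMemory: String, maxCores: Int): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.executor.memory", executorMemory) // heap per executor, e.g. "2g"
      .set("spark.cores.max", maxCores.toString)    // total cores this job may claim

  def main(args: Array[String]): Unit = {
    val c = conf("nightly-etl", "2g", 4)
    println(c.toDebugString)
  }
}
```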

Thank you!
spotright.com
nathan@spotright.com
Nathan Halko @nhalko
Data Scientist
