SlideShare a Scribd company logo
1 of 12
Download to read offline
to
From
Nathan Halko @nhalko
Data Scientist Hadoop day @ Data Summit
NYC May 2015
to
From
Trusty old friend
but
slobbers over
everything
Cute and shiny
but
would sell you out for
another bottle of wine
At
We turn social noise into
customer revenue
Ingest
Process
Deliver
From Hadoop MR job to Spark clone of job
…we can do better
Joins
Caching
Streaming
Counters
Laziness
class MyMapper {
override def
map(key, value, context) { ... }
}
class MyReducer {
override def
reduce(key, values, context) { ... }
}
class MyJob {
// bunch of setup
}
object MyJob {
def main(args) { //run the job }
}
def spark {
sc.textFile("input").map {
case (key, value) =>
/ / …
}
.groupByKey()
.map {
case (key, values) =>
/ / …
}
}
Serializability
Joins
Map Map
Table 2Table 1
Shuffle Process
Cassandra
Map
Shuffle
Reduce
Process
HDFS
Data 1 Data 2
Ca$$hing
Map
“data3”
Transform
Map
res1
Cassandra
HDFS
Map
res2
Broadcast variables
Named RDDs
Streaming
bin/hadoop jar hadoop-streaming.jar
-input myInputDirs
-output myOutputDir
-mapper /bin/cat
-reducer /bin/wc
On the fly job creation
Kafka queue
head —
offset —
Listening, always on processing
Counters
Laziness
Map
Reduce
Reduce
Reduce
Map
Map
Map
Map
1. Set number of cores and heap per job
2. Internal map tasks reuse JVM
3. RDDs feel like Scala collections
4. I can read the source!! (Storm)
5. GraphX
1. Static cluster wide settings
2. JVM startup time costly
3. Reducer values iterator… yuck
4. Time for a beer, or six
5. Giraph
Poor setup docs, mostly high level dev
Serializability, runtime
We at least know how to do this
Most any code can fit in a mapper
Thank you!
spotright.com
nathan@spotright.com
Nathan Halko @nhalko
Data Scientist

More Related Content

What's hot

HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rullimfrancis
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...newmooxx
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixJeff Magnusson
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinityShashwat Shriparv
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
WaJUG - Introduction to data streaming
WaJUG - Introduction to data streamingWaJUG - Introduction to data streaming
WaJUG - Introduction to data streamingNicolas Fränkel
 
BruJUG - Introduction to data streaming
BruJUG - Introduction to data streamingBruJUG - Introduction to data streaming
BruJUG - Introduction to data streamingNicolas Fränkel
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...Matt Stubbs
 

What's hot (20)

HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rulli
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
EMR AWS Demo
EMR AWS DemoEMR AWS Demo
EMR AWS Demo
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Traffichelper demo
Traffichelper demoTraffichelper demo
Traffichelper demo
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
WaJUG - Introduction to data streaming
WaJUG - Introduction to data streamingWaJUG - Introduction to data streaming
WaJUG - Introduction to data streaming
 
BruJUG - Introduction to data streaming
BruJUG - Introduction to data streamingBruJUG - Introduction to data streaming
BruJUG - Introduction to data streaming
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 

Similar to slides_nyc_2015

Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Lightbend
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Brian O'Neill
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache SparkMarcoYuriFujiiMelo
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Turi, Inc.
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelDean Wampler
 
Analyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_CassandraAnalyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_CassandraRich Beaudoin
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...DataStax Academy
 
Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Evan Casey
 

Similar to slides_nyc_2015 (20)

Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Apache Hadoop: DFS and Map Reduce
Apache Hadoop: DFS and Map ReduceApache Hadoop: DFS and Map Reduce
Apache Hadoop: DFS and Map Reduce
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Analyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_CassandraAnalyzing_Data_with_Spark_and_Cassandra
Analyzing_Data_with_Spark_and_Cassandra
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Hadoop
HadoopHadoop
Hadoop
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
 
Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters
 

slides_nyc_2015