SlideShare a Scribd company logo
1 of 28
Iterative Spark Development at
Bloomberg
Nimbus Goehausen
Spark Platform Engineer
ngoehausen@bloomberg.net
Copyright 2016 Bloomberg L.P. All rights reserved.
A Data Pipeline
source outputProcess
output = Process(source)
Spark development cycle
Make code
change
Deploy jar
(1m)
Start spark
submit /
spark shell /
notebook
(1m)
Wait for
execution
(1m – several
hours)
Observe
results
This should work, right?
source outputProcess
output = Process(source)
What have I done to deserve this?
source outputProcess
java.lang.HotGarbageException: You messed up
at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
at another.library.InputStream $$somuchfun$mcV$sp.read(InputStream.scala:42)
at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)
…
Split it up
source outputStage2Stage1
Stage1
Output
Main1 Main2
stage1Out = stage1(source)
stage1Out.save(“stage1Path”)
stage1Out = load(“stage1Path”)
output = stage2(stage1Out)
Automatic checkpoint
source outputStage2Stage1
Stage1
Output
stage1Out = stage1(source).checkpoint()
Output = stage2(stage1Out)
First run
source outputStage2Stage1
Stage1
Output
stage1Out = stage1(source).checkpoint()
Output = stage2(stage1Out)
Second run
source outputStage2Stage1
Stage1
Output
stage1Out = stage1(source).checkpoint()
Output = stage2(stage1Out)
Change logic
source outputStage2
modified
Stage1
modified
Stage1
Output
stage1Out = stage1(source).checkpoint()
Output = stage2(stage1Out)
Stage1
Output
How do we do this?
• Need some way of automatically generating a
path to save to or load from.
• Should stay the same when we want to reload
the same data.
• Should change when we want to generate
new data.
Refresher: what defines an RDD
Dependencies
Logic
Result
Refresher: what defines an RDD
Dependencies
Logic
Result
[1,2,3,4,5…]
[2,4,6,8,10…]
map(x => x*2)
A logical signature
result = logic(dependencies)
signature(result) = hash(signature(dependencies) + hash(logic))
A logical signature
val rdd2 = rdd1.map(f)
signature(rdd2) == hash(signature(rdd1) + hash(f))
val sourceRDD = sc.textFile(uri)
signature(sourceRDD) == hash(“textFile:” + uri)
Wait, how do you hash a function?
val f = (x: Int) => x * 2
f compiles to a class file of jvm bytecode
Hash those bytes.
Change f to
val f = (x: Int) => x * 3
And the class file will change.
What if the function references other
functions or static methods?
• f = (x: Int) => Values.v + staticMethod(x) + g(x)
• This is the hard part.
• Must follow dependencies recursively, identify
all classfiles that the function depends on.
• Make a combined hash of the classfiles and
any runtime values.
Putting this together
source outputStage2Stage1
Stage1
Output
stage1Out = stage1(source).checkpoint()
Output = stage2(stage1Out)
How to switch in checkpoints
• One option: change spark.
• Other option: build on top of spark.
Distributed Collection (DC)
• Mostly same api as RDD, maps, flatmaps and
so on.
• Can be instantiated without a spark context.
• Pass in a spark context to get the
corresponding RDD.
• Automatically computes logical signature.
RDD vs DC
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5)
val doubled: DC[Int] = filtered.map(_ * 2)
val doubledRDD: RDD[Int] = doubled.getRDD(sc)
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val filtered: RDD[Int] = numbers.filter(_ < 5)
val doubled: RDD[Int] = filtered.map(_ * 2)
Easy checkpoint
val numbers: DC[Int] = parallelize(1 to 10)
val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
val doubled: DC[Int] = filtered.map(_ * 2)
Problem: Non lazy definition
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val doubles: RDD[Double] = numbers.map(x => x.toDouble)
val sum: Double = doubles.sum()
val normalized: RDD[Double] = doubles.map(x => x / sum)
We can’t get a signature of `normalized` without computing the sum!
Solution: Deferred Result
• Defined by a function on an RDD
• Like a DC but contains only one element
• Pass in a spark context to compute the result
• Also has a logical signature
• We can now express entirely lazy pipelines
Deferred Result example
val numbers: DC[Int] = parallelize(1 to 10)
val doubles: DC[Double] = numbers.map(x => x.toDouble)
val sum: DR[Double] = doubles.sum()
val normalized: DC[Double] = doubles.withResult(sum).map{case (x, s) => x / s}
We can compute a signature for `normalized` without computing a sum.
Improving the workflow
• Avoid adding complexity in order to speed things up.
• Get the benefits of avoiding redundant computation, but
avoid the costs of managing intermediate results.
• Further automatic behaviors to explore, such as automatic
failure traps, and generation of common statistics.
Can I use this now?
• Not yet, come talk to me if you’re interested.
• Open source release to come.
Questions?
Nimbus Goehausen
ngoehausen@bloomberg.net
@nimbusgo on Twitter / Github
techatbloomberg.com

More Related Content

What's hot

The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)Mathieu Bastian
 
Scala introduction
Scala introductionScala introduction
Scala introductionvito jeng
 
Aaron Bedra - Effective Software Security Teams
Aaron Bedra - Effective Software Security TeamsAaron Bedra - Effective Software Security Teams
Aaron Bedra - Effective Software Security Teamscentralohioissa
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applicationsKexin Xie
 
Why you should be using structured logs
Why you should be using structured logsWhy you should be using structured logs
Why you should be using structured logsStefan Krawczyk
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
 
RMySQL Tutorial For Beginners
RMySQL Tutorial For BeginnersRMySQL Tutorial For Beginners
RMySQL Tutorial For BeginnersRsquared Academy
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyMartin Zapletal
 
Javascript Arrays
Javascript ArraysJavascript Arrays
Javascript Arraysshaheenakv
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
What You Need to Know about Lambdas
What You Need to Know about LambdasWhat You Need to Know about Lambdas
What You Need to Know about LambdasRyan Knight
 
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark ApplicationsFuture Processing
 
Out ofmemoryerror what is the cost of java objects
Out ofmemoryerror  what is the cost of java objectsOut ofmemoryerror  what is the cost of java objects
Out ofmemoryerror what is the cost of java objectsJean-Philippe BEMPEL
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxSumant Tambe
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...CloudxLab
 
Time Series Analysis for Network Secruity
Time Series Analysis for Network SecruityTime Series Analysis for Network Secruity
Time Series Analysis for Network Secruitymrphilroth
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In RRsquared Academy
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 

What's hot (20)

The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
 
Scala introduction
Scala introductionScala introduction
Scala introduction
 
Aaron Bedra - Effective Software Security Teams
Aaron Bedra - Effective Software Security TeamsAaron Bedra - Effective Software Security Teams
Aaron Bedra - Effective Software Security Teams
 
Distributed computing with spark
Distributed computing with sparkDistributed computing with spark
Distributed computing with spark
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
Why you should be using structured logs
Why you should be using structured logsWhy you should be using structured logs
Why you should be using structured logs
 
D3.js workshop
D3.js workshopD3.js workshop
D3.js workshop
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
RMySQL Tutorial For Beginners
RMySQL Tutorial For BeginnersRMySQL Tutorial For Beginners
RMySQL Tutorial For Beginners
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
 
Javascript Arrays
Javascript ArraysJavascript Arrays
Javascript Arrays
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
What You Need to Know about Lambdas
What You Need to Know about LambdasWhat You Need to Know about Lambdas
What You Need to Know about Lambdas
 
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
 
Out ofmemoryerror what is the cost of java objects
Out ofmemoryerror  what is the cost of java objectsOut ofmemoryerror  what is the cost of java objects
Out ofmemoryerror what is the cost of java objects
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and Rx
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Time Series Analysis for Network Secruity
Time Series Analysis for Network SecruityTime Series Analysis for Network Secruity
Time Series Analysis for Network Secruity
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In R
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 

Viewers also liked

Big Data Day LA 2016 Keynote - Andy Feng/ Yahoo
Big Data Day LA 2016 Keynote - Andy Feng/ YahooBig Data Day LA 2016 Keynote - Andy Feng/ Yahoo
Big Data Day LA 2016 Keynote - Andy Feng/ YahooData Con LA
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...
Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...
Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...Data Con LA
 
Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...
Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...
Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...Data Con LA
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Data Con LA
 
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...Data Con LA
 
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...Data Con LA
 
Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...
Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...
Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...Data Con LA
 
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Data Science + Hollywood, Todd Ho...
Big Data Day LA 2016/ Data Science Track -  Data Science + Hollywood, Todd Ho...Big Data Day LA 2016/ Data Science Track -  Data Science + Hollywood, Todd Ho...
Big Data Day LA 2016/ Data Science Track - Data Science + Hollywood, Todd Ho...Data Con LA
 
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Data Con LA
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Data Con LA
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...Data Con LA
 
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate UniversityBig Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate UniversityData Con LA
 
Apache Mesos and the new Open Source Architecture of the Modern Datacenter
Apache Mesos and the new Open Source Architecture of the Modern DatacenterApache Mesos and the new Open Source Architecture of the Modern Datacenter
Apache Mesos and the new Open Source Architecture of the Modern DatacenterData Con LA
 
Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...
Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...
Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...Data Con LA
 

Viewers also liked (20)

Big Data Day LA 2016 Keynote - Andy Feng/ Yahoo
Big Data Day LA 2016 Keynote - Andy Feng/ YahooBig Data Day LA 2016 Keynote - Andy Feng/ Yahoo
Big Data Day LA 2016 Keynote - Andy Feng/ Yahoo
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
 
Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...
Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...
Big Data Day LA 2016/ Data Science Track - Backstage to a Data Driven Culture...
 
Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...
Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...
Big Data Day LA 2016/ Use Case Driven track - The Encyclopedia of World Probl...
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
 
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
 
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
 
Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...
Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...
Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
 
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
 
Big Data Day LA 2016/ Data Science Track - Data Science + Hollywood, Todd Ho...
Big Data Day LA 2016/ Data Science Track -  Data Science + Hollywood, Todd Ho...Big Data Day LA 2016/ Data Science Track -  Data Science + Hollywood, Todd Ho...
Big Data Day LA 2016/ Data Science Track - Data Science + Hollywood, Todd Ho...
 
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate UniversityBig Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
 
Apache Mesos and the new Open Source Architecture of the Modern Datacenter
Apache Mesos and the new Open Source Architecture of the Modern DatacenterApache Mesos and the new Open Source Architecture of the Modern Datacenter
Apache Mesos and the new Open Source Architecture of the Modern Datacenter
 
Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...
Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...
Big Data Day LA 2016/ NoSQL track - Introduction to Graph Databases, Oren Gol...
 

Similar to Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Development at Bloomberg, Nimbus Goehausen, Senior Software Engineer, Bloomberg

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your BrowserBig Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browsergethue
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Lecture 5: Functional Programming
Lecture 5: Functional ProgrammingLecture 5: Functional Programming
Lecture 5: Functional ProgrammingEelco Visser
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Pydiomatic
PydiomaticPydiomatic
Pydiomaticrik0
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghStuart Roebuck
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a clusterGal Marder
 
Things about Functional JavaScript
Things about Functional JavaScriptThings about Functional JavaScript
Things about Functional JavaScriptChengHui Weng
 
Столпы функционального программирования для адептов ООП, Николай Мозговой
Столпы функционального программирования для адептов ООП, Николай МозговойСтолпы функционального программирования для адептов ООП, Николай Мозговой
Столпы функционального программирования для адептов ООП, Николай МозговойSigma Software
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark Summit
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...GeeksLab Odessa
 

Similar to Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Development at Bloomberg, Nimbus Goehausen, Senior Software Engineer, Bloomberg (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your BrowserBig Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browser
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Lecture 5: Functional Programming
Lecture 5: Functional ProgrammingLecture 5: Functional Programming
Lecture 5: Functional Programming
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Pydiomatic
PydiomaticPydiomatic
Pydiomatic
 
Python idiomatico
Python idiomaticoPython idiomatico
Python idiomatico
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup Edinburgh
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a cluster
 
Things about Functional JavaScript
Things about Functional JavaScriptThings about Functional JavaScript
Things about Functional JavaScript
 
Столпы функционального программирования для адептов ООП, Николай Мозговой
Столпы функционального программирования для адептов ООП, Николай МозговойСтолпы функционального программирования для адептов ООП, Николай Мозговой
Столпы функционального программирования для адептов ООП, Николай Мозговой
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Development at Bloomberg, Nimbus Goehausen, Senior Software Engineer, Bloomberg

  • 1. Iterative Spark Development at Bloomberg Nimbus Goehausen Spark Platform Engineer ngoehausen@bloomberg.net Copyright 2016 Bloomberg L.P. All rights reserved.
  • 2. A Data Pipeline source outputProcess output = Process(source)
  • 3. Spark development cycle Make code change Deploy jar (1m) Start spark submit / spark shell / notebook (1m) Wait for execution (1m – several hours) Observe results
  • 4. This should work, right? source outputProcess output = Process(source)
  • 5. What have I done to deserve this? source outputProcess java.lang.HotGarbageException: You messed up at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion) at another.library.InputStream $$somuchfun$mcV$sp.read(InputStream.scala:42) at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792) …
  • 6. Split it up source outputStage2Stage1 Stage1 Output Main1 Main2 stage1Out = stage1(source) stage1Out.save(“stage1Path”) stage1Out = load(“stage1Path”) output = stage2(stage1Out)
  • 7. Automatic checkpoint source outputStage2Stage1 Stage1 Output stage1Out = stage1(source).checkpoint() Output = stage2(stage1Out)
  • 8. First run source outputStage2Stage1 Stage1 Output stage1Out = stage1(source).checkpoint() Output = stage2(stage1Out)
  • 9. Second run source outputStage2Stage1 Stage1 Output stage1Out = stage1(source).checkpoint() Output = stage2(stage1Out)
  • 10. Change logic source outputStage2 modified Stage1 modified Stage1 Output stage1Out = stage1(source).checkpoint() Output = stage2(stage1Out) Stage1 Output
  • 11. How do we do this? • Need some way of automatically generating a path to save to or load from. • Should stay the same when we want to reload the same data. • Should change when we want to generate new data.
  • 12. Refresher: what defines an RDD Dependencies Logic Result
  • 13. Refresher: what defines an RDD Dependencies Logic Result [1,2,3,4,5…] [2,4,6,8,10…] map(x => x*2)
  • 14. A logical signature result = logic(dependencies) signature(result) = hash(signature(dependencies) + hash(logic))
  • 15. A logical signature val rdd2 = rdd1.map(f) signature(rdd2) == hash(signature(rdd1) + hash(f)) val sourceRDD = sc.textFile(uri) signature(sourceRDD) == hash(“textFile:” + uri)
  • 16. Wait, how do you hash a function? val f = (x: Int) => x * 2 f compiles to a class file of jvm bytecode Hash those bytes. Change f to val f = (x: Int) => x * 3 And the class file will change.
  • 17. What if the function references other functions or static methods? • f = (x: Int) => Values.v + staticMethod(x) + g(x) • This is the hard part. • Must follow dependencies recursively, identify all classfiles that the function depends on. • Make a combined hash of the classfiles and any runtime values.
  • 18. Putting this together source outputStage2Stage1 Stage1 Output stage1Out = stage1(source).checkpoint() Output = stage2(stage1Out)
  • 19. How to switch in checkpoints • One option: change spark. • Other option: build on top of spark.
  • 20. Distributed Collection (DC) • Mostly same api as RDD, maps, flatmaps and so on. • Can be instantiated without a spark context. • Pass in a spark context to get the corresponding RDD. • Automatically computes logical signature.
  • 21. RDD vs DC val numbers: DC[Int] = parallelize(1 to 10) val filtered: DC[Int] = numbers.filter(_ < 5) val doubled: DC[Int] = filtered.map(_ * 2) val doubledRDD: RDD[Int] = doubled.getRDD(sc) val numbers: RDD[Int] = sc.parallelize(1 to 10) val filtered: RDD[Int] = numbers.filter(_ < 5) val doubled: RDD[Int] = filtered.map(_ * 2)
  • 22. Easy checkpoint val numbers: DC[Int] = parallelize(1 to 10) val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint() val doubled: DC[Int] = filtered.map(_ * 2)
  • 23. Problem: Non lazy definition val numbers: RDD[Int] = sc.parallelize(1 to 10) val doubles: RDD[Double] = numbers.map(x => x.toDouble) val sum: Double = doubles.sum() val normalized: RDD[Double] = doubles.map(x => x / sum) We can’t get a signature of `normalized` without computing the sum!
  • 24. Solution: Deferred Result • Defined by a function on an RDD • Like a DC but contains only one element • Pass in a spark context to compute the result • Also has a logical signature • We can now express entirely lazy pipelines
  • 25. Deferred Result example val numbers: DC[Int] = parallelize(1 to 10) val doubles: DC[Double] = numbers.map(x => x.toDouble) val sum: DR[Double] = doubles.sum() val normalized: DC[Double] = doubles.withResult(sum).map{case (x, s) => x / s} We can compute a signature for `normalized` without computing a sum.
  • 26. Improving the workflow • Avoid adding complexity in order to speed things up. • Get the benefits of avoiding redundant computation, but avoid the costs of managing intermediate results. • Further automatic behaviors to explore, such as automatic failure traps, and generation of common statistics.
  • 27. Can I use this now? • Not yet, come talk to me if you’re interested. • Open source release to come.

Editor's Notes

  1. Intro Bloomberg: premier data platform for financial professionals. Primary emphasis on accurate realtime results. Many data sources, lots of activity. Also a need for offline batch processing to build machine learning models and do backtesting on large datasets. Going to talk about data pipeline development
  2. Very broad view of an arbitrary data pipeline Talking about scala In a library, making a jar
  3. 5 cycle steps, make code change … Typical dev cycle if you’re making edits to a library. Certainly you would want to avoid checking every change with a full cluster run. Compiler checks, unit tests, are your friend. Running locally with a small sample is great, but you won’t catch everything. Plenty of problems that are difficult to anticipate or test without running on the cluster. 1 in 10000 element issues, things that blow up with a lot of data. Oh it works locally? That’s great but I really want to know that it works on the full dataset. We want to speed this up, let’s target the worst bottleneck, typically execution time.
  4. We’re in this cycle, it compiles, we’ve tested locally, we’ve got unit tests and we’re about to run on the cluster.
  5. If this doesn’t take long to run, no worries, just iterate quickly. If this does take a while to run, you’ve got problems.
  6. We’ve solved one problem, iterate quicker on later stages. We’ve introduced new problems by persisting intermediate results.