Building Recoverable Pipelines
With Apache Spark
Holden Karau
Open Source Developer Advocate @ Google
Some links (slides & recordings
will be at):
http://bit.ly/2QMUaRc
^ Slides & Code
(only after the talk because early is hard)
Shkumbin Saneja
Holden:
▪ Preferred pronouns are she/her
▪ Developer Advocate at Google
▪ Apache Spark PMC/Committer, contribute to many other projects
▪ previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
▪ co-author of Learning Spark & High Performance Spark
▪ Twitter: @holdenkarau
▪ Slide share http://www.slideshare.net/hkarau
▪ Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
▪ Spark Talk Videos http://bit.ly/holdenSparkVideos
Who y’all are?
▪ Nice folk
▪ Like databases of a certain kind
▪ Occasionally have big data jobs on your big data fail
mxmstryo
What are we going to explore?
▪ Brief: what is Spark and why it’s related to this conference
▪ Also brief: Some of the ways Spark can fail in hour 23
▪ Less brief: a first stab at making it recoverable
▪ How that goes boom
▪ Repeat ? times until it stops going boom
▪ Summary and github link
Stuart
What is Spark?
• General purpose distributed system
• With a really nice API including Python :)
• Apache project (one of the most active)
• Much faster than Hadoop Map/Reduce
• Good when too big for a single machine
• Built on top of two abstractions for
distributed data: RDDs & Datasets
The different pieces of Spark
▪ Apache Spark (core)
▪ SQL, DataFrames & Datasets
▪ Structured Streaming
▪ Spark ML
▪ MLlib
▪ Streaming
▪ Bagel & GraphX
▪ GraphFrames
▪ Language labels in the diagram: Scala, Java, Python & R; Scala, Java & Python
Paul Hudson
Why people come to Spark:
Well this MapReduce job
is going to take 16 hours -
how long could it take to
learn Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit in
memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
Big Data == Wordcount
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Chris
Big Data != Wordcount
▪ ETL (keeping your databases in sync)
▪ SQL on top of non-SQL (hey what about if we added a SQL
engine to this?)
▪ ML - Everyone’s doing it, we should too
▪ DL - VC’s won’t give us money for ML anymore so we changed
its name
▪ But for this talk we’re just looking at Wordcount because it fits
on a slide
f ford Pinto by Morven
Why Spark fails & fails late
▪ Lazy evaluation can make predicting behaviour difficult
▪ Out of memory errors (from JVM heap to container limits)
▪ Errors in our own code
▪ Driver failure
▪ Data size increases without required tuning changes
▪ Key-skew (consistent partitioning is a great idea right? Oh wait…)
▪ Serialization
▪ Limited type checking in non-JVM languages with ML pipelines
▪ etc.
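The first bullet is easy to see without a cluster. The sketch below uses plain Python generators as a stand-in for Spark's lazy transformations (an analogy, not Spark's API; `parse` and `records` are hypothetical names). Building the pipeline succeeds instantly, and the bad record only raises once something forces evaluation.

```python
def parse(record):
    """Hypothetical per-record function; fails on malformed input."""
    return int(record)

# Stand-in for a large input where the bad record sits near the end.
records = ["1", "2", "3", "oops"]

# "Transformation": nothing runs yet, so no error is raised here.
plan = (parse(r) for r in records)

processed = []
error = None
try:
    # "Action": iterating forces evaluation, and the failure surfaces late,
    # after the earlier records have already been processed.
    for value in plan:
        processed.append(value)
except ValueError as exc:
    error = exc
```

In a real job the gap between defining the plan and hitting the bad record can be hours rather than microseconds, which is why keeping intermediate results matters.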
f ford Pinto by Morven
ayphen
Why isn’t it recoverable?
▪ Separate jobs - no files, no VMs, only sadness
▪ Same job (e.g. notebook failure and retry) - cache & files can be used for recovery
Jennifer C.
“Recoverable” Wordcount: Take 1
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
# fs is assumed to be a Hadoop FileSystem handle defined elsewhere
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words_path)):
    words = sc.textFile(words_path)
else:
    words_raw.saveAsTextFile(words_path)
    words = words_raw
# Continue with previous code
KLMircea
So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
▪ We don’t have any clean up on success
▪ sc._jvm is weird
▪ Functions -- the future!
▪ Not async
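To make the first bullet concrete, here is a minimal local-filesystem sketch (plain Python, no Spark; `write_parts`, `fail_after` and the file names are assumptions for illustration). A write that dies halfway leaves the output directory behind, so a check on the directory alone reports "done", while a check on a marker file written last does not.

```python
import os
import tempfile

def write_parts(target, parts, fail_after=None):
    """Write one file per part, then a marker file once everything is written.

    fail_after is a hypothetical knob that simulates a crash mid-write.
    """
    os.makedirs(target, exist_ok=True)
    for i, part in enumerate(parts):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated failure while writing")
        with open(os.path.join(target, "part-{0:05d}".format(i)), "w") as f:
            f.write(part)
    # Only reached if every part was written.
    open(os.path.join(target, "_SUCCESS"), "w").close()

def naive_is_done(target):
    """Take 1 style check: does the output directory exist?"""
    return os.path.exists(target)

def marker_is_done(target):
    """Marker style check: was the success file written?"""
    return os.path.exists(os.path.join(target, "_SUCCESS"))

base = tempfile.mkdtemp()
target = os.path.join(base, "words")
try:
    write_parts(target, ["a", "b", "c"], fail_after=1)
except RuntimeError:
    pass

naive_result = naive_is_done(target)    # directory exists but is incomplete
marker_result = marker_is_done(target)  # no marker, so the stage must be redone
```

The marker approach only works if the marker is the very last thing written.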
Jennifer C.
“Recoverable” Wordcount: Take 2
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
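words_path = "words"
# Check for a success marker inside the output directory, not the directory itself
success_path = words_path + "/SUCCESS.txt"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(success_path)):
    words = sc.textFile(words_path)
else:
    words_raw.saveAsTextFile(words_path)
    words = words_raw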
# Continue with previous code
Susanne Nilsson
So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
• Fixed
▪ We don’t have any clean up on success
• ….
▪ sc._jvm is weird
• Yeah we’re not fixing this one unless we use scala
▪ Functions -- the future!
• sure!
▪ Have to wait to finish writing file
• Hold your horses
ivva
“Recoverable” [X]: Take 3
def non_blocking_df_save_or_load(df, target):
    # fs, hadoop_fs_path and session are assumed to be defined elsewhere
    success_files = ["{0}/SUCCESS.txt", "{0}/_SUCCESS"]
    if any(fs.exists(hadoop_fs_path(t.format(target)))
           for t in success_files):
        print("Reusing")
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.write.save(target)
        return df
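The slide uses `fs` and `hadoop_fs_path` without defining them. One possible way to write such helpers is sketched below, assuming a PySpark `SparkContext` is passed in as `sc`. It relies on the private `_jvm` and `_jsc` attributes (the "sc._jvm is weird" part), and the explicit `sc` parameter and the `has_success_marker` name are choices made for this sketch, not taken from the talk.

```python
def hadoop_fs_path(sc, path):
    """Wrap a string in a JVM org.apache.hadoop.fs.Path via the Py4J gateway."""
    return sc._jvm.org.apache.hadoop.fs.Path(path)

def hadoop_fs(sc):
    """Return the Hadoop FileSystem for this context's Hadoop configuration."""
    conf = sc._jsc.hadoopConfiguration()
    return sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)

def has_success_marker(sc, target, markers=("SUCCESS.txt", "_SUCCESS")):
    """True if any of the marker files exists under the target directory."""
    fs = hadoop_fs(sc)
    return any(
        fs.exists(hadoop_fs_path(sc, "{0}/{1}".format(target, m)))
        for m in markers
    )
```

Passing `sc` explicitly keeps the helpers free of globals, which also makes them easy to exercise with a stand-in object in tests.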
Jennifer C.
So what can we do better?
▪ Try and not slow down our code on the happy path
• async?
▪ Cleanup on success (damn meant to do that earlier)
hkase
Adding async?
def non_blocking_df_save(df, target):
    import threading
    def save_panda():
        df.write.mode("overwrite").save(target)
    thread = threading.Thread(target=save_panda)
    thread.start()
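One thing the bare thread gives up is any way to learn whether the save finished or failed. A possible variation, sketched here in plain Python with `concurrent.futures` from the standard library, hands back a `Future` instead. `save_fn` is a stand-in for something like `lambda t: df.write.mode("overwrite").save(t)`, and `fake_save` and `failing_save` exist only to exercise the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared pool for background saves (the name and size are illustrative).
_save_pool = ThreadPoolExecutor(max_workers=1)

def non_blocking_save(save_fn, target):
    """Start save_fn(target) in the background and return a Future for it."""
    return _save_pool.submit(save_fn, target)

saved = []

def fake_save(target):
    """Pretend save that records where it wrote."""
    saved.append(target)
    return target

def failing_save(target):
    """Pretend save that always fails."""
    raise OSError("simulated write failure")

ok = non_blocking_save(fake_save, "words")
bad = non_blocking_save(failing_save, "words")

ok_result = ok.result()      # blocks until the background save finishes
bad_error = bad.exception()  # the failure is captured instead of lost in the thread
```

The caller can keep working and only check the future before relying on the saved files or cleaning up.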
What could go wrong?
▪ Turns out… a lot
▪ Multiple executions on the DAG are not super great
(getting better but)
▪ How do we work around this?
Spark’s (core) magic: the DAG
▪ In Spark most of our work is done by transformations
• Things like map
▪ Transformations return new RDDs or DataFrames representing
this data
▪ The RDD or DataFrame however doesn’t really “exist”
▪ RDD & DataFrames are really just “plans” of how to make the
data show up if we force Spark’s hand
▪ tl;dr - the data doesn’t exist until it “has” to
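A toy model can show why running several actions on the same plan is costly. `LazyPlan` below is a hypothetical stand-in for an RDD or DataFrame written in plain Python (it is not Spark's implementation): each action re-runs the recipe unless the plan has been cached.

```python
class LazyPlan:
    """Toy stand-in for an RDD/DataFrame: a recipe that runs only on an action."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.runs = 0  # how many times the recipe was actually executed

    def map(self, fn):
        # Transformation: returns a new plan and runs nothing.
        return LazyPlan(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Ask for the result to be kept after the first action.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.runs += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        # Action: forces evaluation.
        return len(self._materialize())

    def collect(self):
        # Action: forces evaluation.
        return list(self._materialize())

source = LazyPlan(lambda: ["a b", "c"])

words = source.map(lambda line: line.upper())
runs_before_action = words.runs  # nothing has executed yet
words.count()
words.collect()
runs_uncached = words.runs       # one execution per action

cached = source.map(lambda line: line.upper()).cache()
cached.count()
cached.collect()
runs_cached = cached.runs        # executed once, then served from the cache
```

With a background save added, an uncached plan would be computed once for the save and again for whatever the main job does next.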
Photo by Dan G
The DAG
The query plan
Susanne Nilsson
cache + sync count + async save
def non_blocking_df_save_or_load(df, target):
    s = "{0}/SUCCESS.txt"
    if fs.exists(hadoop_fs_path(s.format(target))):
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.cache()
        df.count()  # force evaluation so the cache is populated before the async save
        non_blocking_df_save(df, target)
        return df
Well that was “fun”?
▪ Replace wordcount with your back-fill operation and it
becomes less fun
▪ You also need to clean up the files
▪ Use job IDs to avoid stomping on other jobs
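The last two bullets can be sketched together on a local filesystem (plain Python, no Spark; `checkpoint_dir`, `run_job`, the directory layout and the UUID-based job ID are all assumptions for illustration). Checkpoints live under a per-job directory, a retry with the same job ID skips finished stages, and the directory is removed once the whole job succeeds.

```python
import os
import shutil
import tempfile
import uuid

def checkpoint_dir(base, job_id, stage):
    """Per-job, per-stage directory so concurrent jobs do not share paths."""
    return os.path.join(base, job_id, stage)

def run_job(base, job_id, stages):
    """Run stages in order, skipping any whose marker already exists.

    stages is a list of (name, function) pairs; each function returns text
    to persist. Checkpoints are deleted once the whole job succeeds.
    """
    results = []
    for name, fn in stages:
        path = checkpoint_dir(base, job_id, name)
        marker = os.path.join(path, "_SUCCESS")
        data_file = os.path.join(path, "data.txt")
        if os.path.exists(marker):
            # Finished in an earlier attempt: reuse instead of recomputing.
            with open(data_file) as f:
                results.append(f.read())
            continue
        os.makedirs(path, exist_ok=True)
        output = fn()
        with open(data_file, "w") as f:
            f.write(output)
        open(marker, "w").close()  # marker is written last
        results.append(output)
    shutil.rmtree(os.path.join(base, job_id))  # cleanup on success
    return results

base = tempfile.mkdtemp()
job_id = uuid.uuid4().hex  # hypothetical job ID scheme
calls = []
attempts = {"two": 0}

def stage_one():
    calls.append("one")
    return "words"

def stage_two():
    attempts["two"] += 1
    if attempts["two"] == 1:
        raise RuntimeError("simulated failure in hour 23")
    calls.append("two")
    return "counts"

stages = [("one", stage_one), ("two", stage_two)]
try:
    run_job(base, job_id, stages)
except RuntimeError:
    pass
results = run_job(base, job_id, stages)  # retry with the same job ID
```

Stage one runs only once across both attempts, and nothing is left on disk after the successful retry.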
Spark Videos
▪ Apache Spark Youtube Channel
▪ My Spark videos on YouTube -
• http://bit.ly/holdenSparkVideos
▪ Spark Summit 2014 training
▪ Paco’s Introduction to Apache Spark
Paul Anderson
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
I also have a book...
High Performance Spark, it’s available today & the gift of
the season.
Unrelated to this talk, but if you have a corporate credit
card (and/or care about distributed systems)….
http://bit.ly/hkHighPerfSpark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 Spark testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and/or my boss) how
I’m doing?
http://bit.ly/holdenTalkFeedback
Want to e-mail me?
Promise not to be creepy? Ok:
holden@pigscanfly.ca
