Have you ever had a Spark job fail in its second-to-last stage after a “trivial” update? Been partway through debugging a pipeline and wished you could look at its data? Had an “exploratory” notebook turn into something decidedly less exploratory? Come join me for a surprisingly simple adventure in building recoverable, more debuggable pipelines. Then join me on the adventure wherein we discover our “simple” solution has a bunch of hidden flaws, work around them, and end on a reminder of how important it is to test your code.
2. Some links (slides & recordings will be at):
http://bit.ly/2QMUaRc
^ Slides & Code (only after the talk because early is hard)
Shkumbin Saneja
3. Holden:
Preferred pronouns are she/her
Developer Advocate at Google
Apache Spark PMC/Committer, contribute to many other projects
previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & High Performance Spark
Twitter: @holdenkarau
Slide share http://www.slideshare.net/hkarau
Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
Spark Talk Videos http://bit.ly/holdenSparkVideos
5. Who y’all are?
Nice folk
Like databases of a certain kind
Occasionally have your big data jobs fail on your big data
mxmstryo
6. What are we going to explore?
Brief: what Spark is and why it’s relevant to this conference
Also brief: Some of the ways Spark can fail in hour 23
Less brief: a first stab at making it recoverable
How that goes boom
Repeat ? times until it stops going boom
Summary and github link
Stuart
7. What is Spark?
• General purpose distributed system
• With a really nice API including Python :)
• Apache project (one of the most active)
• Much faster than Hadoop Map/Reduce
• Good when the data is too big for a single machine
• Built on top of two abstractions for distributed data: RDDs & Datasets
8. The different pieces of Spark
[Stack diagram: Apache Spark core, with language bindings for Scala, Java, Python, & R; layered on top: SQL, DataFrames & Datasets; Structured Streaming; Streaming; Spark ML; MLlib; GraphX & Bagel; GraphFrames]
Paul Hudson
9. Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
10. Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
11. Big Data == Wordcount
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Chris
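The later “recoverable” takes switch to DataFrames, so here’s the same wordcount as a DataFrame sketch; it assumes session is a SparkSession (as in the later slides) and src is a text file path:
from pyspark.sql import functions as F

lines_df = session.read.text(src)  # one row per input line, column "value"
words_df = lines_df.select(
    F.explode(F.split(F.col("value"), " ")).alias("word"))
word_count_df = words_df.groupBy("word").count()
word_count_df.write.mode("overwrite").csv("output")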
12. Big Data != Wordcount
▪ ETL (keeping your databases in sync)
▪ SQL on top of non-SQL (hey, what if we added a SQL engine to this?)
▪ ML - everyone’s doing it, we should too
▪ DL - VCs won’t give us money for ML anymore, so we changed its name
▪ But for this talk we’re just looking at wordcount because it fits on a slide
14. Why Spark fails & fails late
▪ Lazy evaluation can make predicting behaviour difficult (see the sketch after this list)
▪ Out of memory errors (from JVM heap to container limits)
▪ Errors in our own code
▪ Driver failure
▪ Data size increases without required tuning changes
▪ Key-skew (consistent partitioning is a great idea right? Oh wait…)
▪ Serialization
▪ Limited type checking in non-JVM languages with ML pipelines
▪ etc.
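To make the lazy-evaluation bullet concrete, here’s a tiny sketch (hypothetical input, not from the talk) where the bug is planted early but only explodes at the end:
lines = sc.textFile(src)
# Nothing runs yet: map just records the plan, bad rows and all
parsed = lines.map(lambda x: int(x))
# Only this action forces execution, so one non-numeric line blows up
# here - potentially in hour 23 of a real pipeline
parsed.count()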
16. Why isn’t it recoverable?
▪ Separate jobs - no files, no VMs, only sadness
▪ If it’s the same job (e.g. a notebook failure and retry), the cache & files allow recovery
Jennifer C.
17. “Recoverable” Wordcount: Take 1
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words_path)):
    words = sc.textFile(words_path)
else:
    words = words_raw
# Continue with previous code
KLMircea
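These snippets assume an fs handle already exists; one way to get one (a sketch in the same py4j style as the slide, not necessarily how the original code did it):
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)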
18. So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
▪ We don’t have any clean up on success
▪ sc._jvm is weird
▪ Functions -- the future!
▪ Not async
Jennifer C.
19. “Recoverable” Wordcount: Take 2
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
success_path = words_path + "/SUCCESS.txt"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(success_path)):
    # Only reuse the output if the marker written on success exists
    words = sc.textFile(words_path)
else:
    words = words_raw
# Continue with previous code
Susanne Nilsson
20. So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
• Fixed
▪ We don’t have any clean up on success
• ….
▪ sc._jvm is weird
• Yeah, we’re not fixing this one unless we use Scala
▪ Functions -- the future!
• sure!
▪ Have to wait to finish writing file
• Hold your horses
ivva
21. “Recoverable” [X]: Take 3
def non_blocking_df_save_or_load(df, target):
    success_files = ["{0}/SUCCESS.txt", "{0}/_SUCCESS"]
    # hadoop_fs_path is a small helper wrapping sc._jvm.org.apache.hadoop.fs.Path
    if any(fs.exists(hadoop_fs_path(t.format(target))) for t in success_files):
        print("Reusing")
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.write.save(target)
        return df
Jennifer C.
22. So what can we do better?
▪ Try not to slow down our code on the happy path
• async?
▪ Cleanup on success (damn meant to do that earlier)
hkase
24. What could go wrong?
▪ Turns out… a lot
▪ Multiple executions of the DAG are not super great (it’s getting better, but still)
▪ How do we work around this?
25. Spark’s (core) magic: the DAG
▪ In Spark most of our work is done by transformations
• Things like map
▪ Transformations return new RDDs or DataFrames representing this data
▪ The RDD or DataFrame, however, doesn’t really “exist”
▪ RDDs & DataFrames are really just “plans” for how to make the data show up if we force Spark’s hand
▪ tl;dr - the data doesn’t exist until it “has” to (see the sketch below)
Photo by Dan G
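A quick sketch of why re-execution bites: without cache(), every action replays the whole plan from the source:
words = lines.flatMap(lambda x: x.split(" "))  # just a plan
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))  # still just a plan
word_count.saveAsTextFile("output")  # executes the full DAG
word_count.count()  # no cache(), so the full DAG runs again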
27. cache + sync count + async save
def non_blocking_df_save_or_load(df, target):
    s = "{0}/SUCCESS.txt"
    if fs.exists(hadoop_fs_path(s.format(target))):
        return session.read.load(target).persist()
    else:
        print("Saving")
        df.cache()
        df.count()  # blocking: forces the DAG once, materializing the cache
        non_blocking_df_save(df, target)  # kick off the write in the background
        return df
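non_blocking_df_save isn’t shown on the slide; a minimal sketch of what it could look like, assuming a plain background thread (error handling and cleanup omitted):
import threading

def non_blocking_df_save(df, target):
    # Fire off the (blocking) write on a daemon thread so the caller
    # can keep using the now-cached df without waiting on the save
    t = threading.Thread(target=lambda: df.write.save(target), daemon=True)
    t.start()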
28. Well that was “fun”?
▪ Replace wordcount with your back-fill operation and it
becomes less fun
▪ You also need to clean up the files
▪ Use job IDs to avoid stomping on other jobs (sketch below)
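A sketch of the job-ID idea (the names and env var here are assumptions, not from the talk): namespace intermediate paths by a per-job ID so retries can find their files while concurrent jobs stay out of each other’s way:
import os
import uuid

# Reuse a scheduler-provided ID on retries; fall back to a fresh one
job_id = os.environ.get("JOB_ID", uuid.uuid4().hex)

def checkpoint_path(name, base="checkpoints"):
    return "{0}/{1}/{2}".format(base, job_id, name)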
29. Spark Videos
▪ Apache Spark Youtube Channel
▪ My Spark videos on YouTube: http://bit.ly/holdenSparkVideos
▪ Spark Summit 2014 training
▪ Paco’s Introduction to Apache Spark
Paul Anderson
30. Learning Spark
[Book covers:]
▪ Fast Data Processing with Spark (out of date)
▪ Fast Data Processing with Spark (2nd edition)
▪ Advanced Analytics with Spark
▪ Spark in Action
▪ High Performance Spark
▪ Learning PySpark
31. I also have a book...
High Performance Spark, it’s available today & the gift of
the season.
Unrelated to this talk, but if you have a corporate credit card (and/or care about distributed systems)….
http://bit.ly/hkHighPerfSpark
32. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 Spark testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and/or my boss) how I’m doing?
http://bit.ly/holdenTalkFeedback
Want to e-mail me?
Promise not to be creepy? Ok:
holden@pigscanfly.ca
Editor's Notes
Lawyer cat demands that I point out that the eventual summary and GitHub link are indeed a proof of concept only