Fault Tolerance in Spark:
Lessons Learned
from Production
José Soltren
Cloudera
Who am I?
• Software Engineer at Cloudera focused on
Apache Spark
– …also an Apache Spark contributor
• Previous hardware, kernel, and driver hacking
experience
So… why does Cloudera care about
Fault Tolerance?
• Cloudera supports big customers running big
applications on big hardware.
• How big?
– >$1B/yr,
– core business logic
– 1000+ node clusters.
• Outages are really expensive.
– …about as expensive as flying a small jet.
• Customers’ problems are our problems.
Apache Spark
Fault Tolerance Basics
[Diagram: Spark shuffle architecture (https://0x0fff.com/spark-architecture-shuffle/)]
Fault Tolerance Basics
• RDDs – Resilient Distributed Datasets: multiple
pieces, multiple copies
• Lineage: not versions, but a way of re-creating data (sketched after this list)
• HDFS (or HBase or other external store)
• Scheduler: Blacklist
• Scheduler/Storage: Duplication and Locality
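As a minimal sketch of how lineage enables recovery (the HDFS path and object names below are hypothetical; the RDD calls are standard Spark):

import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))

    // Each transformation records its parent RDD, not a copy of the data.
    val base     = sc.textFile("hdfs:///data/events.txt") // hypothetical path
    val parsed   = base.map(_.split(","))
    val filtered = parsed.filter(_.length > 2)

    // If a partition of `filtered` is lost (say, to a disk failure), Spark
    // replays map + filter on just that partition of `base`; no replica of
    // the intermediate data is required.
    println(filtered.toDebugString) // prints the lineage graph
    println(filtered.count())
    sc.stop()
  }
}

Recomputation stays cheap as long as lineage chains are short; checkpointing to HDFS is the escape hatch for long chains.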
In production, at scale.
Case Study:
Application Outage
[SPARK-8425]
2016-04-22: An Application Outage
• Customer reports an application failure.
– Spark cluster with hundreds of nodes and 8 disks per node.
– Application runs on customer’s Spark cluster.
– High Availability is critical for this application.
• Immediate cause was a disk failure:
FileNotFoundException.
• “HDFS and YARN responded to the disk failure
appropriately but Spark continued to access the failed
disk.”
– Yikes.
Disk Failures and Scheduling
• One node has one bad disk.
– …tasks sometimes fail on this node.
– ...tasks consistently fail if they hit this disk.
• Tasks succeed if they are scheduled on other nodes!
• Tasks fail if they are scheduled on the same node.
– ...which is likely due to locality preferences.
• The scheduler will kill the whole job after some number
of failures.
• Can’t tell YARN (Mesos?) we have a bad resource.
Failure Recap
• There is already support for fault tolerance present –
don’t panic!
• Spark could handle an unreachable node.
• Spark could handle a node with no usable disks.
• We hit an edge case.
• Failure modes are binary and not expressive enough.
Yuck.
:(
Short Term:
Workaround on Spark 1.6 and 2.0
• spark.scheduler.executorTaskBlacklistTime
– Set a value that is much longer than the duration of the longest task.
– Tells the scheduler to “blacklist” an (executor, task) combination for some
amount of time.
• spark.task.maxFailures
– Set a value that is larger than the maximum number of executors on a node.
– Determines the number of failures before the whole application is killed.
• spark.speculation
– Defaults to false. We did not recommend enabling it. (All three settings are sketched below.)
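A sketch of the workaround in application code, assuming Spark 1.6/2.0. The configuration keys are the ones named above; every numeric value is a placeholder to size against your own cluster:

import org.apache.spark.{SparkConf, SparkContext}

object BlacklistWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("blacklist-workaround-sketch")
      // Keep a failed (executor, task) pair blacklisted for longer than the
      // longest task could run; assuming a one-hour ceiling, in milliseconds.
      .set("spark.scheduler.executorTaskBlacklistTime", "3600000")
      // Tolerate more failures than the executor count on any single node
      // (assuming at most 7 executors per node), so one bad disk cannot
      // fail the whole application.
      .set("spark.task.maxFailures", "8")
      // Left at the default of false, per the recommendation above.
      .set("spark.speculation", "false")
    val sc = new SparkContext(conf)
    // ... application logic ...
    sc.stop()
  }
}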
Long Term:
Overhaul The Blacklist
• The Scheduler is critical core code! Fixing bugs there is whack-a-mole.
• The driver is multi-threaded and serves asynchronous requests.
• Many scenarios considered in Design Doc http://bit.do/bklist
– Large Cluster, Large Stages, One Bad Disk (our scenario)
– Failing Job, Very Small Cluster
– Large Cluster, Very Small Stages
– Long Lived Application, Occasional Failed Tasks
– Bad Node leads to widespread Shuffle-Fetch Failures
– Bad Node, One Executor, Dynamic Allocation
– Application programming errors!
org.apache.spark.scheduler.
BlacklistTracker
/**
* BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
* executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add
* additional blacklisting of executors and nodes for individual tasks and stages which works in
* concert with the blacklisting here.
*
* The tracker needs to deal with a variety of workloads, eg.:
*
* * bad user code -- this may lead to many task failures, but that should not count against
* individual executors
* * many small stages -- this may prevent a bad executor for having many failures within one
* stage, but still many failures over the entire application
* * "flaky" executors -- they don't fail every task, but are still faulty enough to merit
* blacklisting
* * See the design doc on SPARK-8425 for a more in-depth discussion.
*
* THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
* called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
* one exception is [[nodeBlacklist()]], which can be called without holding a lock.
*/
Long Term:
Scheduler Improvements
• New in Spark 2.2 (under development June – December 2016); see the sketch after this list.
• spark.blacklist.enabled (Default: false)
• spark.blacklist.task.maxTaskAttemptsPerExecutor (Default: 1)
• spark.blacklist.task.maxTaskAttemptsPerNode (Default: 2)
• spark.blacklist.stage.maxFailedTasksPerExecutor (Default: 2)
• spark.blacklist.stage.maxFailedExecutorsPerNode (Default: 2)
– http://spark.apache.org/docs/latest/configuration.html
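For example, opting in might look like the following; the keys come from the configuration page linked above, and the explicit values simply restate the documented defaults:

import org.apache.spark.{SparkConf, SparkContext}

object BlacklistEnabledSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("blacklist-enabled-sketch")
      // The blacklist is off by default in Spark 2.2; turn it on.
      .set("spark.blacklist.enabled", "true")
      // The settings below restate the defaults, shown here for clarity.
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
      .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
      .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")
    val sc = new SparkContext(conf)
    // ... application logic ...
    sc.stop()
  }
}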
Blacklist Web UI
Case Study:
Shuffle Fetch Failures
[SPARK-4105]
SPARK-4105: The Symptom
• Non-deterministic
FAILED_TO_UNCOMPRESS(5) errors during
shuffle read.
• Difficult to reproduce.
• “Smells” like stream corruption.
– Some users saw similar issues
with LZF compression.
• Not related to spilling.
[Diagram: Spark shuffle architecture (https://0x0fff.com/spark-architecture-shuffle/)]
SPARK-4105: Fixed in Spark 2.2
• https://github.com/apache/spark/pull/15923
• “It seems that it's very likely the corruption is
introduced by some weird machine/hardware, also
the checksum (16 bits) in TCP is not strong enough
to identify all the corruption.”
• Try to decompress blocks as they come in and check for IOExceptions (sketched below).
• Works for now, maybe we can do better.
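A rough sketch of the detection idea, not the actual patch from the pull request above: eagerly run fetched bytes through the decompression codec, and treat an IOException as corruption so the block can be re-fetched rather than poisoning the task later. GZIP from the JDK stands in for Spark's shuffle codec here:

import java.io.{ByteArrayInputStream, IOException, InputStream}
import java.util.zip.GZIPInputStream

object EagerDecompressSketch {
  /** Drain `in` through the codec; an IOException means the block is
    * corrupt and the caller should retry the fetch from another source. */
  def isCorrupt(in: InputStream): Boolean =
    try {
      val wrapped = new GZIPInputStream(in) // stand-in for Spark's codec
      val buf = new Array[Byte](8192)
      while (wrapped.read(buf) != -1) {}    // decompress eagerly
      false
    } catch {
      case _: IOException => true           // corruption caught early
    } finally {
      in.close()
    }

  def main(args: Array[String]): Unit = {
    val garbage = new ByteArrayInputStream(Array[Byte](1, 2, 3, 4))
    println(s"corrupt? ${isCorrupt(garbage)}") // prints: corrupt? true
  }
}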
Closing thoughts.
What did we learn?
• The Spark Scheduler is responsible for assigning units of
work to compute resources.
• The scheduler is where the rubber meets the road when it
comes to fault tolerance.
• There are a few knobs to tweak, but hopefully that is not
necessary.
• Other things can fail besides the scheduler, too.
• Many classical distributed systems problems are still present
(even though Spark does a great job of abstracting most of
them away).
Recommendations for
Application Developers
• Gather and read logs, early and often.
– Issues may occur in smaller environments.
• Start small: one executor, one host.
• Grow slowly.
• Use “pen and paper” to determine expectations for job times (worked example below).
• Watch out for stragglers, outliers, and crashes.
• Don’t: start a critical job on a huge cluster and expect perfect performance the first time.
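A back-of-the-envelope example of the “pen and paper” step, with made-up numbers: a job scanning 1 TB of input on 100 executors, each reading at roughly 100 MB/s, should finish a full pass in about 1 TB / (100 × 100 MB/s) ≈ 100 seconds. If the stage instead takes twenty minutes, or one task runs ten times longer than its peers, investigate stragglers or faults before scaling up.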
Thank You.
José Soltren
jose@cloudera.com
