Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East talk by Jose Soltren

Spark is by its nature very fault tolerant. However, faults and application failures can and do happen, in production at scale.
In this talk, we’ll discuss the nuts and bolts of fault tolerance in Spark.
We will begin with a brief overview of the sorts of fault tolerance offered, and lead into a deep dive of the internals of fault tolerance. This will include a discussion of Spark on YARN, scheduling, and resource allocation.
We will then spend some time on a case study and discussing some tools used to find and verify fault tolerance issues. Our case study comes from a customer who experienced an application outage that was root caused to a scheduler bug. We discuss the analysis we did to reach this conclusion and the work that we did to reproduce it locally. We highlight some of the techniques used to simulate faults and find bugs.

At the end, we’ll discuss some future directions for fault tolerance improvements in Spark, such as scheduler and checkpointing changes.

  1. Fault Tolerance in Spark: Lessons Learned from Production (José Soltren, Cloudera)
  2. Who am I?
     • Software Engineer at Cloudera focused on Apache Spark
       – …also an Apache Spark contributor
     • Previous hardware, kernel, and driver hacking experience
  3. So… why does Cloudera care about Fault Tolerance?
     • Cloudera supports big customers running big applications on big hardware.
     • How big?
       – >$1B/yr
       – core business logic
       – 1000+ node clusters
     • Outages are really expensive.
       – …about as expensive as flying a small jet.
     • Customer's problems are our problems.
  4. Apache Spark Fault Tolerance Basics
  5. Fault Tolerance Basics
     https://0x0fff.com/spark-architecture-shuffle/
  6. Fault Tolerance Basics
     • RDDs – Resilient Distributed Datasets: multiple pieces, multiple copies
     • Lineage: not versions, but a way of re-creating data
     • HDFS (or HBase or other external store)
     • Scheduler: Blacklist
     • Scheduler/Storage: Duplication and Locality
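The lineage idea on slide 6 can be sketched as a toy model (illustrative only; these are not Spark's classes): each dataset remembers its parent and the transformation that produced it, so lost partitions are recomputed from lineage rather than restored from a stored copy.

```python
# Toy model of RDD lineage (illustrative only; not Spark's implementation).
# Each "RDD" stores its parent and the function that derives it, so data
# can be recomputed on demand after a simulated executor loss.

class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # cached data, or None if lost
        self.parent = parent           # lineage: where this data came from
        self.fn = fn                   # transformation applied to the parent

    def map(self, fn):
        # Record lineage only; nothing is computed yet (lazy, like Spark).
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        if self._partitions is None:
            # Recompute from lineage: pull parent data, re-apply the function.
            self._partitions = [self.fn(x) for x in self.parent.compute()]
        return self._partitions

    def lose_partitions(self):
        # Simulate a failed node: cached results vanish, lineage survives.
        self._partitions = None

source = ToyRDD(partitions=[1, 2, 3])
doubled = source.map(lambda x: x * 2)
print(doubled.compute())   # [2, 4, 6]
doubled.lose_partitions()  # "fault": cached output is gone
print(doubled.compute())   # [2, 4, 6] again, rebuilt from lineage
```

The key property this models is that fault recovery costs recomputation time, not storage: only the lineage graph has to survive the failure.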
  7. [Hand-drawn whiteboard slide; text not recoverable from the OCR]
  8. In production, at scale.
  9. Case Study: Application Outage [SPARK-8425]
  10. 2016-04-22: An Application Outage
     • Customer reports an application failure.
       – Spark cluster with hundreds of nodes and 8 disks per node.
       – Application runs on customer's Spark cluster.
       – High Availability is critical for this application.
     • Immediate cause was a disk failure: FileNotFoundException.
     • "HDFS and YARN responded to the disk failure appropriately but Spark continued to access the failed disk."
       – Yikes.
  11. Disk Failures and Scheduling
     • One node has one bad disk.
       – …tasks sometimes fail on this node.
       – …tasks consistently fail if they hit this disk.
     • Tasks succeed if they are scheduled on other nodes!
     • Tasks fail if they are scheduled on the same node.
       – …which is likely due to locality preferences.
     • The scheduler will kill the whole job after some number of failures.
     • Can't tell YARN (Mesos?) we have a bad resource.
  12. Failure Recap
     • There is already support for fault tolerance present – don't panic!
     • Spark could handle an unreachable node.
     • Spark could handle a node with no usable disks.
     • We hit an edge case.
     • Failure modes are binary, and not expressive enough.
  13. Yuck. :(
  14. Short Term: Workaround on Spark 1.6 and 2.0
     • spark.scheduler.executorTaskBlacklistTime
       – Set a value that is much longer than the duration of the longest task.
       – Tells the scheduler to "blacklist" an (executor, task) combination for some amount of time.
     • spark.task.maxFailures
       – Set a value that is larger than the maximum number of executors on a node.
       – Determines the number of failures before the whole application is killed.
     • spark.speculation
       – Defaults to false. Did not recommend enabling.
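Concretely, the short-term workaround could be applied at submit time along these lines (the values shown are illustrative, not from the talk; pick a blacklist time well beyond your longest task and a maxFailures above the per-node executor count):

```shell
# Illustrative spark-submit configuration for the 1.6/2.0 workaround.
spark-submit \
  --conf spark.scheduler.executorTaskBlacklistTime=3600000 \
  --conf spark.task.maxFailures=10 \
  --conf spark.speculation=false \
  ...  # application jar and arguments as usual
```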
  15. Long Term: Overhaul The Blacklist
     • The Scheduler is critical core code! Bug whack-a-mole.
     • Driver multi-threaded, asynchronous requests.
     • Many scenarios considered in Design Doc http://bit.do/bklist
       – Large Cluster, Large Stages, One Bad Disk (our scenario)
       – Failing Job, Very Small Cluster
       – Large Cluster, Very Small Stages
       – Long Lived Application, Occasional Failed Tasks
       – Bad Node leads to widespread Shuffle-Fetch Failures
       – Bad Node, One Executor, Dynamic Allocation
       – Application programming errors!
  16. org.apache.spark.scheduler.BlacklistTracker
     /**
      * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
      * executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add
      * additional blacklisting of executors and nodes for individual tasks and stages which works in
      * concert with the blacklisting here.
      *
      * The tracker needs to deal with a variety of workloads, eg.:
      *  * bad user code -- this may lead to many task failures, but that should not count against
      *    individual executors
      *  * many small stages -- this may prevent a bad executor for having many failures within one
      *    stage, but still many failures over the entire application
      *  * "flaky" executors -- they don't fail every task, but are still faulty enough to merit
      *    blacklisting
      *
      * See the design doc on SPARK-8425 for a more in-depth discussion.
      *
      * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
      * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
      * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
      */
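The core bookkeeping behind such a tracker can be sketched in a few lines (a deliberately simplified model; the real BlacklistTracker is Scala, lock-protected, and also handles expiry and per-stage state): count failed tasks per executor, blacklist an executor past a threshold, and blacklist a node once enough of its executors are blacklisted.

```python
# Simplified sketch of blacklist bookkeeping (not Spark's actual code).
from collections import defaultdict

class ToyBlacklistTracker:
    def __init__(self, max_failed_tasks_per_executor=2,
                 max_failed_executors_per_node=2):
        self.max_failed_tasks_per_executor = max_failed_tasks_per_executor
        self.max_failed_executors_per_node = max_failed_executors_per_node
        self.failures = defaultdict(int)          # executor id -> failed tasks
        self.executors_on_node = defaultdict(set)  # node -> executor ids
        self.executor_blacklist = set()
        self.node_blacklist = set()

    def task_failed(self, executor, node):
        self.executors_on_node[node].add(executor)
        self.failures[executor] += 1
        if self.failures[executor] >= self.max_failed_tasks_per_executor:
            self.executor_blacklist.add(executor)
        # A node with enough blacklisted executors is itself suspect
        # (e.g. a shared bad disk, as in the case study).
        bad = self.executors_on_node[node] & self.executor_blacklist
        if len(bad) >= self.max_failed_executors_per_node:
            self.node_blacklist.add(node)

    def is_schedulable(self, executor, node):
        return (executor not in self.executor_blacklist
                and node not in self.node_blacklist)

t = ToyBlacklistTracker()
for _ in range(2):
    t.task_failed("exec-1", "node-A")  # repeated failures on one executor
print(t.is_schedulable("exec-1", "node-A"))  # False: executor blacklisted
print(t.is_schedulable("exec-2", "node-A"))  # True: node still usable
```

This captures why the design beats the old binary failure model: one flaky executor is taken out of rotation without condemning its node or killing the job.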
  17. Long Term: Scheduler Improvements
     • New in Spark 2.2 (under development June – December 2016)
     • spark.blacklist.enabled (Default: false)
     • spark.blacklist.task.maxTaskAttemptsPerExecutor (Default: 1)
     • spark.blacklist.task.maxTaskAttemptsPerNode (Default: 2)
     • spark.blacklist.task.maxFailedTasksPerExecutor (Default: 2)
     • spark.blacklist.task.maxFailedExecutorsPerNode (Default: 2)
       – http://spark.apache.org/docs/latest/configuration.html
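Since the feature ships disabled by default, turning it on in Spark 2.2+ is a one-flag change at submit time (a minimal example; the per-executor/per-node thresholds listed on this slide can be left at their defaults or tuned the same way):

```shell
# Minimal example: enable the Spark 2.2 blacklist with default thresholds.
spark-submit \
  --conf spark.blacklist.enabled=true \
  ...  # application jar and arguments as usual
```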
  18. Blacklist Web UI
  19. Case Study: Shuffle Fetch Failures [SPARK-4105]
  20. SPARK-4105: The Symptom
     • Non-deterministic FAILED_TO_UNCOMPRESS(5) errors during shuffle read.
     • Difficult to reproduce.
     • "Smells" like stream corruption.
       – Some users saw similar issues with LZF compression.
     • Not related to spilling.
     https://0x0fff.com/spark-architecture-shuffle/
  21. SPARK-4105: Fixed in Spark 2.2
     • https://github.com/apache/spark/pull/15923
     • "It seems that it's very likely the corruption is introduced by some weird machine/hardware, also the checksum (16 bits) in TCP is not strong enough to identify all the corruption."
     • Try to decompress blocks as they come in and check for IOExceptions.
     • Works for now, maybe we can do better.
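The core idea of the fix, surfacing corruption at fetch time instead of deep inside the reading task, can be illustrated with a small check (a sketch using Python's zlib; Spark's actual code wraps its own compression codecs and re-fetch paths):

```python
# Sketch: eagerly decompress a fetched shuffle block so stream corruption
# is detected at fetch time (where it can trigger a re-fetch) instead of
# surfacing as a confusing decompression error mid-task.
import zlib

def check_block(raw_bytes):
    """Return decompressed data, or None if the block is corrupt."""
    try:
        return zlib.decompress(raw_bytes)
    except zlib.error:
        # Corrupt block: a real system would mark it for re-fetch from
        # the remote executor rather than failing the whole task.
        return None

good = zlib.compress(b"shuffle block payload")
bad = good[:-3] + b"\x00\x00\x00"  # simulate on-the-wire corruption
print(check_block(good))  # b'shuffle block payload'
print(check_block(bad))   # None
```

This also mirrors the quoted diagnosis: TCP's 16-bit checksum can miss corruption, so the decompressor's own integrity check becomes the effective end-to-end validation.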
  22. Closing thoughts.
  23. What did we learn?
     • The Spark Scheduler is responsible for assigning units of work to compute resources.
     • The scheduler is where the rubber meets the road when it comes to fault tolerance.
     • There are a few knobs to tweak, but hopefully that is not necessary.
     • Other things can fail besides the scheduler, too.
     • Many classical distributed systems problems are still present (even though Spark does a great job of abstracting most of them away).
  24. Recommendations for Application Developers
     • Gather and read logs, early and often.
       – Issues may occur in smaller environments.
     • Start small: one executor, one host.
     • Grow slowly.
     • Use "pen and paper" to determine expectations for job times.
     • Watch out for stragglers, outliers, and crashes.
     • Don't: start critical job on huge cluster and expect perfect performance the first time.
  25. Thank You. José Soltren, jose@cloudera.com
