Failing Gracefully 
Aaron Davidson 
07/01/2014
What does “failure” mean for Spark? 
• Spark is a cluster-compute framework 
targeted at analytics workloads 
• Supported failure modes: 
– Transient errors (e.g., network, HDFS outage) 
– Worker machine failures 
• Unsupported failure modes: 
– Systemic exceptions (e.g., bad code, OOMs) 
– Driver machine failure
What makes a recovery model good? 
• A good recovery model should: 
– Be simple 
– Consistently make progress towards 
completion 
– Always be in use (“fail constantly”)
Outline of this talk 
• Spark architecture overview 
• Common failures 
• Special considerations for fault tolerance
Example program 
Goal: Find number of names per “first character” 
sc.textFile("hdfs:/names") 
.map(name => (name.charAt(0), 1)) 
.reduceByKey(_ + _) 
.collect() 
Input names: うえださいとうえしん 
After map(): (う, 1) (さ, 1) (う, 1) 
After reduceByKey(): (う, 2) (さ, 1) 
Result: res0 = [(う,2), (さ,1)]
Spark Execution Model 
1. Create DAG of RDDs to represent 
computation 
2. Create logical execution plan for DAG 
3. Schedule and execute individual tasks
Step 1: Create RDDs 
sc.textFile("hdfs:/names") 
map(name => (name.charAt(0), 1)) 
reduceByKey(_ + _) 
collect()
Step 1: Create RDDs 
HadoopRDD 
map() 
reduceByKey() 
collect()
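
To see this lineage yourself, RDD.toDebugString prints the chain of RDDs (a quick sketch, assuming a running SparkContext named sc):

val counts = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)
// prints something like: ShuffledRDD <- MapPartitionsRDD <- HadoopRDD
println(counts.toDebugString)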
Step 2: Create execution plan 
• Pipeline as much as possible 
• Split into “stages” based on need to 
reorganize data 
Stage 1: HadoopRDD → map() 
Stage 2: reduceByKey() → collect() 
[Diagram: the example data flows through the two stages: うえださいとうえしん → (う, 1) (さ, 1) (う, 1) → (う, 2) (さ, 1) → res0 = [(う,2), (さ,1)]]
Step 3: Schedule tasks 
• Split each stage into tasks 
• A task is data + computation 
• Execute all tasks within a stage before 
moving on
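
Since Stage 1 gets one task per input partition, a quick way to predict its task count (a sketch, same sc as above):

val names = sc.textFile("hdfs:/names")
// one partition per HDFS block by default; one Stage 1 task per partition
println(names.partitions.size)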
Step 3: Schedule tasks 
[Diagram: two columns, Computation and Data. The data column lists the input blocks hdfs:/names/0.gz through hdfs:/names/3.gz; the computation column pairs each task with HadoopRDD → map(). Task 0 reads hdfs:/names/0.gz, Task 1 reads hdfs:/names/1.gz, and so on.]
Step 3: Schedule tasks 
[Animated diagram, built up over several slides: three HDFS nodes hold replicas of the blocks /names/0.gz through /names/3.gz. Over time, the four map tasks (HadoopRDD → map()) are launched, each scheduled on a node that holds a local replica of its input block, until all four have completed.]
The Shuffle 
[Diagram: the stage boundary falls at the shuffle: Stage 1 runs HadoopRDD → map(); Stage 2 runs reduceByKey() → collect()]
The Shuffle 
• Redistributes data among partitions 
• Hash keys into buckets 
• On reduce side, build a hash map within each partition 
[Diagram: map outputs from Stage 1 are hashed into per-reducer buckets consumed by Stage 2] 
Reduce 0: { う => 137, さ => 86, … } 
Reduce 1: { な => 144, る => 12, … } 
…
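
A minimal sketch of the bucketing step (Spark's built-in HashPartitioner behaves essentially like this; the helper name is mine):

// assign a key to one of numPartitions reduce-side buckets
def bucketFor(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod // stay non-negative for negative hashCodes
}

bucketFor('う', 2) // decides which reduce task receives key 'う'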
The Shuffle 
• Pull-based, not push-based 
• Write intermediate files to disk 
[Diagram: Stage 1 tasks write shuffle files to local disk; Stage 2 tasks pull them when they run]
Step 3: Schedule tasks 
[Diagram: after the four map tasks (one per block of /names/0.gz through /names/3.gz) complete, four reduce tasks are scheduled; each of Reduce 0 through Reduce 3 runs reduceByKey → collect on its partition]
When things go wrong 
• Task failure 
• Task taking a long time 
• Executor failure
Task Failure 
• Task fails with exception → retry it 
• RDDs are immutable and “stateless”, so rerunning a task should have the same effect 
– Special logic is required for tasks that write data out (atomic rename) 
– Statelessness is not enforced by the programming model 
sc.parallelize(0 until 100).map { x => 
  // reads and writes a mutable global (JVM system properties): NOT stateless,
  // so a retried task can observe state left behind by a failed attempt
  val myVal = sys.props.getOrElse("foo", "0").toInt + x 
  sys.props("foo") = myVal.toString 
  myVal 
}
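
The atomic-rename trick in miniature, sketched against Hadoop's FileSystem API (the paths and file names are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val attempt = new Path("/out/_temporary/attempt_0/part-00000") // each attempt writes here
val dest = new Path("/out/part-00000")
// ... the task writes its output to `attempt` ...
fs.rename(attempt, dest) // HDFS rename is atomic: exactly one attempt's output wins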
Task Failure 
[Diagram: the map task for /names/2.gz fails partway through; it is retried, and the rerun reads the same input block and produces the same output]
Speculative Execution 
• Try to predict slow or failing tasks, and restart the task on a different machine in parallel 
• Also assumes immutability and statelessness 
• Enable with "spark.speculation=true"
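
For example, a minimal sketch of enabling it programmatically (spark.speculation is the real flag; the multiplier shown is its default, included only for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("FailingGracefully")
  .set("spark.speculation", "true")           // launch speculative copies of slow tasks
  .set("spark.speculation.multiplier", "1.5") // "slow" = this many times the median runtime
val sc = new SparkContext(conf)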
Speculative Execution 
[Animated diagram, built up over several slides: the map task for /names/3.gz runs slowly, so a speculative copy is launched in parallel on another node holding a replica of the block; whichever attempt finishes first is used and the other is cancelled.]
Executor Failure 
• Examine the tasks that ran on that executor: 
– If a task was from the final stage, we’ve already collected its results – don’t rerun it 
– If a task was from an intermediate stage, it must be rerun 
• This may require re-executing a “finished” stage
Step 3: Schedule tasks 
[Animated diagram, built up over several slides: the node holding /names/0.gz and /names/3.gz is lost. Its completed final-stage task (Reduce 3) already delivered its result and is not rerun, but the map task for /names/3.gz, whose intermediate output lived on the lost executor, must be rerun on another node.]
Other Failure Scenarios 
What happens when: 
1. We have a large number of stages? 
2. Our input data is not immutable (e.g. 
streaming)? 
3. Executors had cached data?
1. Dealing with many stages 
Problem: 
Executor loss causes recomputation of all non-final stages 
Solution: 
Checkpoint whole RDD to HDFS periodically 
[Diagram: Stage 1 → Stage 2 → … → Stage 7, with the RDD checkpointed to HDFS partway along the chain so recovery can restart from the checkpoint instead of Stage 1]
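
A minimal sketch of periodic checkpointing (the directory and cadence are illustrative; note that checkpoint() must be called before a job runs on that RDD, and a following action forces the write):

sc.setCheckpointDir("hdfs:/checkpoints") // illustrative checkpoint location
var counts = sc.textFile("hdfs:/names").map(name => (name.charAt(0), 1))
for (i <- 1 to 20) {
  // map drops the partitioner, so each reduceByKey shuffles again: one new stage per iteration
  counts = counts.map { case (k, v) => (k, v) }.reduceByKey(_ + _)
  if (i % 5 == 0) {
    counts.checkpoint() // persist this RDD to HDFS, truncating its lineage
    counts.count()      // run a job so the checkpoint is actually materialized
  }
}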
2. Dealing with lost input data 
Problem: 
Input data is consumed when read (e.g., 
streaming), and re-execution is not possible. 
Solution: 
No general solution today – either use an HDFS 
source or implement it yourself. 
Spark 1.2 roadmap includes a general solution 
which may trade throughput for safety.
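
A sketch of the first option (paths are illustrative): land incoming data durably in HDFS before processing, so a failed task can safely re-read its input:

// assume an ingestion step has already appended the stream to this directory
val batch = sc.textFile("hdfs:/incoming/batch-0042") // re-readable, so safe to recompute
val counts = batch.map(name => (name.charAt(0), 1)).reduceByKey(_ + _)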
3. Loss of cached data 
Problem: 
Executor loss causes cache to become 
incomplete. 
Solution: 
Do nothing – a task caches data locally while it 
runs, causing the cache to stabilize.
3. Loss of cached data 
val file = sc.textFile("s3n://").cache() // 8 blocks 
for (i <- 0 until 10) { 
  file.count() 
} 
[Diagram: two executors; on iteration i = 0 each executor caches the blocks its tasks read (blocks 0, 2, 4, 6 on one; blocks 1, 3, 5, 7 on the other); on i = 1 every block is served from cache]
3. Loss of cached data 
val file = sc.textFile("s3n://").cache() 
for (i <- 0 until 10) { 
  file.count() 
} 
[Diagram: an executor is lost between iterations, leaving some blocks uncached. On the next count() the missing blocks are recomputed and re-cached wherever their tasks run; by i = 3 all 8 blocks are cached again]
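
If recomputing lost blocks is too expensive to tolerate even briefly, a hedged alternative to plain cache() is a replicated storage level (a real Spark API; whether the extra memory is worth it depends on the workload):

import org.apache.spark.storage.StorageLevel
// keep 2 in-memory copies of each block, so losing one executor leaves the cache whole
val file = sc.textFile("s3n://").persist(StorageLevel.MEMORY_ONLY_2)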
Conclusions 
• Spark comes equipped to handle the most 
common forms of failure 
• Special care must be taken in certain 
cases: 
– Highly iterative use-cases (checkpointing) 
– Streaming (atomic data consumption) 
– Violating Spark’s core immutability and 
statelessness assumptions
