Challenges to Error Diagnosis in Hadoop Ecosystems

Transcript of "Error in hadoop"

  1. Challenges to Error Diagnosis in Hadoop Ecosystems
     Jim Li, Siyuan He, Liming Zhu, Xiwei Xu, Min Fu, Len Bass, Anna Liu, An Binh Tran
     NICTA Copyright 2012
     From imagination to impact
  2. About NICTA: National ICT Australia
     • Federal and state funded research company established in 2002
     • Largest ICT research resource in Australia
     • National impact is an important success metric
     • ~700 staff/students working in 5 labs across major capital cities
     • 7 university partners
     • Providing R&D services, knowledge transfer to Australian (and global) ICT industry
     NICTA technology is in over 1 billion mobile phones
  3. Problem
     • Operator invokes some process in cloud (e.g. rolling upgrade or installation)
     • 45 minutes or an hour later – the process fails
       – Usually with an error message
       – Possibly with a silent failure that manifests itself much later
     • Operator must then diagnose failure
     • Problem is most complicated when multiple components are involved.
  4. But aren’t there tools and recipes?
     • Yes – but …
     • Recipes for deployment tools make assumptions about what you want.
     • In many cases, these assumptions are wrong.
     • In these cases, you must troubleshoot installation problems.
     • Troubleshooting is based on examination of generated logs.
  5. What are the difficulties associated with using logs?
     • The system being deployed is an ecosystem of multiple independently developed systems. Each component’s logging is independently determined and not under central control.
       – The events and state deemed worth logging may differ from component to component
     • Results in
       – Sequence of events leading to failure may be difficult to reproduce
       – Missing or contradictory information in combined logs
  6. Our envisioned deployment solution
     • A solution will
       – Execute the correct steps in the correct order
       – Ensure that the execution of each step results in a correct state of the environment
     • Use a process model annotated with assertions to detect incorrect steps or incorrect state
     • The detection of an error will trigger a lookup in a repository that maps symptoms to fault trees to root causes.
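The repository lookup described on this slide might be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: `Branch`, `REPOSITORY`, and `lookup` are hypothetical names, the fault tree is flattened to a list of cause/diagnostic pairs, and the single entry is the HDFS sample error from slide 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    """One branch of a (flattened) fault tree: a possible cause and its check."""
    cause: str
    diagnostic: str


# Hypothetical repository keyed by (source, symptom text).
# The entry reproduces the HDFS sample error from slide 11.
REPOSITORY = {
    ("HDFS", "DataNode is Shutting Down"): [
        Branch("Instance is down", "ping the instance, check the ssh connection"),
        Branch("Access permission", "check authentication keys and ssh connection"),
        Branch("HDFS configuration", "check conf/slaves"),
        Branch("HDFS missing component", "check data node settings and directories"),
    ],
}


def lookup(source, log_line):
    """Return the branches whose symptom text occurs in log_line (case-insensitive)."""
    matches = []
    for (src, symptom), branches in REPOSITORY.items():
        if src == source and symptom.lower() in log_line.lower():
            matches.extend(branches)
    return matches


# Illustrative log line, made up for this sketch (not copied from a real log).
for branch in lookup("HDFS", "ERROR: DataNode is shutting down"):
    print(f"{branch.cause}: {branch.diagnostic}")
```

Keying on the source as well as the symptom text keeps a symptom reported by one component from matching a fault tree that belongs to another.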
  7. Rolling upgrade process model example
     • Attach assertions to process model to test state
     • Use progress within process to determine which assertions to test.
     • This approach restricts root cause determination to a particular step in the process.
     [Figure: process model diagram. Step labels: Update Auto Scaling Group; Sort Instances; Confirm Upgrade Spec; Remove & Deregister Old Instance from ELB; Terminate Old Instance; Wait for ASG to Start New Instance; Register New Instance with ELB]
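One way to picture a process model annotated with assertions is the sketch below. It is a toy simulation under assumptions, not the tool described in the talk: `Step`, `run_process`, and the `state` dictionary are hypothetical, the three step names are borrowed from the diagram labels, and the actions only change counters instead of calling a cloud API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Step:
    name: str
    action: Callable[[State], None]
    # (description, predicate) pairs checked right after the action runs.
    assertions: List = field(default_factory=list)


def run_process(steps, state):
    """Run steps in order; return (step name, failed assertion) or None."""
    for step in steps:
        step.action(state)
        for description, predicate in step.assertions:
            if not predicate(state):
                return step.name, description
    return None


def terminate_old(state):
    state["running"] -= 1


def wait_for_new(state):
    pass  # simulated fault: the replacement instance never starts


def register_new(state):
    state["registered"] += 1


steps = [
    Step("Terminate Old Instance", terminate_old,
         [("one instance fewer is running",
           lambda s: s["running"] == s["desired"] - 1)]),
    Step("Wait for ASG to Start New Instance", wait_for_new,
         [("running count is back to desired",
           lambda s: s["running"] == s["desired"])]),
    Step("Register New Instance with ELB", register_new,
         [("all instances are registered",
           lambda s: s["registered"] == s["desired"])]),
]

print(run_process(steps, {"desired": 4, "running": 4, "registered": 3}))
```

Because the run stops at the first failed assertion, the result names the step in which the state went wrong, which is the restriction of root cause determination that the slide describes.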
  8. This paper
     • Makes a contribution to the envisioned repository
       – Presents 15 examples of symptoms/possible root causes for HBase/Hadoop deployment
     • Provides a classification of errors into
       – Operational
       – Configuration
       – Software
       – Resource
     • Identifies specific error diagnosis challenges in multi-layer ecosystems.
  9. What did we do?
     • We manually deployed HBase/Hadoop on EC2
       – 5 NICTA people from 2 different groups
       – 10 installations in total
     • We diagnosed and recorded errors we discovered
       – With help from a Citibank person
  10. Case study: HBase cluster on Amazon EC2
  11. Sample errors - 1
     • Source: HDFS
     • Logging exception: “DataNode is Shutting Down”
     • Possible causes/diagnostics:
       – Instance is down / ping, check ssh connection
       – Access permission / check authentication keys, ssh connection
       – HDFS configuration / check “conf/slaves”
       – HDFS missing component / check data node settings and directories
  12. Sample errors - 2
     • Source: ZooKeeper
     • Logging exception: “java.net.UnknownHostException”
     • Possible causes/diagnostics:
       – DNS / check DNS configuration
       – Network connection / check with ssh
       – ZooKeeper configuration / check zoo.cfg
       – ZooKeeper status / check processes (PID and jps)
       – Cross-node configuration error / check consistency
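A `java.net.UnknownHostException` means that a hostname could not be resolved, so a first diagnostic is to check that every host named in the ZooKeeper configuration resolves on each node. The sketch below is an illustration, not part of the talk: `hosts_from_zoo_cfg` and `unresolved_hosts` are hypothetical helpers, the configuration text is made up, and the parser only handles `server.N=host:port:port` lines.

```python
import socket


def hosts_from_zoo_cfg(text):
    """Extract hostnames from server.N=host:port:port lines of a zoo.cfg."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("server.") and "=" in line:
            value = line.split("=", 1)[1]
            hosts.append(value.split(":", 1)[0].strip())
    return hosts


def unresolved_hosts(hostnames):
    """Return the hostnames that the local resolver cannot resolve."""
    failed = []
    for name in hostnames:
        try:
            socket.gethostbyname(name)
        except socket.gaierror:
            failed.append(name)
    return failed


# Made-up configuration for illustration; ".invalid" is a reserved TLD
# that is not expected to resolve.
sample_cfg = """tickTime=2000
server.1=localhost:2888:3888
server.2=zk2.invalid:2888:3888
"""

print(unresolved_hosts(hosts_from_zoo_cfg(sample_cfg)))
```

Running the same check on every node and comparing the results also covers the cross-node consistency item, since a name that resolves on one node but not on another points at inconsistent configuration.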
  13. Comments on errors
     • Paper has
       – 15 enumerated exceptions and potential causes
       – Discussion of classification of errors and samples
     • Most useful to non-expert installers
     • Information could potentially be found on
       – Stack Overflow
       – Specific source forums
     • Better to have
       – Consistent form for fault trees
       – Known place to find them
       – Standard environmental description
  14. Different types of errors
     • Operational errors
       – Start up/shutdown errors
       – Artifacts not created or created incorrectly
     • Configuration errors
       – Syntactic errors
       – Cross-system inconsistency
     • Software errors
       – Compatibility errors
       – Bugs in the software
     • Resource errors
       – Resource unavailability or exhaustion
  15. Challenges to troubleshooting from logs
     • Inconsistency among logs
     • Signal to noise ratio
     • Uncertain correlations
  16. Inconsistency among logs
     • IP address is used as ID, but IP addresses can change in the cloud, for example when an instance is restarted.
     • Inconsistent time stamps in a distributed environment, due to network latency, make it difficult to determine the sequence of events.
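One possible mitigation for the IP-as-ID problem is to translate each logged IP address into a stable instance identifier before logs are combined. The sketch below assumes that the operator has recorded when each IP address was assigned to which instance; the `ASSIGNMENTS` table, the instance IDs, and `instance_for` are hypothetical and not taken from the talk.

```python
from bisect import bisect_right

# Hypothetical assignment history: for each IP address, a list of
# (assignment start time in seconds, instance id), sorted by start time.
ASSIGNMENTS = {
    "10.0.0.5": [(0, "instance-a"), (100, "instance-b")],
    "10.0.0.9": [(40, "instance-a")],
}


def instance_for(ip, timestamp):
    """Return the instance that held ip at timestamp, or None if unknown."""
    history = ASSIGNMENTS.get(ip, [])
    starts = [start for start, _ in history]
    idx = bisect_right(starts, timestamp) - 1
    return history[idx][1] if idx >= 0 else None


# The same IP address refers to different instances at different times.
print(instance_for("10.0.0.5", 50), instance_for("10.0.0.5", 150))
```

The lookup still relies on log time stamps, so it inherits the clock inconsistency described in the second bullet; entries logged close to a reassignment remain ambiguous.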
  17. Signal to noise ratio
     • Logs contain huge amount of information
     • Tools exist to collect logs into a central source
       – Scribe
       – Flume
       – Logstash
       – Chukwa
     • Tools that search logs need guidance to filter information
     • We propose an approach that uses a process model to guide diagnosis (to be explained shortly).
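As a rough illustration of how a process model could guide log filtering, the sketch below keeps only the records that fall inside the time window of the failed step, come from components that step touches, and have a warning or error level. Everything here is an assumption made for the example: the `LogRecord` fields, the `filter_for_step` helper, the level names, and the `slack` allowance for clock differences between nodes.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    timestamp: float  # seconds since the start of the process
    component: str
    level: str
    message: str


def filter_for_step(records, start, end, components, slack=5.0,
                    levels=("WARN", "ERROR", "FATAL")):
    """Keep records near [start, end] from the given components and levels."""
    return [
        r for r in records
        if start - slack <= r.timestamp <= end + slack
        and r.component in components
        and r.level in levels
    ]


# Made-up records for illustration only.
records = [
    LogRecord(10.0, "hdfs", "INFO", "heartbeat"),
    LogRecord(62.0, "zookeeper", "ERROR", "java.net.UnknownHostException"),
    LogRecord(63.0, "hbase", "ERROR", "unrelated component"),
    LogRecord(300.0, "zookeeper", "ERROR", "outside the step window"),
]

for record in filter_for_step(records, start=60.0, end=90.0,
                              components={"zookeeper", "hdfs"}):
    print(record.timestamp, record.component, record.message)
```

The step window and component set would come from the process model, while the central collection itself is left to tools such as those listed on the slide.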
  18. Uncertain correlation
     • Between exceptions
       – Connections among exceptions arising from the same cause are difficult to detect.
     • Between component states
       – Dependency relations among component states are not shown in log messages and are difficult to detect.
     • Between events
       – Connections among distributed events are difficult to detect.
     • Between states and events
       – Diagnosis depends on connecting states and events, and these connections may not be obvious from log messages.
  19. Summary
     • Deploying or updating ecosystems is an error-prone activity
     • Determining the root cause of an error is difficult and time-consuming
     • We provided a list of 15 specific errors and their potential root causes for HBase/Hadoop deployment
     • We categorized types of errors and uncertainties in error diagnosis