SlideShare a Scribd company logo
1 of 19
Download to read offline
Challenges to Error
Diagnosis in Hadoop
Ecosystems
Jim Li, Siyuan He, Liming Zhu,
Xiwei Xu, Min Fu, Len Bass, Anna
Liu, An Binh Tran

NICTA Copyright 2012

From imagination to impact
About NICTA
National ICT Australia
• Federal and state funded research
company established in 2002
• Largest ICT research resource in
Australia
• National impact is an important
success metric
• ~700 staff/students working in 5 labs
across major capital cities
• 7 university partners
• Providing R&D services, knowledge
transfer to Australian (and global) ICT
industry

NICTA technology is
in over 1 billion mobile
phones

2
NICTA Copyright 2012

From imagination to impact
Problem
• Operator invokes some process in cloud (e.g. rolling
upgrade or installation)
• 45 minutes or an hour later – the process fails
– Usually with an error message
– Possibly with a silent failure that manifests itself much
later
• Operator must then
diagnose failure
• Problem is most
complicated when
multiple components
are involved.
NICTA Copyright 2012

From imagination to impact

3
But aren’t there tools and recipes?
• Yes – but …
• Recipes for deployment tools make assumptions
about what you want.
• In many cases, these assumptions are wrong.
• In these cases, you must troubleshoot
installation problems.
• Troubleshooting is based on examination of
generated logs.

NICTA Copyright 2012

From imagination to impact

4
What are the difficulties associated with
using logs?
• The system being deployed is an ecosystem
with multiple independently developed
systems. Each component’s logging is
independently determined and not under
central control.
– Events and state deemed worthy to log
may be different from different
components
• Results in
– Sequence of events leading to failure may
be difficult to reproduce
– Missing or contradictory information in
combined logs From imagination to impact
NICTA Copyright 2012

5
Our envisioned deployment solution
• A solution will
– Execute the correct steps in a correct order
– The execution of a step will result in a correct state of the
environment

• Use a process model annotated with assertions to detect
incorrect steps or incorrect state
• The detection of an error will trigger a look up in a
repository that maps symptoms to fault trees to root
causes.

NICTA Copyright 2012

From imagination to impact

6
Rolling upgrade process model example
• Attach assertions to process
model to test state
• Use progress within process
to determine which
assertions to test.
• This approach restricts root
cause determination to
particular step in the
process.

Update Auto Scaling
Group
Sort Instances
Confirm Upgrade Spec

Remove & Deregister
Old Instance from ELB

Terminate Old
Instance

Wait for ASG to Start
New Instance

Register New Instance
with ELB

NICTA Copyright 2012

From imagination to impact

7
This paper
• Makes a contribution to the envisioned
repository
– Present 15 examples of systems/possible root causes
for Hbase/Hadoop deployment

• Provides a classification of errors into
–
–
–
–

Operational
Configuration
Software
Resource

• Identifies specific error diagnosis challenges in
multi-layer ecosystems.
NICTA Copyright 2012

From imagination to impact

8
What did we do?
• We manually deployed HBase/Hadoop on
EC2
– 5 NICTA people from 2 different groups
– 10 installations in total

• We diagnosed and recorded errors we
discovered
– With help from a Citibank person

NICTA Copyright 2012

From imagination to impact

9
Case study
Hbase Cluster on Amazon EC2

NICTA Copyright 2012

From imagination to impact

10
Sample Errors - 1
• Source – HDFS
• Logging Exception: “DataNode is Shutting Down”
• Possible Causes/diagnostics
– Instance is down/ping ssh connection
– Access permission/check authentication keys, ssh
connection
– HDFS configuration/check “conf/slaves”
– HDFS missing component/check data node settings
and directories

NICTA Copyright 2012

From imagination to impact

11
Sample errors - 2
• Source: Zookeeper
• Logging exception:
“java.net.UnknownHostException”
• Possible causes/diagnostics:
–
–
–
–
–

DSN/check DSN configuration
Network connection/check with ssh
Zookeeper configuration: zoo.cfg
Zookeeper status/processes (PID and JPS)
Cross-node configuration error/check consistency

NICTA Copyright 2012

From imagination to impact

12
Comments on Errors
• Paper has
– 15 enumerated exceptions and potential causes
– Discussion of classification of errors and samples

• Most useful to non-expert installers
• Information could potentially be found on
– Stack Overflow
– Specific source forums

• Better to have
– Consistent form for fault trees
– Known place to find them
– Standard environmental description
NICTA Copyright 2012

From imagination to impact

13
Different types of errors
• Operational errors
– Start up/shutdown errors
– Artifacts not created or created incorrectly

• Configuration errors
– Syntactic errors
– Cross system inconsistency

• Software errrors
– Compatibility errors
– Bugs in the software

• Resource errors
– Resource unavailability or exhaustion
NICTA Copyright 2012

From imagination to impact

14
Challenges to trouble shooting
from logs
• Inconsistency among logs
• Signal to noise ratio
• Uncertain correlations

NICTA Copyright 2012

From imagination to impact

15
Inconsistency among logs
• IP address is used as ID but IP addresses can
change in the cloud. For example, if an instance
is restarted.
• Inconsistent time stamps in a distributed
environment due to network latency makes
determination of a sequence of events difficult.

NICTA Copyright 2012

From imagination to impact

16
Signal to noise ratio
• Logs contain huge amount of information
• Tools exist to collect logs into a central source
–
–
–
–

Scribe
Flume
Logstash
Chukwa

• Tools that search logs need guidance to filter
information
• We propose an approach that uses a process
model to guide diagnosis (to be explained
shortly).
NICTA Copyright 2012

From imagination to impact

17
Uncertain correlation
• Between exceptions
– Connections among exceptions arising from the same
cause are difficult to detect.

• Between component states
– Dependent relations among component states not
shown in log messages and are difficult to detect.

• Between events
– Connections among distributed events are difficult to
detect.

• Between states and events
– Diagnosis depends on connecting state and events
and these may not be obvious from log messages.
NICTA Copyright 2012

From imagination to impact

18
Summary
• Deploying or updating ecosystems is an error
prone activity
• Determining root cause of an error is difficult and
time consuming
• We provided a list of 15 specific errors and their
potential root causes for Hbase/Hadoop
deployment
• We categorized types of errors and uncertainties
in error diagnosis

NICTA Copyright 2012

From imagination to impact

19

More Related Content

What's hot

Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Continuent
 
The Architect's Two Hats
The Architect's Two HatsThe Architect's Two Hats
The Architect's Two HatsBen Stopford
 
Designing large scale distributed systems
Designing large scale distributed systemsDesigning large scale distributed systems
Designing large scale distributed systemsAshwani Priyedarshi
 
Simple Solutions for Complex Problems
Simple Solutions for Complex ProblemsSimple Solutions for Complex Problems
Simple Solutions for Complex ProblemsTyler Treat
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremGrisha Weintraub
 
Cloud computing
Cloud computingCloud computing
Cloud computingAaron Tushabe
 
Base paper ppt-. A load balancing model based on cloud partitioning for the ...
Base paper ppt-. A  load balancing model based on cloud partitioning for the ...Base paper ppt-. A  load balancing model based on cloud partitioning for the ...
Base paper ppt-. A load balancing model based on cloud partitioning for the ...Lavanya Vigrahala
 
load balancing in public cloud ppt
load balancing in public cloud pptload balancing in public cloud ppt
load balancing in public cloud pptKrishna Kumar
 
Design principles of scalable, distributed systems
Design principles of scalable, distributed systemsDesign principles of scalable, distributed systems
Design principles of scalable, distributed systemsTinniam V Ganesh (TV)
 
The Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going DistributedThe Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going DistributedTyler Treat
 
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAsAhmed Misbah
 
Optimal load balancing in cloud computing
Optimal load balancing in cloud computingOptimal load balancing in cloud computing
Optimal load balancing in cloud computingPriyanka Bhowmick
 
Continuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data ManagementContinuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data Managementguest2e11e8
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals Vrushali Lanjewar
 
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...Eswar Publications
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataSchubert Zhang
 
Why Distributed Databases?
Why Distributed Databases?Why Distributed Databases?
Why Distributed Databases?Sargun Dhillon
 
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...SL Corporation
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centersMariaDB plc
 

What's hot (20)

Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
 
The Architect's Two Hats
The Architect's Two HatsThe Architect's Two Hats
The Architect's Two Hats
 
Designing large scale distributed systems
Designing large scale distributed systemsDesigning large scale distributed systems
Designing large scale distributed systems
 
Simple Solutions for Complex Problems
Simple Solutions for Complex ProblemsSimple Solutions for Complex Problems
Simple Solutions for Complex Problems
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theorem
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Base paper ppt-. A load balancing model based on cloud partitioning for the ...
Base paper ppt-. A  load balancing model based on cloud partitioning for the ...Base paper ppt-. A  load balancing model based on cloud partitioning for the ...
Base paper ppt-. A load balancing model based on cloud partitioning for the ...
 
load balancing in public cloud ppt
load balancing in public cloud pptload balancing in public cloud ppt
load balancing in public cloud ppt
 
Design principles of scalable, distributed systems
Design principles of scalable, distributed systemsDesign principles of scalable, distributed systems
Design principles of scalable, distributed systems
 
The Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going DistributedThe Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going Distributed
 
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
 
Optimal load balancing in cloud computing
Optimal load balancing in cloud computingOptimal load balancing in cloud computing
Optimal load balancing in cloud computing
 
Continuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data ManagementContinuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data Management
 
My Dissertation 2016
My Dissertation 2016My Dissertation 2016
My Dissertation 2016
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of Data
 
Why Distributed Databases?
Why Distributed Databases?Why Distributed Databases?
Why Distributed Databases?
 
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
10 Tricks to Ensure Your Oracle Coherence Cluster is Not a "Black Box" in Pro...
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centers
 

Similar to Error in hadoop

POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...Liming Zhu
 
Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments Liming Zhu
 
The quality attribute of upgradability
The quality attribute of upgradabilityThe quality attribute of upgradability
The quality attribute of upgradabilityLen Bass
 
Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Liming Zhu
 
Architectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale SystemsArchitectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale SystemsLen Bass
 
SplunkLive! Austin Customer Presentation - Dell
SplunkLive! Austin Customer Presentation - DellSplunkLive! Austin Customer Presentation - Dell
SplunkLive! Austin Customer Presentation - DellSplunk
 
Net essentials6e ch13
Net essentials6e ch13Net essentials6e ch13
Net essentials6e ch13APSU
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Databricks
 
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...Databricks
 
Fast and effective analysis of architecture diagrams
Fast and effective analysis of architecture diagrams Fast and effective analysis of architecture diagrams
Fast and effective analysis of architecture diagrams GlobalLogic Ukraine
 
Siegel - keynote presentation, 18 may 2013
Siegel  - keynote presentation, 18 may 2013Siegel  - keynote presentation, 18 may 2013
Siegel - keynote presentation, 18 may 2013NeilSiegelslideshare
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...Splunk
 
The 5S Approach to Performance Tuning by Chuck Ezell
The 5S Approach to Performance Tuning by Chuck EzellThe 5S Approach to Performance Tuning by Chuck Ezell
The 5S Approach to Performance Tuning by Chuck EzellDatavail
 
Network Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveNetwork Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveWalid Shaari
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineSagi Brody
 
Why advanced monitoring is key for healthy
Why advanced monitoring is key for healthyWhy advanced monitoring is key for healthy
Why advanced monitoring is key for healthyDenodo
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programsgreenwop
 
Some Oracle AWR observations
Some Oracle AWR observationsSome Oracle AWR observations
Some Oracle AWR observationsConnor McDonald
 

Similar to Error in hadoop (20)

POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
 
Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments
 
The quality attribute of upgradability
The quality attribute of upgradabilityThe quality attribute of upgradability
The quality attribute of upgradability
 
Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...
 
Architectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale SystemsArchitectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale Systems
 
SplunkLive! Austin Customer Presentation - Dell
SplunkLive! Austin Customer Presentation - DellSplunkLive! Austin Customer Presentation - Dell
SplunkLive! Austin Customer Presentation - Dell
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
Net essentials6e ch13
Net essentials6e ch13Net essentials6e ch13
Net essentials6e ch13
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
 
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
 
Fast and effective analysis of architecture diagrams
Fast and effective analysis of architecture diagrams Fast and effective analysis of architecture diagrams
Fast and effective analysis of architecture diagrams
 
Siegel - keynote presentation, 18 may 2013
Siegel  - keynote presentation, 18 may 2013Siegel  - keynote presentation, 18 may 2013
Siegel - keynote presentation, 18 may 2013
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
 
The 5S Approach to Performance Tuning by Chuck Ezell
The 5S Approach to Performance Tuning by Chuck EzellThe 5S Approach to Performance Tuning by Chuck Ezell
The 5S Approach to Performance Tuning by Chuck Ezell
 
Network Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveNetwork Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspective
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
 
Why advanced monitoring is key for healthy
Why advanced monitoring is key for healthyWhy advanced monitoring is key for healthy
Why advanced monitoring is key for healthy
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programs
 
Some Oracle AWR observations
Some Oracle AWR observationsSome Oracle AWR observations
Some Oracle AWR observations
 

More from Len Bass

Devops syllabus
Devops syllabusDevops syllabus
Devops syllabusLen Bass
 
DevOps Syllabus summer 2020
DevOps Syllabus summer 2020DevOps Syllabus summer 2020
DevOps Syllabus summer 2020Len Bass
 
11 secure development
11  secure development 11  secure development
11 secure development Len Bass
 
10 disaster recovery
10 disaster recovery  10 disaster recovery
10 disaster recovery Len Bass
 
9 postproduction
9 postproduction 9 postproduction
9 postproduction Len Bass
 
8 pipeline
8 pipeline 8 pipeline
8 pipeline Len Bass
 
7 configuration management
7 configuration management 7 configuration management
7 configuration management Len Bass
 
6 microservice architecture
6 microservice architecture6 microservice architecture
6 microservice architectureLen Bass
 
5 infrastructure security
5 infrastructure security5 infrastructure security
5 infrastructure securityLen Bass
 
4 container management
4  container management4  container management
4 container managementLen Bass
 
3 the cloud
3 the cloud 3 the cloud
3 the cloud Len Bass
 
1 virtual machines
1 virtual machines1 virtual machines
1 virtual machinesLen Bass
 
2 networking
2 networking2 networking
2 networkingLen Bass
 
Quantum talk
Quantum talkQuantum talk
Quantum talkLen Bass
 
Icsa2018 blockchain tutorial
Icsa2018 blockchain tutorialIcsa2018 blockchain tutorial
Icsa2018 blockchain tutorialLen Bass
 
Experience in teaching devops
Experience in teaching devopsExperience in teaching devops
Experience in teaching devopsLen Bass
 
Understanding blockchains
Understanding blockchainsUnderstanding blockchains
Understanding blockchainsLen Bass
 
What is a blockchain
What is a blockchainWhat is a blockchain
What is a blockchainLen Bass
 
Dev ops and safety critical systems
Dev ops and safety critical systemsDev ops and safety critical systems
Dev ops and safety critical systemsLen Bass
 
My first deployment pipeline
My first deployment pipelineMy first deployment pipeline
My first deployment pipelineLen Bass
 

More from Len Bass (20)

Devops syllabus
Devops syllabusDevops syllabus
Devops syllabus
 
DevOps Syllabus summer 2020
DevOps Syllabus summer 2020DevOps Syllabus summer 2020
DevOps Syllabus summer 2020
 
11 secure development
11  secure development 11  secure development
11 secure development
 
10 disaster recovery
10 disaster recovery  10 disaster recovery
10 disaster recovery
 
9 postproduction
9 postproduction 9 postproduction
9 postproduction
 
8 pipeline
8 pipeline 8 pipeline
8 pipeline
 
7 configuration management
7 configuration management 7 configuration management
7 configuration management
 
6 microservice architecture
6 microservice architecture6 microservice architecture
6 microservice architecture
 
5 infrastructure security
5 infrastructure security5 infrastructure security
5 infrastructure security
 
4 container management
4  container management4  container management
4 container management
 
3 the cloud
3 the cloud 3 the cloud
3 the cloud
 
1 virtual machines
1 virtual machines1 virtual machines
1 virtual machines
 
2 networking
2 networking2 networking
2 networking
 
Quantum talk
Quantum talkQuantum talk
Quantum talk
 
Icsa2018 blockchain tutorial
Icsa2018 blockchain tutorialIcsa2018 blockchain tutorial
Icsa2018 blockchain tutorial
 
Experience in teaching devops
Experience in teaching devopsExperience in teaching devops
Experience in teaching devops
 
Understanding blockchains
Understanding blockchainsUnderstanding blockchains
Understanding blockchains
 
What is a blockchain
What is a blockchainWhat is a blockchain
What is a blockchain
 
Dev ops and safety critical systems
Dev ops and safety critical systemsDev ops and safety critical systems
Dev ops and safety critical systems
 
My first deployment pipeline
My first deployment pipelineMy first deployment pipeline
My first deployment pipeline
 

Recently uploaded

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂşjo
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Error in hadoop

  • 1. Challenges to Error Diagnosis in Hadoop Ecosystems Jim Li, Siyuan He, Liming Zhu, Xiwei Xu, Min Fu, Len Bass, Anna Liu, An Binh Tran NICTA Copyright 2012 From imagination to impact
  • 2. About NICTA National ICT Australia • Federal and state funded research company established in 2002 • Largest ICT research resource in Australia • National impact is an important success metric • ~700 staff/students working in 5 labs across major capital cities • 7 university partners • Providing R&D services, knowledge transfer to Australian (and global) ICT industry NICTA technology is in over 1 billion mobile phones 2 NICTA Copyright 2012 From imagination to impact
  • 3. Problem • Operator invokes some process in cloud (e.g. rolling upgrade or installation) • 45 minutes or an hour later – the process fails – Usually with an error message – Possibly with a silent failure that manifests itself much later • Operator must then diagnose failure • Problem is most complicated when multiple components are involved. NICTA Copyright 2012 From imagination to impact 3
  • 4. But aren’t there tools and recipes? • Yes – but … • Recipes for deployment tools make assumptions about what you want. • In many cases, these assumptions are wrong. • In these cases, you must troubleshoot installation problems. • Troubleshooting is based on examination of generated logs. NICTA Copyright 2012 From imagination to impact 4
  • 5. What are the difficulties associated with using logs? • The system being deployed is an ecosystem with multiple independently developed systems. Each component’s logging is independently determined and not under central control. – Events and state deemed worthy to log may be different from different components • Results in – Sequence of events leading to failure may be difficult to reproduce – Missing or contradictory information in combined logs From imagination to impact NICTA Copyright 2012 5
  • 6. Our envisioned deployment solution • A solution will – Execute the correct steps in a correct order – The execution of a step will result in a correct state of the environment • Use a process model annotated with assertions to detect incorrect steps or incorrect state • The detection of an error will trigger a look up in a repository that maps symptoms to fault trees to root causes. NICTA Copyright 2012 From imagination to impact 6
  • 7. Rolling upgrade process model example • Attach assertions to process model to test state • Use progress within process to determine which assertions to test. • This approach restricts root cause determination to particular step in the process. Update Auto Scaling Group Sort Instances Confirm Upgrade Spec Remove & Deregister Old Instance from ELB Terminate Old Instance Wait for ASG to Start New Instance Register New Instance with ELB NICTA Copyright 2012 From imagination to impact 7
  • 8. This paper • Makes a contribution to the envisioned repository – Present 15 examples of systems/possible root causes for Hbase/Hadoop deployment • Provides a classification of errors into – – – – Operational Configuration Software Resource • Identifies specific error diagnosis challenges in multi-layer ecosystems. NICTA Copyright 2012 From imagination to impact 8
  • 9. What did we do? • We manually deployed HBase/Hadoop on EC2 – 5 NICTA people from 2 different groups – 10 installations in total • We diagnosed and recorded errors we discovered – With help from a Citibank person NICTA Copyright 2012 From imagination to impact 9
  • 10. Case study Hbase Cluster on Amazon EC2 NICTA Copyright 2012 From imagination to impact 10
  • 11. Sample Errors - 1 • Source – HDFS • Logging Exception: “DataNode is Shutting Down” • Possible Causes/diagnostics – Instance is down/ping ssh connection – Access permission/check authentication keys, ssh connection – HDFS configuration/check “conf/slaves” – HDFS missing component/check data node settings and directories NICTA Copyright 2012 From imagination to impact 11
  • 12. Sample errors - 2 • Source: Zookeeper • Logging exception: “java.net.UnknownHostException” • Possible causes/diagnostics: – – – – – DSN/check DSN configuration Network connection/check with ssh Zookeeper configuration: zoo.cfg Zookeeper status/processes (PID and JPS) Cross-node configuration error/check consistency NICTA Copyright 2012 From imagination to impact 12
  • 13. Comments on Errors • Paper has – 15 enumerated exceptions and potential causes – Discussion of classification of errors and samples • Most useful to non-expert installers • Information could potentially be found on – Stack Overflow – Specific source forums • Better to have – Consistent form for fault trees – Known place to find them – Standard environmental description NICTA Copyright 2012 From imagination to impact 13
  • 14. Different types of errors • Operational errors – Start up/shutdown errors – Artifacts not created or created incorrectly • Configuration errors – Syntactic errors – Cross system inconsistency • Software errrors – Compatibility errors – Bugs in the software • Resource errors – Resource unavailability or exhaustion NICTA Copyright 2012 From imagination to impact 14
  • 15. Challenges to trouble shooting from logs • Inconsistency among logs • Signal to noise ratio • Uncertain correlations NICTA Copyright 2012 From imagination to impact 15
  • 16. Inconsistency among logs • IP address is used as ID but IP addresses can change in the cloud. For example, if an instance is restarted. • Inconsistent time stamps in a distributed environment due to network latency makes determination of a sequence of events difficult. NICTA Copyright 2012 From imagination to impact 16
  • 17. Signal to noise ratio • Logs contain huge amount of information • Tools exist to collect logs into a central source – – – – Scribe Flume Logstash Chukwa • Tools that search logs need guidance to filter information • We propose an approach that uses a process model to guide diagnosis (to be explained shortly). NICTA Copyright 2012 From imagination to impact 17
  • 18. Uncertain correlation • Between exceptions – Connections among exceptions arising from the same cause are difficult to detect. • Between component states – Dependent relations among component states not shown in log messages and are difficult to detect. • Between events – Connections among distributed events are difficult to detect. • Between states and events – Diagnosis depends on connecting state and events and these may not be obvious from log messages. NICTA Copyright 2012 From imagination to impact 18
  • 19. Summary • Deploying or updating ecosystems is an error prone activity • Determining root cause of an error is difficult and time consuming • We provided a list of 15 specific errors and their potential root causes for Hbase/Hadoop deployment • We categorized types of errors and uncertainties in error diagnosis NICTA Copyright 2012 From imagination to impact 19