8th Hadoop User Group Vienna
@ T-Mobile Austria
September 06, 2017
Hadoop User Group Vienna: Organizer
@StefanDunkler
 Senior Consultant at Hortonworks
 Technical Physics at Vienna UT
 Loves Open Source Technology
https://blog.datahovel.com https://github.com/condla
Hadoop User Group Vienna: Introduction
 Network and Connect to
real Hadoop Users!
 Present technical problems
or solutions!
 Meet the experts!
 Share problems and/or
solutions
 Have a good time! 
“The first elephant in Vienna”
(https://www.wien.gv.at/wiki/index.php/Elefant)
Hadoop User Group Vienna: Agenda
 “Disaster Recovery in the Hadoop Ecosystem:
Preparing for the Improbable”
 Stefan Kupstaitis-Dunkler, Hortonworks
 Highly Scalable Machine Learning and Deep
Learning in Real Time with Apache Kafka’s Streams
API
 Kai Wähner, Confluent
 Pizza, Beer and Networking
Disaster Recovery in the Hadoop
Ecosystem
Preparing for the Improbable
Stefan Kupstaitis-Dunkler, 2017-09-06
Hadoop Disaster
Recovery: Agenda
 Foundations, Considerations and
Terminology
 Disaster Recovery Solution
Scenarios
 DR Capabilities of Selected
Services
 HDFS
 Hive
 HBase
 Kafka
 Policies and Configuration
Disasters you want to be prepared for
Business Services Interruption, Data Loss and Data Theft, caused by:
 Human
 Failure.
 Malicious Intent. (Hackers,…)
 Machine
 Failure.
 Malicious Intent.
 Catastrophes
 Nature
 Malicious Intent
Disasters you want to be prepared for
Business Services Interruption, Data Loss and Data Theft, caused by:
 Human
 Failure.
 Malicious Intent.
 Machine
 Failure.
 Malicious Intent. (not yet…)
 Catastrophes
 Nature
 Malicious Intent
Disasters you want to be prepared for
Business Services Interruption, Data Loss and Data Theft, caused by:
 Human
 Failure.
 Malicious Intent.
 Machine
 Failure.
 Malicious Intent.
 Catastrophes (both cause machine failure)
 Nature
 Malicious Intent (Terrorism,…)
But I thought data was pretty safe in Hadoop?
The bare minimum:
 Data replication across nodes (HDFS, Kafka, Solr,…)
 Rack awareness
 Services High Availability (HA)
 Acking, guaranteed processing, handshakes,…
 Fine grained access control (Apache Knox + Ranger)
 Monitoring + Alerting (Apache Ambari)
 Cybersecurity (Apache Metron)
Let’s talk about apples and pears: Terms and
Definitions
 Disaster Recovery vs. Disaster Prevention
 Full/Delta Backup
 Replication
 Snapshots
Considerations
 Why DR? → Think disaster prevention first…
 List your data sources. What’s the impact of their loss?
 Backup frequency?
 Recovery speed?
 Choose a backup/replication mechanism that fits your business
requirements and your data
 Organize! Prioritize! Generalize!
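Listing data sources and ranking them by loss impact and tolerated data loss (RPO) can be as simple as the following sketch; the source names and numbers are made-up examples, not from the talk:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    impact_of_loss: int   # business impact, 1 (low) .. 5 (critical)
    rpo_minutes: int      # tolerated data loss window in minutes

def prioritize(sources):
    """Most critical, least loss-tolerant sources come first."""
    return sorted(sources, key=lambda s: (-s.impact_of_loss, s.rpo_minutes))

# Hypothetical inventory for illustration only
sources = [
    DataSource("clickstream", impact_of_loss=2, rpo_minutes=1440),
    DataSource("billing", impact_of_loss=5, rpo_minutes=15),
    DataSource("sensor_archive", impact_of_loss=3, rpo_minutes=240),
]

for s in prioritize(sources):
    print(s.name)
```

The ordering then tells you which sources deserve near-real-time replication and which can live with a nightly backup.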
DR Solution Scenarios
There are two disaster recovery solution scenarios: Dual Path and Cluster
Replication. (Diagram: the data sources feed Data Center 1 and Data Center 2,
once via a duplicated ingest path, once via replication between the clusters.)
… and these are their differences
Dual Path
 Same ingest pipeline
 Data is identically processed
 Data is identically stored
 Two active clusters
 All data equally available in both clusters/data
centers
 Needs double resources
 Applications can switch between both clusters
(Piloting features, serving different geographical
regions,…)
Cluster Replication
 Data is ingested in one cluster
 Data is processed in one cluster
 Several jobs (DR processes) are running to keep
the other cluster up-to-date
 Choose which data you want to secure (all or
parts of it)
 Needs less processing resources
 Replication/Sync jobs need to be developed
 Both clusters can be used for different workloads and applications.
Disaster Recovery/Prevention Options for
HDFS and Apache Hive
 Hive = HDFS + Metadata
 Apply HDFS methods
 Backup/replicate the relational metastore DB
 Distcp
 command line tool
 Scheduling via Oozie or Falcon (deprecated)
 Transfer of encrypted data: either decrypt/encrypt or raw
 HDFS Snapshots
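The snapshot-plus-distcp combination mentioned above can be sketched as follows; the directory paths, NameNode addresses and snapshot names are placeholders, and the commands need a running Hadoop cluster:

```shell
# Allow snapshots on the source directory (one-time, as HDFS admin)
hdfs dfsadmin -allowSnapshot /data/sales

# Take a named snapshot before each backup run
hdfs dfs -createSnapshot /data/sales s20170906

# Full copy of the snapshot to the DR cluster
hadoop distcp \
  hdfs://nn-primary:8020/data/sales/.snapshot/s20170906 \
  hdfs://nn-dr:8020/backup/sales

# Later runs can ship only the delta between two snapshots
hadoop distcp -update -diff s20170906 s20170907 \
  hdfs://nn-primary:8020/data/sales \
  hdfs://nn-dr:8020/backup/sales
```

Copying from a snapshot (rather than the live directory) gives distcp a consistent view even while ingest continues.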
Disaster Recovery for Apache HBase
 CopyTable: MapReduce, online (table to table)
 Export: MapReduce, online (table to HDFS)
 Replication: near real time cluster sync
 Snapshots
 HBase Backup: offline
 Performance impact: the online, MapReduce-based methods put load on
the source cluster; snapshots and offline backups are lighter.
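The HBase options listed above map to commands like these; the table name, snapshot name and cluster addresses are placeholders, and everything runs against a live cluster:

```shell
# CopyTable: copy table 'sales' into a (pre-created) table on the DR cluster
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=zk-dr1,zk-dr2,zk-dr3:2181:/hbase sales

# Export: dump 'sales' to an HDFS directory (restore with the Import tool)
hbase org.apache.hadoop.hbase.mapreduce.Export sales /backup/hbase/sales

# Snapshot from the HBase shell, then ship it to the DR cluster
echo "snapshot 'sales', 'sales_s20170906'" | hbase shell
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot sales_s20170906 \
  -copy-to hdfs://nn-dr:8020/hbase
```

Replication, in contrast, is configured once (`add_peer` in the HBase shell plus enabling replication on the column families) and then syncs continuously.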
Configs and Policies: Ambari + Ranger
 Ambari:
 Cluster config usually doesn’t change often.
 Compare automatically (via the Ambari REST interface), sync manually
 Ranger:
 Security policies can be imported/exported as JSON
 Policies can also be synced automatically via the Ranger REST interface
 Security audits are stored in HDFS and can be backed up via distcp
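The "compare automatically, sync manually" step boils down to diffing two config dictionaries. A minimal sketch, with hard-coded sample dicts standing in for the JSON you would fetch from each cluster's Ambari REST interface (property names and values are illustrative):

```python
def diff_configs(primary, secondary):
    """Return properties whose values differ, plus properties missing on either side."""
    changed = {k: (primary[k], secondary[k])
               for k in primary.keys() & secondary.keys()
               if primary[k] != secondary[k]}
    only_primary = primary.keys() - secondary.keys()
    only_secondary = secondary.keys() - primary.keys()
    return changed, only_primary, only_secondary

# Stand-ins for the desired-config payloads of the two clusters
prod = {"dfs.replication": "3", "dfs.blocksize": "134217728",
        "hive.tez.container.size": "4096"}
dr = {"dfs.replication": "2", "dfs.blocksize": "134217728"}

changed, missing_on_dr, extra_on_dr = diff_configs(prod, dr)
print(changed)        # {'dfs.replication': ('3', '2')}
print(missing_on_dr)  # {'hive.tez.container.size'}
```

Run on a schedule, a report like this tells you exactly which properties to sync by hand on the DR side.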
Kafka Mirror Maker
(Diagram: Mirror Maker sits between Kafka Cluster 1 and Kafka Cluster 2,
mirroring topics from one to the other.)
 Mirror Maker is a tool shipped with Apache Kafka
 It acts as a consumer of Kafka cluster 1
 and as a producer to Kafka cluster 2
 Just prepare two configuration files (consumer config +
props of cluster 1 and producer config + props of
cluster 2)
 Start it with a simple start command
./kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config consumer.properties \
  --num.streams 2 \
  --producer.config producer.properties \
  --whitelist="test_topic,test_topic2,test_topic3"
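The two configuration files can stay very small; a minimal sketch, with the broker addresses and group id as placeholders:

```properties
# consumer.properties – points at the source cluster (cluster 1)
bootstrap.servers=cluster1-broker1:9092,cluster1-broker2:9092
group.id=mirror-maker-group
auto.offset.reset=earliest

# producer.properties – points at the target cluster (cluster 2)
bootstrap.servers=cluster2-broker1:9092,cluster2-broker2:9092
acks=all
```

Setting `acks=all` on the producer side trades some throughput for the guarantee that mirrored messages are replicated on the DR cluster before they are acknowledged.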
Now what?
 Test your scenario…
 Regularly!
 Implement automated tests to track performance and
completeness of data.
 Idea: Let’s kill random services in production and see what
happens. After all, Hadoop services and applications can handle
such situations, plus now we have a DR strategy in place that we
trust. Do we?
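A chaos test like the one proposed above can start as small as this sketch; the service list is hypothetical and the "kill" is just a print, standing in for a real stop/start call against the cluster manager's REST API:

```python
import random

# Hypothetical service inventory; in a real harness this would come
# from the cluster manager (e.g. the Ambari REST API).
SERVICES = ["NAMENODE", "DATANODE", "HIVESERVER2",
            "HBASE_MASTER", "KAFKA_BROKER"]

def pick_victim(services, rng):
    """Choose one service to kill in this chaos round."""
    return rng.choice(services)

# Seeded so a failing experiment can be replayed exactly
rng = random.Random(42)
victim = pick_victim(SERVICES, rng)
print(f"killing {victim} ... now verify HA failover and data completeness")
```

The important part is not the killing but the automated checks that run afterwards: did HA take over, and is the data on the DR side still complete?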
Thanks for your attention!
Questions?

More Related Content

What's hot

Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Wei-Chiu Chuang
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Hortonworks
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersXiao Qin
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoopmcsrivas
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 

What's hot (20)

Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
 
10c introduction
10c introduction10c introduction
10c introduction
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop Clusters
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

Similar to Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable

Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Lecture-20.pptx
Lecture-20.pptxLecture-20.pptx
Lecture-20.pptxmohaaalsa
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Big Data Joe™ Rossi
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
20140202 fosdem-nosql-devroom-hadoop-yarn
20140202 fosdem-nosql-devroom-hadoop-yarn20140202 fosdem-nosql-devroom-hadoop-yarn
20140202 fosdem-nosql-devroom-hadoop-yarnDatalayer
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaAshish Thapliyal
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...Amazon Web Services
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsChien Chung Shen
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
A glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyA glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyQA or the Highway
 

Similar to Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable (20)

Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
Lecture-20.pptx
Lecture-20.pptxLecture-20.pptx
Lecture-20.pptx
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
20140202 fosdem-nosql-devroom-hadoop-yarn
20140202 fosdem-nosql-devroom-hadoop-yarn20140202 fosdem-nosql-devroom-hadoop-yarn
20140202 fosdem-nosql-devroom-hadoop-yarn
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and Kafka
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
A glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyA glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika Achary
 

Recently uploaded

HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 

Recently uploaded (20)

HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable

  • 1. 8th Hadoop User Group Vienna @ T-Mobile Austria September 06, 2017
  • 2. Hadoop User Group Vienna: Organizer @StefanDunkler  Senior Consultant at Hortonworks  Technical Physics at Vienna UT  Loves Open Source Technology https://blog.datahovel.com https://github.com/condla
  • 3. Hadoop User Group Vienna: Introduction  Network and Connect to real Hadoop Users!  Present technical problems or solutions!  Meet the experts!  Share problems and/or solutions  Have a good time!  “The first elephant in Vienna” (https://www.wien.gv.at/wiki/index.php/Elefant
  • 4. Hadoop User Group Vienna: Agenda  “Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable”  Stefan Kupstaitis-Dunkler, Hortonworks  Highly Scalable Machine Learning and Deep Learning in Real Time with Apache Kafka’s Streams API  Kai Wähner, Confluent  Pizza, Beer and Networking
  • 5. Disaster Recovery in the Hadoop Ecosystem Preparing for the Improbable Stefan Kupstaitis-Dunkler, 2017-09-06
  • 6. Hadoop Disaster Recovery: Agenda  Foundations, Considerations and Terminology  Disaster Recovery Solution Scenarios  DR Capabilities of Selected Services  HDFS  Hive  HBase  Kafka  Policies and Configuration
  • 7. Disasters you want to be prepared for  Human  Failure.  Malicious Intent. (Hackers,…)  Machine  Failure.  Malicious Intent.  Catastrophes  Nature  Malicious Intent  Business Services Interruption  Data Loss  Data Theft caused by
  • 8. Disasters you want to be prepared for  Human  Failure.  Malicious Intent.  Machine  Failure.  Malicious Intent.  not yet…  Catastrophes  Nature  Malicious Intent  Business Services Interruption  Data Loss  Data Theft caused by
  • 9. Disasters you want to be prepared for  Human  Failure.  Malicious Intent.  Machine  Failure.  Malicious Intent.  Catastrophes (both cause machine failure  Nature  Malicious Intent (Terrorism,…)  Business Services Interruption  Data Loss  Data Theft caused by
  • 10. But I thought data is pretty safe in Hadoop? The bare minimum:  Data replication across nodes (HDFS, Kafka, Solr,…)  Rack awareness  Services High Availability (HA)  Acking, guaranteed processing, handshakes,…  Fine grained access control (Apache Knox + Ranger)  Monitoring + Alerting (Apache Ambari)  Cybersecurity (Apache Metron)
  • 11. Let’s talk about apples and pears: Terms and Definitions  Disaster Recovery vs. Disaster Prevention  Full/Delta Backup  Replication  Snapshots
  • 12. Considerations  Why DR?  Disaster Prevention…  List your data sources? What’s the impact of their loss?  Backup frequency?  Recovery speed?  Choose backup/replication mechanism that fits your business requirements and your data  Organize Prioritize Generalize!
  • 14. There are two disaster recovery solution scenarios… Data Center 1 Data Sources Data Center 2 Data Center 1 Data Sources Data Center 2 Dual Path Cluster Replication
  • 15. … and these are their differences Dual Path  Same ingest pipeline  Data is identically processed  Data is identically stored  Two active clusters  All data equally available in both clusters/data centers  Needs double resources  Applications can switch between both clusters (Piloting features, serving different geographical regions,…) Cluster Replication  Data is ingested in one cluster  Data is processed in one cluster  Several jobs (DR processes) are running to keep the other cluster up-to-date  Choose which data you want to secure (all or parts of it)  Needs less processing resources  Replication/Sync jobs need to be developed  Both clusters can be used for different work loads and applications.
  • 16. Disaster Recovery/Prevention Options for HDFS and Apache Hive
 Hive = HDFS + Metadata
 Apply HDFS methods
 Backup/replicate the relational metastore DB
 Distcp: command line tool
 Scheduling via Oozie or Falcon (deprecated)
 Transfer of encrypted data: either decrypt/encrypt or raw
 HDFS Snapshots
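As a sketch, snapshot-based incremental replication with distcp could look like this. Cluster host names, paths and snapshot names are examples, and `-diff` assumes a previous snapshot `s1` already exists on both sides:

```shell
# One-time setup: allow snapshots on the source directory.
hdfs dfsadmin -allowSnapshot /data/important

# Take a new snapshot before each replication run.
hdfs dfs -createSnapshot /data/important s2

# Copy only the delta between snapshots s1 and s2 to the DR cluster;
# -update transfers just the changed files.
hadoop distcp -update -diff s1 s2 \
    hdfs://cluster1-nn:8020/data/important \
    hdfs://cluster2-nn:8020/data/important
```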
  • 17. Disaster Recovery for Apache HBase
 CopyTable: MapReduce, online (table to table)
 Exports: MapReduce, online (table to HDFS)
 Replication: near-real-time cluster sync
 Snapshots
 HBase Backup: offline
(The slide also rates the performance impact of each method; see the notes below.)
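The methods above map roughly to the following commands (table, snapshot and cluster names are examples):

```shell
# CopyTable: online copy of 'mytable' into the peer cluster.
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --peer.adr=cluster2-zk:2181:/hbase mytable

# Export: online dump of 'mytable' into HDFS
# (then distcp to cluster 2 and load with the Import tool).
hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backups/mytable

# Snapshot: cheap checkpoint that can be exported to the DR cluster.
echo "snapshot 'mytable', 'mytable_snap'" | hbase shell
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    --snapshot mytable_snap --copy-to hdfs://cluster2-nn:8020/hbase
```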
  • 18. Configs and Policies: Ambari + Ranger
 Ambari:
 Cluster config is usually not changed often.
 Compare automatically (via the Ambari REST interface), sync manually
 Ranger:
 Security policies can be imported/exported as JSON
 Policies can also be automatically synced utilizing the Ranger REST interface
 Security audits are stored in HDFS → can be backed up via distcp
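A minimal sketch of exporting policies over the Ranger REST interface, assuming Ranger's public v2 API; the host, credentials and service name are example values:

```shell
# Pull all policies of the Hive repository of cluster 1 as JSON.
curl -u admin:admin \
    "http://ranger1-host:6080/service/public/v2/api/policy?serviceName=cluster1_hive" \
    -o hive_policies.json

# After adapting the service name inside the JSON, each policy can be
# POSTed to the Ranger instance of the DR cluster, e.g.:
# curl -u admin:admin -H "Content-Type: application/json" -X POST \
#     -d @policy.json "http://ranger2-host:6080/service/public/v2/api/policy"
```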
  • 19. Kafka Mirror Maker
[Diagram: Kafka Cluster 1 → Mirror Maker → Kafka Cluster 2]
 Mirror Maker is a service included in Apache Kafka
 It acts as a consumer of Kafka cluster 1 and as a producer to Kafka cluster 2
 Just prepare 2 configuration files (consumer config/props of cluster 1 and producer config/props of cluster 2)
 Start it with a simple command:
./kafka-run-class.sh kafka.tools.MirrorMaker \
    --consumer.config consumer.properties \
    --num.streams 2 \
    --producer.config producer.properties \
    --whitelist="test_topic,test_topic2,test_topic3"
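The two configuration files mentioned above could look roughly like this. Broker host names and settings are examples; very old MirrorMaker versions use `zookeeper.connect` in the consumer config instead of `bootstrap.servers`:

```shell
# consumer.properties: where Mirror Maker consumes from (cluster 1).
cat > consumer.properties <<'EOF'
bootstrap.servers=kafka1-broker1:9092,kafka1-broker2:9092
group.id=mirror-maker-group
auto.offset.reset=earliest
EOF

# producer.properties: where Mirror Maker produces to (cluster 2).
cat > producer.properties <<'EOF'
bootstrap.servers=kafka2-broker1:9092,kafka2-broker2:9092
acks=all
EOF
```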
  • 20. Now what?
 Test your scenario… regularly!
 Implement automated tests to track performance and completeness of data.
 Idea: Let’s kill random services in production and see what happens. After all, Hadoop services and applications can handle such situations, plus now we have a DR strategy in place that we trust. Do we?
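The "kill random services" idea can be sketched as a small drill against the Ambari REST API. The host, credentials, cluster name and service list below are placeholders, and the dry-run default only prints the request instead of executing it:

```shell
# Chaos-style DR drill: stop one randomly chosen service via Ambari.
AMBARI_HOST="ambari.example.com"
CLUSTER_NAME="drcluster"
SERVICES=(HDFS HBASE KAFKA HIVE)

# Pick a random victim service.
VICTIM=${SERVICES[$RANDOM % ${#SERVICES[@]}]}
echo "Selected victim service: $VICTIM"

# Setting a service's state to INSTALLED stops it in Ambari.
CMD=(curl -u admin:admin -H "X-Requested-By: ambari" -X PUT
  -d '{"RequestInfo":{"context":"DR drill"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}'
  "http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER_NAME/services/$VICTIM")

# DRY_RUN=1 (the default) only prints the request; set DRY_RUN=0 to run it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "${CMD[@]}"
else
  "${CMD[@]}"
fi
```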
  • 21. Thanks for your attention! Questions?

Editor's Notes

  1. On this picture: the first elephant in Vienna. This picture has obviously nothing to do with Hadoop… What it has in common with Hadoop is that it arrived pretty late, but it arrived eventually.
  2. I always try to have 1–2 talks or hands-on sessions…
  3. You could categorize natural catastrophes as machine failures, and terrorism as human malicious intent.
  4. Replication to three nodes serves several purposes: prevention of data loss on commodity hardware, parallel processing across multiple jobs, and high availability of data to services → no single point of failure. Rack awareness helps if one rack breaks, but not if the entire data center collapses. High availability of services has nothing to do with data security in general… Well, nobody prevents you from falling asleep on your keyboard, accidentally typing… rm -r *. And we still have the terrorist scenario and entire-data-center failure to worry about…
  5. It’s important that we speak about the same things. Full backup: all data from the beginning of time until now. Delta backup: all data from one certain point in time until another; a series of delta backups can be combined to arrive at a full backup. Replication: in computing, sharing information so as to ensure consistency between redundant resources; backup differs from replication in that it saves a copy of the data unchanged for a long period of time, whereas replicas undergo frequent updates and quickly lose any historical state. Replication is one of the oldest and most important topics in the overall area of distributed systems. Snapshot: contains meta information about a specific state of a database at a certain point in time.
  6. Why do I need to have a DR plan? If it’s not important, then is my data important at all? Which data sources do I want to back up? Maintain a list of all databases, storage types, tables, topics, directories, files and shards: business-critical data in HDFS, Hive, HBase, Kafka, Solr,… and metadata such as service configurations, schemas, security audits and security policies. How often do you want to back up or replicate? Prioritize! How fast do you want to be able to recover? How do you want to back up? (Each service has different mechanisms.) Organize and Generalize: don’t implement a replication job for each and every data source. In the future you might have hundreds of data sources, so try to organize your data sources and generalize the DR processes for each of them.
  7. Dual Path: All data sources are connected to both clusters in both data centers
  8. Disaster Recovery Scenarios If HA of services is not sufficient, DR scenarios and a replica cluster containing the same data need to be established. In general, there are two main DR strategies, “dual path” and “replicate”. In the dual path strategy, the same data from all input feeds is ingested into both clusters, identically processed and stored. The disadvantage of double-processing is equalized by the advantage of having a minimal time delay between the clusters. In the replicate strategy, data is ingested and processed in one cluster and only relevant data, e.g., results and important sources, are replicated in a process that is either batch in certain intervals or real-time – depending on the amounts of data, the kind of service in use and the business requirements.   This decision affects the choice of methods described in the following sections.
  9. Distcp: useful. Snapshots: prevention. Hive = HDFS + metadata: apply HDFS methods and back up/replicate the relational DB.
  10. CopyTable is a MapReduce job that copies the contents of a table online into another table of the same or another cluster. The performance impact is high. CopyTable is not incremental by default, but the jobs can be configured to copy only parts of the table based on a start and end time; changes in the table could potentially be missed. Exports are also MapReduce jobs that write data into HDFS. From there it can be copied to another cluster using distcp and imported using the Import tool. The performance impact is high. Exports are not incremental by default either, but the jobs can be configured to export only parts of the table based on a start and end time; again, changes in the table could potentially be missed. Replication is a way to keep two clusters in sync in near real time. The performance overhead is low and no data can be missed. A big pitfall of replication is that user or application errors corrupting data cannot be undone. Documentation: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hadoop-ha/content/cluster-repl-x-dc.html Snapshots are incremental, but are usually used for creating checkpoints. They can be used to restore a certain state of the data at a certain point in time. Documentation: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hbase_snapshots_guide/content/ch_hbase_snapshots_chapter.html Not feasible for this disaster recovery scenario, but to complete the list, HBase Backup should be mentioned: the full backup requires a shutdown of the Region Servers; then a distcp action can be performed to transfer the data to cluster 2. The advantage is that no changes on the table can be missed.
  11. The cluster configurations should not be replicated automatically for two reasons: automating might consume more time than manually applying the few changes to be expected on both clusters, and wrong configurations caused by human errors should not be propagated to the DR cluster automatically. To keep the configurations in sync, Hortonworks provides a Python script “compare-clusters.py” that creates a report in HTML format. This report shows all configurations of all services installed on both clusters and marks differing values. Ranger security policies can be exported and imported manually as files starting from version 0.7, included in HDP 2.6.x. In earlier versions, it is a valid approach to use the Ranger REST interface to GET all the policies of all services of cluster 1, DELETE all policies of cluster 2 and POST the policies of cluster 1, adapted to the cluster name of cluster 2. A set of scripts showing what such tools could look like, which could also be used as a base for further development efforts, can be found in the subdirectory “rangercli” of the following repository: https://github.com/Condla/dr-tools
  12. https://github.com/Condla/protocols/blob/master/manuals/17-06-02_KAFKA_mirror_maker_example_configuration.md
  13. Setting up tests is a strict requirement: a DR setup only makes sense if it is tested or the responsible persons are trained. Usually, test applications are scheduled in the cluster that are comparable to other productive applications. These test applications can be used to collect, save and analyze cluster and job/application specific metrics. They should have a well-known and easily reproducible input and output to make it easier to check data for completeness, especially when DR is tested. There will always be application-specific test scenarios, thus they need to be developed with the application. However, there are a few generic test cases that should be run on all clusters: Test HDFS is down (a DataNode, a NameNode, the full service). Test another service is down that is critical to applications running in the cluster. Test all services are down. Test all services of one node are down.