8th Hadoop User Group Vienna
@ T-Mobile Austria
September 06, 2017
Hadoop User Group Vienna: Organizer
@StefanDunkler
- Senior Consultant at Hortonworks
- Technical Physics at Vienna UT
- Loves Open Source Technology
https://blog.datahovel.com | https://github.com/condla
Hadoop User Group Vienna: Introduction
- Network and connect to real Hadoop users!
- Present technical problems or solutions!
- Meet the experts!
- Share problems and/or solutions!
- Have a good time!
“The first elephant in Vienna” (https://www.wien.gv.at/wiki/index.php/Elefant)
Hadoop User Group Vienna: Agenda
- “Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable”
  Stefan Kupstaitis-Dunkler, Hortonworks
- “Highly Scalable Machine Learning and Deep Learning in Real Time with Apache Kafka’s Streams API”
  Kai Wähner, Confluent
- Pizza, Beer and Networking
Disaster Recovery in the Hadoop Ecosystem
Preparing for the Improbable
Stefan Kupstaitis-Dunkler, 2017-09-06
Hadoop Disaster Recovery: Agenda
- Foundations, Considerations and Terminology
- Disaster Recovery Solution Scenarios
- DR Capabilities of Selected Services
  - HDFS
  - Hive
  - HBase
  - Kafka
- Policies and Configuration
Disasters you want to be prepared for

Business services interruption, data loss and data theft are caused by:
- Human: failure; malicious intent (hackers, …)
- Machine: failure; malicious intent (→ not yet…)
- Catastrophes (both cause machine failure): nature; malicious intent (terrorism, …)
But I thought data is pretty safe in Hadoop?
The bare minimum:
- Data replication across nodes (HDFS, Kafka, Solr, …): see the sketch after this list
- Rack awareness
- Services High Availability (HA)
- Acking, guaranteed processing, handshakes, …
- Fine-grained access control (Apache Knox + Ranger)
- Monitoring + alerting (Apache Ambari)
- Cybersecurity (Apache Metron)
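As a hedged aside, the replication factor behind that first bullet is just an HDFS setting; a minimal sketch with made-up paths and values:

    # Check a file's replication factor (second column of the listing).
    hdfs dfs -ls /data/important/part-00000

    # Raise replication to 3 for a directory tree and wait (-w)
    # until the NameNode reports the new factor is satisfied.
    hdfs dfs -setrep -w 3 /data/important

    # The cluster-wide default is dfs.replication in hdfs-site.xml.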
Let’s talk about apples and pears: Terms and Definitions
- Disaster Recovery vs. Disaster Prevention
- Full/Delta Backup
- Replication
- Snapshots
Considerations
- Why DR? → Disaster prevention…
- List your data sources. What’s the impact of their loss?
- Backup frequency?
- Recovery speed?
- Choose a backup/replication mechanism that fits your business requirements and your data
- Organize! Prioritize! Generalize!
DR Solution Scenarios

There are two disaster recovery solution scenarios…

[Diagram: in “Dual Path”, the data sources feed Data Center 1 and Data Center 2 directly; in “Cluster Replication”, the data sources feed Data Center 1 only, and Data Center 1 replicates to Data Center 2.]
… and these are their differences

Dual Path
- Same ingest pipeline
- Data is identically processed
- Data is identically stored
- Two active clusters
- All data equally available in both clusters/data centers
- Needs double the resources
- Applications can switch between both clusters (piloting features, serving different geographical regions, …)

Cluster Replication
- Data is ingested in one cluster
- Data is processed in one cluster
- Several jobs (DR processes) are running to keep the other cluster up-to-date
- Choose which data you want to secure (all or parts of it)
- Needs fewer processing resources
- Replication/sync jobs need to be developed
- Both clusters can be used for different workloads and applications
Disaster Recovery/Prevention Options for HDFS and Apache Hive
- Hive = HDFS + metadata:
  - apply the HDFS methods
  - back up/replicate the relational DB
- DistCp:
  - command line tool
  - scheduling via Oozie or Falcon (deprecated)
  - transfers encrypted data either decrypted/re-encrypted or raw
- HDFS Snapshots (see the command sketch below)
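A minimal sketch of combining snapshots with DistCp; the NameNode addresses, paths and snapshot name are made up for illustration:

    # Enable and take an HDFS snapshot on the source cluster.
    hdfs dfsadmin -allowSnapshot /data/warehouse
    hdfs dfs -createSnapshot /data/warehouse s20170906

    # Copy the consistent snapshot view to the DR cluster with DistCp;
    # -update transfers only changed files, -p preserves file attributes.
    hadoop distcp -update -p \
      hdfs://nn1.example.com:8020/data/warehouse/.snapshot/s20170906 \
      hdfs://nn2.example.com:8020/data/warehouse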
Disaster Recovery for Apache HBase
- CopyTable: MapReduce job, online (table to table)
- Exports: MapReduce jobs, online (table to HDFS)
- Replication: near-real-time cluster sync
- Snapshots (see the command sketch below)
- HBase Backup: offline
[The slide also marked the relative performance impact of each option; see the editor’s notes.]
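A hedged sketch of the snapshot route; the table, snapshot and cluster names are invented, and the target path assumes an HDP-style HBase root directory:

    # Take a snapshot of a table from the HBase shell.
    echo "snapshot 'my_table', 'my_table_snap_20170906'" | hbase shell

    # Ship the snapshot to the DR cluster's HBase root directory.
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot my_table_snap_20170906 \
      -copy-to hdfs://nn2.example.com:8020/apps/hbase/data \
      -mappers 4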
Configs and Policies: Ambari + Ranger
- Ambari:
  - Cluster config is usually not changed often.
  - Compare automatically (via the Ambari REST interface), sync manually (see the sketch below).
- Ranger:
  - Security policies can be imported/exported as JSON.
  - Policies can also be synced automatically using the Ranger REST interface.
  - Security audits are stored in HDFS → can be backed up via DistCp.
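A minimal sketch of the “compare automatically” idea; the Ambari hosts, cluster names and credentials are placeholders:

    # Dump each cluster's desired service configurations via the Ambari
    # REST API, then diff the two reports and sync differences by hand.
    curl -s -u admin:admin \
      "http://ambari1.example.com:8080/api/v1/clusters/c1?fields=Clusters/desired_configs" > c1.json
    curl -s -u admin:admin \
      "http://ambari2.example.com:8080/api/v1/clusters/c2?fields=Clusters/desired_configs" > c2.json
    diff c1.json c2.json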
Kafka MirrorMaker

[Diagram: topics stream from Kafka Cluster 1 through MirrorMaker into Kafka Cluster 2.]

- MirrorMaker is a tool included in Apache Kafka.
- It acts as a consumer of Kafka cluster 1 and as a producer to Kafka cluster 2.
- Just prepare 2 configuration files: a consumer config with the properties of cluster 1 and a producer config with the properties of cluster 2 (sketched below).
- Start it with a simple command:
./kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config consumer.properties \
  --num.streams 2 \
  --producer.config producer.properties \
  --whitelist="test_topic,test_topic2,test_topic3"
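The two property files might look like this minimal sketch; the broker addresses and group id are placeholders, and it assumes MirrorMaker runs with the Java “new” consumer:

    # consumer.properties: points at the source cluster (cluster 1)
    bootstrap.servers=kafka1.example.com:9092
    group.id=mirrormaker-dr
    auto.offset.reset=earliest

    # producer.properties: points at the target cluster (cluster 2)
    bootstrap.servers=kafka2.example.com:9092
    acks=all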
Now what?
- Test your scenario… regularly!
- Implement automated tests to track performance and completeness of data.
- Idea: let’s kill random services in production and see what happens (see the sketch below). After all, Hadoop services and applications can handle such situations, plus we now have a DR strategy in place that we trust. Do we?
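One hedged way to act on that idea, using the Ambari REST API to stop a service; the host, cluster name and credentials are placeholders, and you should run this against production only if you really do trust your DR setup:

    # Pick a random Ambari-managed service and stop it, chaos-style.
    AMBARI=http://ambari1.example.com:8080
    CLUSTER=c1
    SERVICE=$(curl -s -u admin:admin "$AMBARI/api/v1/clusters/$CLUSTER/services" \
      | grep -o '"service_name" : "[^"]*"' | cut -d'"' -f4 | shuf -n1)

    echo "Stopping $SERVICE ..."
    curl -s -u admin:admin -H "X-Requested-By: ambari" -X PUT \
      -d '{"RequestInfo":{"context":"DR chaos test"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
      "$AMBARI/api/v1/clusters/$CLUSTER/services/$SERVICE"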
Thanks for your attention!
Questions?
Editor's Notes

  • #4 On this picture: the first elephant in Vienna. This picture obviously has nothing to do with Hadoop… What it has in common with Hadoop is that it arrived pretty late, but it arrived eventually.
  • #5 I always try to have 1-2 talks or hands-on sessions…
  • #10 You could categorize: natural catastrophes as machine failures, and terrorism as human malicious intent.
  • #11 Replication to three nodes for three reasons: prevention of data loss on commodity hardware; parallel processing across multiple jobs; high availability of data to services → no single point of failure. Rack awareness helps if one rack breaks, but not if the entire data center collapses. High availability of services has nothing to do with data security in general… Everybody! Well, nobody prevents you from falling asleep on your keyboard, accidentally typing… rm -r *. And we still have the terrorist and the entire-data-center failure to cover…
  • #12 It’s important that we speak about the same things.
    - Full backup: all data from the beginning of time until now.
    - Delta backup: all data from a certain point in time until another; a bunch of delta backups can be summed up to get to a full backup.
    - Replication: sharing information so as to ensure consistency between redundant resources; one of the oldest and most important topics in the overall area of distributed systems.
    - Snapshot: contains meta information about a specific state of a database at a certain point in time.
    - Disaster recovery vs. prevention: backup differs from replication in that it saves a copy of data unchanged for a long period of time; replicas, on the other hand, undergo frequent updates and quickly lose any historical state.
  • #13 Why do I need to have a DR plan? If it’s not important, then is my data important at all?
    - Which data sources do I want to back up? Maintain a list of all databases, storage types, tables, topics, directories, files, shards: business-critical data in HDFS, Hive, HBase, Kafka, Solr, …, plus metadata (service configurations, schemas, security audits, security policies).
    - How often do you want to back up or replicate? Prioritize!
    - How fast do you want to be able to recover?
    - How do you want to back up? (Each service has different mechanisms.)
    - Organize and generalize: don’t implement a replication job for each and every data source; in the future you might have hundreds of them. So try to organize your data sources and generalize the DR processes for each of them.
  • #15 Dual Path: All data sources are connected to both clusters in both data centers
  • #16 Disaster Recovery Scenarios If HA of services is not sufficient, DR scenarios and a replica cluster containing the same data need to be established. In general, there are two main DR strategies, “dual path” and “replicate”. In the dual path strategy, the same data from all input feeds is ingested into both clusters, identically processed and stored. The disadvantage of double-processing is equalized by the advantage of having a minimal time delay between the clusters. In the replicate strategy, data is ingested and processed in one cluster and only relevant data, e.g., results and important sources, are replicated in a process that is either batch in certain intervals or real-time – depending on the amounts of data, the kind of service in use and the business requirements.   This decision affects the choice of methods described in the following sections.
  • #17 DistCp: useful. Snapshots: prevention. Hive = HDFS + metadata: apply the HDFS methods; back up/replicate the relational DB.
  • #18
    - CopyTable is a MapReduce job that copies the contents of a table online into another table of the same or another cluster. The performance impact is high. CopyTable is not incremental by default, but the jobs can be configured to copy only parts of the table based on a start and end time (see the sketch below); changes in the table could potentially be missed.
    - Exports are also MapReduce jobs that write data into HDFS. From there the data can be copied to another cluster using DistCp and imported using the Import tool. The performance impact is high. Exports are not incremental by default, but the jobs can be configured to export only parts of the table based on a start and end time; changes in the table could potentially be missed.
    - Replication is a way to keep two clusters in sync in near real time. The performance overhead is low and no data can be missed. A big pitfall of replication is that user or application errors corrupting data cannot be undone. Documentation: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hadoop-ha/content/cluster-repl-x-dc.html
    - Snapshots are incremental, but are usually used for creating checkpoints. They can be used to restore a certain state of the data at a certain point in time. Documentation: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hbase_snapshots_guide/content/ch_hbase_snapshots_chapter.html
    - Not feasible for this disaster recovery scenario, but to complete the list: HBase Backup. A full backup requires a shutdown of the RegionServers; then a distcp action can be performed to transfer the data to cluster 2. The advantage is that no changes on the table can be missed.
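  A hedged sketch of such a time-ranged CopyTable run; the table name, ZooKeeper quorum and epoch-millisecond timestamps are made up:

    # Copy only rows written in a given time window to the peer cluster,
    # approximating an incremental copy.
    hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
      --starttime=1504648800000 \
      --endtime=1504735200000 \
      --peer.adr=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase \
      my_table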
  • #19 The cluster configurations should not be replicated automatically, for two reasons: automating might consume more time than manually applying the few changes to be expected on both clusters, and wrong configurations caused by human errors should not be propagated to the DR cluster automatically. To keep the configurations in sync, Hortonworks provides a Python script “compare-clusters.py” that creates a report in HTML format; this report shows all configurations of all services installed on both clusters and marks differing values. Ranger security policies can be exported and imported manually as files starting from version 0.7, included in HDP 2.6.x. In earlier versions, a valid approach is to use the Ranger REST interface to GET all the policies of all services of cluster 1, DELETE all policies of cluster 2, and POST the policies of cluster 1 adapted to the cluster name of cluster 2 (see the sketch below). A set of scripts showing how such tools could look, which could also be used as a base for further development efforts, can be found in the subdirectory “rangercli” of the following repository: https://github.com/Condla/dr-tools
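  A minimal sketch of that GET/POST loop against the Ranger public REST API v2; the hostnames, credentials and per-policy file split are assumptions:

    # Pull all policies from cluster 1's Ranger admin.
    curl -s -u admin:pass \
      "http://ranger1.example.com:6080/service/public/v2/api/policy" > policies.json

    # After deleting cluster 2's old policies and adapting service/cluster
    # names in policies.json, push each adapted policy to cluster 2:
    curl -s -u admin:pass -X POST \
      -H "Content-Type: application/json" -d @one_policy.json \
      "http://ranger2.example.com:6080/service/public/v2/api/policy"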
  • #20 https://github.com/Condla/protocols/blob/master/manuals/17-06-02_KAFKA_mirror_maker_example_configuration.md
  • #21 Setting up tests is a strict requirement: a DR setup only makes sense if it is tested and the responsible persons are trained. Usually, test applications comparable to other productive applications are scheduled in the cluster. These test applications can be used to collect, save and analyze cluster- and job/application-specific metrics. They should have a well-known and easily reproducible input and output to make it easier to check data for completeness, especially when DR is tested. There will always be application-specific test scenarios, thus they need to be developed with the application. However, there are a few generic test cases that should be run on all clusters:
    - Test HDFS is down (a DataNode, a NameNode, the full service).
    - Test another service is down that is critical to applications running in the cluster.
    - Test all services are down.
    - Test all services of one node are down.