The document discusses strategies for data backup and disaster recovery (DR) in Hadoop, highlighting the complexities of managing distributed systems and diverse data types. It emphasizes the importance of recovery time objective (RTO) and recovery point objective (RPO) in determining backup approaches, and outlines various backup architectures like data export, replication, and fan-out writes. The conclusion stresses that backup and DR planning should be integrated from the outset, with applications responsible for handling data backup and restoration due to the unique challenges posed by different systems.
Introduction to Lars George, partner at OpenCore, detailing his background and the focus on Backup and Disaster Recovery (DR) in Hadoop.
Outline of the presentation agenda focusing on context, data backup strategies, and summary.
Explains the definitions of Backup (data recovery using snapshots) and Disaster Recovery (restoring operations post-failure) to minimize business impact.
Defines Recovery Time Objective (RTO) and Recovery Point Objective (RPO), highlighting their role as cost factors in backup strategies.
Discusses complexities of data backup in Hadoop's distributed systems, noting traditional tools may not be effective.
Outlines various failure scenarios in Hadoop, including node degradation, partial and complete failures, and network partitioning.
Highlights the transient nature of data, the importance of persistent states in databases, and strategies for managing data types effectively.
Discusses the importance of data consistency in backups, including challenges with NoSQL components in Hadoop.
Presents an onboarding checklist to determine backup needs, data retention, and RTO/RPO definitions.
Describes various backup methods such as replication, snapshots, and classic backups, emphasizing their limitations and requirements.
Introduces practical backup architectures including data export, replication, and fan-out writes, assessing cost, performance, and impact on RTO/RPO.
Outlines implementation strategies for effective backup via Oozie Workflows and emphasizes starting backup and DR planning early.
Concludes the presentation by thanking the audience and reiterating the importance of Backup and DR in Hadoop.
What is What?
• Backup
• Ability to restore data using previously taken, frozen-in-time data snapshots
• Allows recovering deleted or erroneously modified data
• Usually backups are not fully current, as the most recent changes are not included
• Disaster Recovery (DR)
• Restore business and operations after a complete system failure
• Includes rebuilding the environment and restoring the data from the last (good) backup
• Minimize the impact on the business (financial loss)
6.
Goals and Objectives
Usually, backup and DR are grounded in two conditions:
RTO – Recovery Time Objective
• Time to recover a service
• The hotter the backup data is kept, the shorter the RTO
• At scale, the RTO is foremost a factor of infrastructure
RPO – Recovery Point Objective
• Measures how much data is lost in case of a disastrous failure
• The more often data is backed up, the shorter the RPO
The RPO and RTO are driving cost factors, and their effects multiply (a back-of-the-envelope example follows below)
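For illustration only, a rough sketch of how the two objectives turn into numbers; all figures below are made-up assumptions, not values from the presentation:

```python
# All numbers below are made-up assumptions for illustration.
backup_interval_h = 24     # one backup run per day
backup_duration_h = 2      # how long a backup run takes
rebuild_time_h    = 8      # reinstall/reconfigure the cluster after a disaster
data_volume_tb    = 200    # amount of data to restore
restore_rate_tb_h = 5      # effective restore throughput

# Worst case: the failure hits just before the next backup completes.
rpo_h = backup_interval_h + backup_duration_h
# Worst case: full rebuild plus a complete restore of the data.
rto_h = rebuild_time_h + data_volume_tb / restore_rate_tb_h

print(f"worst-case RPO ~ {rpo_h} h, worst-case RTO ~ {rto_h} h")
```

Halving either number roughly means spending more on backup frequency or restore infrastructure, which is how the two objectives multiply into cost.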
7.
Many Systems
• Hadoop is a platform of many distributed systems
• Simple tools only cover simple topics
• Every system has data and/or meta data
• Amount of data ranges from a few terabytes to multiple petabytes in practice
• A cluster contains a few to hundreds of servers
What do you back up, how often, and how?
Why is backing up data difficult?
• Data at scale is difficult to move around!
• You cannot cheat physics
• The sheer inertia of data requires new approaches
• Move data not at all, or only as much as necessary
• If data is duplicated anyway, can it be used for other purposes as well?
• Multiple clusters with different workloads (Random Access vs. Analytics)
• Traditional backup tools often require standardized APIs
• Hadoop does not necessarily supply those, or they are inefficient here
• Included backup tools in Hadoop are often rudimentary
• Not all scenarios are covered, or are only partially covered
10.
Failure Scenarios
• Node Degradation
• One or more nodes are slowing down or producing an increasing number of errors (and with that, fewer results) – coined “The John Wayne”
• May cause Byzantine errors, which are difficult to identify
Reasons: Failures or bugs in disks, NICs, device drivers, software
Hadoop can handle many such errors, but not all
• Partial Node Failure
• Single (redundant) components are failing completely
• Example: A disk stops working
• Operators can swap the component at runtime
Hadoop is built to handle failures like this
Impact is restricted to the component's share of the total capacity
11.
Failure Scenarios (cont.)
• Node Failure
• Assumes preparation, like enabling HA everywhere or configuring “Rack Awareness”
Reasons: Power or network outage
Hadoop can handle this just fine
• Network Partitioning
• The cluster is split into two or more parts at random points
• Causes the so-called “split brain” problem, where each now-autonomous part has to decide if it must fail, or can continue to serve requests
• Applications need to switch to one of the working parts of the cluster
Hadoop has some support for that, but there are external dependencies
What happens when the parts join the cluster again?
12.
Failure Scenarios (cont.)
• Loss of an entire data center
• Complete loss of a data copy
• Either switch to a warm/hot standby cluster (blue-green deployment)
• Or, rebuild cluster and restore data
Reasons: Power or network outage
Has to be done outside of Hadoop
13.
Data Sources
• Not all Hadoop components have persistent data (or metadata)
• Transient data can (should) be recomputed as needed
• The number of used Hadoop components varies a lot
• An “Onboarding” checklist can help to capture that
• Given a set of requirements the RTO and RPO can be different
• Question: How long does re-computing derived data take?
• Basic Rule: The more you have, the more costly and time consuming it is
• You can always omit parts, as long as everyone is OK with it (for realz!)
• Cost can be capped – but not without consequence (higher RTO)
14.
Databases in Hadoop
• Many components use databases to store their state and metadata for persistence
• The selection of RDBMS may have a substantial impact on that functionality
Never use the “developer option” (e.g. Derby)!
The RDBMS should be highly available (HA)
• Databases should be backed up and archived on a regular basis (see the sketch below)
• But the question often remains: Is this a task of the Hadoop team or the (often central) IT department?
• This also applies to other, external Hadoop stack systems (e.g. Storm)
If possible, delegate to experienced IT team, outside of Hadoop
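As a minimal sketch of such a regular database backup, assuming a MySQL-backed Hive metastore, a database named `metastore`, and credentials in `~/.my.cnf` (all assumptions; a PostgreSQL or Oracle setup, or the central IT backup tooling, would look different):

```python
import datetime
import pathlib
import subprocess

backup_dir = pathlib.Path("/backup/metastore")
backup_dir.mkdir(parents=True, exist_ok=True)

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
outfile = backup_dir / f"metastore-{stamp}.sql"

# Dump the metastore database in a single consistent transaction (InnoDB).
with open(outfile, "w") as fh:
    subprocess.run(
        ["mysqldump", "--single-transaction", "metastore"],
        stdout=fh,
        check=True,
    )
print(f"Hive metastore dumped to {outfile}")
```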
15.
Data Types
There are two main types of data: persisted data and metadata
There is also transient data
• Persisted data covers all user data, stored in HDFS, HBase, Solr, and so on
• Can be accessed using an interface
• Metadata is auxiliary information, helping to make sense of, or being able to access, the user data
• Hive Schemas
• Cluster Information
• Transient data often is stored in temporary files, logs, or streams
16.
Data Consistency
• An often missed (or ignored?) topic, describing what is actually inside a backup
• Is the contained data consistent in itself?
• Some components (NoSQL, including HDFS) cannot mark data across system boundaries in a reliable and predictable manner
• Snapshots may also be of no help as they are taken asynchronously
• Per region server in HBase
• Open blocks are added in HDFS
• Move the task towards the application
• Which application was designed to do that?
• When restoring data, gaps or bulges can form!
• The question is: Who is responsible for handling that?
• You could be tempted to add transactions...
17.
Onboarding Checklist
• Ask what is needed
• How much data?
• How long is retention?
• Where is the data?
• How often?
• Define clear boundaries
• What is RTO and RPO?
Have the user confirm and sign off explicitly! (A small example record follows below.)
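Purely illustrative: the checklist answers can be captured in a structured, reviewable record so the agreed RTO/RPO and the sign-off are explicit. All field names and values below are made up.

```python
from dataclasses import dataclass

@dataclass
class OnboardingEntry:
    dataset: str            # where is the data?
    size_tb: float          # how much data?
    retention_days: int     # how long is retention?
    backup_interval_h: int  # how often is it backed up?
    rto_hours: int          # agreed Recovery Time Objective
    rpo_hours: int          # agreed Recovery Point Objective
    signed_off_by: str      # explicit user sign-off

entry = OnboardingEntry(
    dataset="clickstream",
    size_tb=40.0,
    retention_days=365,
    backup_interval_h=24,
    rto_hours=12,
    rpo_hours=24,
    signed_off_by="data-owner@example.com",
)
print(entry)
```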
18.
Backup Approaches
• Replication
• Copy of data and modifications from one cluster to another
• Some components in Hadoop support this (partially?)
• HBase in near real-time, while HDFS as batch job (distcp tool)
• For HDFS: Basically like the venerable rsync problem
• What do you do with deleted data? How to bootstrap process?
• Snapshots
• Few tools have a built-in snapshot feature
• HDFS and HBase
• Special access to frozen-in-time data
• Using special paths or system tools
• Data is local and needs to be moved
• How do you do this incrementally?
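A minimal sketch of the snapshot-plus-move combination for HDFS, using `hdfs dfs -createSnapshot` and DistCp; paths, snapshot names, and the backup NameNode are placeholders, and the incremental step assumes a Hadoop version whose DistCp supports snapshot diffs (`-diff`), with matching snapshots present on the target:

```python
import subprocess

SRC = "/data/warehouse"                              # snapshottable source directory
DST = "hdfs://backup-nn:8020/backup/warehouse"       # directory on the backup cluster

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# One-time: allow snapshots on the source directory (requires HDFS admin rights).
run(["hdfs", "dfsadmin", "-allowSnapshot", SRC])

# Take a named, frozen-in-time snapshot.
run(["hdfs", "dfs", "-createSnapshot", SRC, "s20240101"])

# Bootstrap: full copy of the first snapshot to the backup cluster.
run(["hadoop", "distcp", f"{SRC}/.snapshot/s20240101", DST])

# Later runs: take the next snapshot and copy only the delta between the two.
# Requires the target to be snapshottable and to hold the older snapshot as well.
run(["hdfs", "dfs", "-createSnapshot", SRC, "s20240102"])
run(["hadoop", "distcp", "-update", "-diff", "s20240101", "s20240102", SRC, DST])
```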
19.
Backup Approaches (cont.)
• Classic Backup
• Storing data on cold media
• Not supplied with Hadoop
• A few components provide their own system tools
• But… Versioned? Complete? Consistent?
• HA and Rack-Awareness
• Does neither cover backup nor DR
• Unless calling the HDFS trash functionality a backup... NOPE!
• Only valid within the cluster, within the same data center
20.
Backup Validation
• After taking a backup, its integrity needs to be checked
• Should consistency also be verified?
• HDFS has typical checks like CRCs
• Database could be restored and checked
• Special test scripts?
• Applications should ideally supply their own verification tools or rule sets
• Make this part of the software engineering task
• Use Jenkins CI as a backup and restore pipeline?
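A minimal verification sketch comparing HDFS file checksums between source and backup cluster; the paths are placeholders, and `hdfs dfs -checksum` only compares meaningfully when block size and checksum settings match on both sides:

```python
import subprocess

def hdfs_checksum(path):
    """Return the checksum string reported by 'hdfs dfs -checksum <path>'."""
    out = subprocess.run(
        ["hdfs", "dfs", "-checksum", path],
        capture_output=True, text=True, check=True,
    )
    # Output format: <path> <algorithm> <checksum>
    return out.stdout.split()[-1]

src = "/data/warehouse/part-00000"
dst = "hdfs://backup-nn:8020/backup/warehouse/part-00000"

if hdfs_checksum(src) == hdfs_checksum(dst):
    print("checksums match")
else:
    raise SystemExit("checksum MISMATCH - do not trust this backup")
```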
21.
So far…
• Backup is a combination of already available techniques, or a special implementation for systems that have no native support
• Snapshots alone only offer local versioning
• Replication is either a hot mirror, or a set of raw data structures that do not allow an instantaneous restoration
• Consistency has to be handled on the application side
• The required RTO and RPO are crucial for how cluster environments have to be built, and should be considered from the get-go
• RTO and RPO vary based on the source and the chosen backup strategy!
• There does not seem to be a complete solution; special implementations are required
Architecture #1 – Export
[Slide graphic: “Data Export” rated for Cost, Latency, Performance, RTO, and RPO]
Concept
• Application writes into and reads from a single cluster
• Export of data to a dedicated storage service
• Cheap storage arrays
• Cloud storage systems (e.g. AWS S3)
• Scheduled to run as a batch job on a regular basis
Strengths:
+ Known architecture
+ Can handle any data type (data & metadata)
+ Cost effective
Weaknesses:
- Commonly slow (throttled WAN speed)
- Data (possibly) inaccessible unless restored first
- High RTO and RPO
[Slide diagram: Application → Cluster A → Export Storage; relative cost: 💵]
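A sketch of the scheduled export job, here as a DistCp run into an S3 bucket via the s3a connector; bucket name, source path, and mapper count are placeholders, and s3a credentials are assumed to be configured already:

```python
import datetime
import subprocess

day = datetime.date.today().isoformat()
src = "/data/warehouse"                                # data to export
dst = f"s3a://my-backup-bucket/warehouse/{day}"        # daily folder in object storage

# Run the export as a parallel batch copy (20 mappers here, purely as an example).
subprocess.run(["hadoop", "distcp", "-m", "20", src, dst], check=True)
```

In practice this would be wrapped in a cron entry or an Oozie coordinator, matching the regular batch schedule mentioned above.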
24.
Architecture #2 – Replication
[Slide graphic: “Data Replication” rated for Cost, Latency, Performance, RTO, and RPO]
Concept
• Application writes into and reads from a single cluster
• Replication of data to a standby cluster
• (Possibly) smaller backup cluster with more storage and fewer CPUs
• Depending on the source, replication can run constantly or as a regular batch job
Strengths:
+ Use of built-in replication (where available)
+ Data accessible on backup cluster
+ Performance a factor of parallelization
Weaknesses:
- Can handle only some data types
- Smaller backup cluster cannot handle all workloads
- RTO and RPO depend on source
[Slide diagram: Application → Cluster A → (replication) → Cluster B; relative cost: 💵💵]
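A sketch of wiring up the built-in HBase replication to the standby cluster by driving the HBase shell from a script; the peer id, ZooKeeper quorum, table, and column family are placeholders, replication must be enabled in the cluster configuration, and HDFS data would still need a separate mechanism such as DistCp:

```python
import subprocess
import tempfile

# Placeholder peer id, ZooKeeper quorum, table, and column family.
hbase_script = """
add_peer '1', CLUSTER_KEY => 'zk1,zk2,zk3:2181:/hbase'
alter 'events', {NAME => 'cf', REPLICATION_SCOPE => '1'}
"""

with tempfile.NamedTemporaryFile("w", suffix=".rb", delete=False) as fh:
    fh.write(hbase_script)
    script_path = fh.name

# The HBase shell executes the commands in the given script file.
subprocess.run(["hbase", "shell", script_path], check=True)
```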
25.
Architecture #3 – Fan-out Writes
[Slide graphic: “Fan-Out Writes” rated for Cost, Latency, Performance, RTO, and RPO]
Concept
• Application writes into and reads from two (or more) clusters at the same time
• Clusters are of same size and capacity, fan-out handled by application
• Could use tools like Kafka, combined with custom (or commercial) middleware
• An ACK requires both clusters to confirm the write (sketched below)
• Consistency could be controlled by application (see Google Spanner and TrueTime)
Strengths:
+ Clusters are independent and active-active
+ Lowest RTO and RPO
+ Application has full control
+ Can be enhanced using other tools
Weaknesses:
- Highest cost
- Complexity on application level
- Validation is difficult
[Slide diagram: Application writes to both Cluster A and Cluster B; relative cost: 💵💵💵]
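A conceptual sketch of the fan-out write with a dual ACK; `write_to_cluster_a` and `write_to_cluster_b` are hypothetical stand-ins for real client calls (HBase put, Kafka produce, etc.), and real middleware would also need retry and repair logic for partial failures:

```python
from concurrent.futures import ThreadPoolExecutor

def write_to_cluster_a(record):
    """Placeholder for the real client call against cluster A."""
    return True

def write_to_cluster_b(record):
    """Placeholder for the real client call against cluster B."""
    return True

def fan_out_write(record):
    # Write to both clusters in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(write_to_cluster_a, record),
                   pool.submit(write_to_cluster_b, record)]
        results = [f.result() for f in futures]  # raises if either write fails
    # Only acknowledge the write once both clusters have confirmed it.
    if not all(results):
        raise RuntimeError("one cluster did not acknowledge - retry or repair needed")
    return "ACK"

print(fan_out_write({"id": 1, "payload": "example"}))
```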
26.
Impact on Business
• The basic scenarios are quite the opposites when it comes to RTO and RPO
• Cost varies greatly, with #3 requiring two (or more) same-size clusters
In practice, any of these scenarios can be seen
[Slide chart: architectures #1–#3 plotted against RTO (high to low) and RPO (low to high)]
Backup Implementation
• Oozie Workflows
• Main workflow that branches into sub-workflows dependent on source types
• Dedicated sub-workflow for each possible source
• RDBMS, HBase, HDFS, Ambari/CM API, etc.
• Configuration through properties files
• Parameterize everything to reuse flows
• Use settings to branch inside the flows
• Initially create a timestamp and format the output directory name per run (see the driver sketch below)
• Can be scheduled as needed
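A sketch of the driver step described above: create the per-run timestamp and output directory name, write the parameters into a properties file, and submit the parameterized Oozie workflow via the CLI. The Oozie URL, workflow path, and property names are placeholders, not from the presentation:

```python
import datetime
import subprocess

# Per-run timestamp used to format the output directory name.
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

props = {
    "nameNode": "hdfs://nn:8020",
    "oozie.wf.application.path": "hdfs://nn:8020/apps/backup/workflow.xml",
    "backupRoot": f"/backup/{stamp}",   # timestamped output directory for this run
    "sources": "hdfs,hbase,rdbms",      # used inside the flow to branch into sub-workflows
}

# Configuration through a properties file, so the same workflow can be reused.
with open("job.properties", "w") as fh:
    for key, value in props.items():
        fh.write(f"{key}={value}\n")

# Submit and run the parameterized workflow.
subprocess.run(
    ["oozie", "job", "-oozie", "http://oozie-host:11000/oozie",
     "-config", "job.properties", "-run"],
    check=True,
)
```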
29.
Summary
Backup and DR must be part of planning and procurement from the start
Many systems handle data differently, requiring special treatment
Data backup and restoration has to be handled by the applications
Commercial offerings are few and not fully featured