SlideShare a Scribd company logo
1 of 27
Backup and DR in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
DataWorks Summit Munich 2017
Distributed Problems
About Me
• Partner & Co-Founder at OpenCore
• Before that
• Lars: EMEA Chief Architect at Cloudera (5+ years)
• Hadoop since 2007
• Apache Committer & Apache Member
• HBase (also in PMC)
• Lars: O’Reilly Author: HBase – The Definitive Guide
• Contact
• lars.george@opencore.com
• @larsgeorge
Website: www.opencore.com
Agenda
• Context
• Data Backup Strategies
• Summary
Context
What do you have to look out for?
What is What?
• Backup
• Ability to restore data using previously taken, frozen in time data snapshots
• Allows to recover deleted, or erroneously modified data
• Usually backups are not current, as the most recent is not included
• Disaster Recovery
• Restore business and operations after a complete system failure
• Includes rebuilding the environment and restoring the data from the last (good)
backup
• Minimize the impact on the business (financial loss)
Many Systems
• Hadoop is a platform of many distributed
sytems
• Simple tools only cover simple topics
• Every system has data and/or meta data
• Amount of data ranges from a few terabytes
to multiple petabytes in practice
• A cluster contains few to hundreds of servers
 What do you back up, how often, and how?
2006 2008 2009 2010 2011 2012 2013
Core Hadoop
(HDFS,
MapReduce)
HBase
ZooKeeper
Solr
Pig
Core Hadoop
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
Core Hadoop
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
Core Hadoop
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
Parquet
Sentry
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
The stack evolves and grows continuously!
2007
Solr
Pig
Core Hadoop
Knox
Flink
Parquet
Sentry
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
2014 2015
Kudu
RecordService
Ibis
Falcon
Knox
Flink
Parquet
Sentry
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
Evolution of the Hadoop Platform
Why is backing up data difficult?
• Data at scale is difficult to move around!
• You cannot cheat physics
• The sheer inertia of data requires new approaches
• Do not or only minimally move data as necessary
• If duplicated data, use it for other purposes as well
• Multiple clusters with different workloads
• Traditional backup tools often require standardized APIs
• Hadoop does not supply those necessarily, or they are inefficient here
• Included backup tools in Hadoop are often rudimentary
• Not all scenarios are covered, or are only partially covered
Databases in Hadoop
• Many components use databases to store their state and metadata for
persistency
• The selection of RDBMS may have a substantial impact on that functionality
 Never use the ”developer option” (e.g. Derby)!
 The RDBMS should be highly available (HA)
• Databases should be backed up and archived on a regular basis
• But the question often remains: Is this a task of the Hadoop team or the
(often central) IT department?
• This also applies to other, external Hadoop stack systems (e.g. Storm)
Goals and Objectives
Usually backup and DR is grounded into conditions:
RTO – Recovery Time Objective
• Time to recover a service
• The hotter backup data is kept, the
shorter the RTO
• At scale, the RTO is foremost a
factor of infrastructure
RPO – Recovery Point Objective
• Measures how much data is lost in
case of a disastrous failure
• The more often data is backed up,
the shorter the RPO
 The RPO and RTO are driving cost factors and are multiplied by each other
Failure Scenarios
• Node Degradation
• One or more nodes are slowing down or produce an increasing number of errors
(and with it fewer results) – coined “The John Wayne”
• Mayb cause byzantine errors, which are difficult to identify
 Reasons: Failures or bugs in disks, NICs, device drivers, software
 Hadoop can handle many such errors, but not all
• Partial Node Failure
• Single (redundant) components are failing completely
• Example: A disk stops working
• Operators can swap component at runtime
 Hadoop is built to handle failures like this
 Impact is restricted to the share of component on total capacity
Failure Scenarios (cont.)
• Node Failure
• Assumes preparation, like enabling HA everywhere or configure „Rack Awareness“
 Reasons: Power or network outage
 Hadoop can handle this just fine
• Network Partitioning
• The cluster is split into two or more parts at random points
• Causes the so-called „split brain“ problem, where each now autonomous part has to
decide if it must fail, or can continue to serve request
• Applications need to switch to one of the working parts of the cluster
 Hadoop has some support for that, but there are external dependencies
 What happens when the parts join the cluster again?
Failure Scenarios (cont.)
• Loss of an entire data center
• Complete loss of a data copy
• Either switch to a warm/hot standby cluster (blue-green deployment)
• Or, rebuild cluster and restore data
 Reasons: Power or network outage
 Has to be done outside of Hadoop
Data Sources
• Not all Hadoop components have persistent data (or metadata)
• Transient data can (should) be recomputed as needed
• The number of used Hadoop components varies a lot
• „Onboarding“ checklist can help to capture that
• Given a set of requirements the RTO and RPO can be different
• Question: How long does re-computing derived data take?
• Basic Rule: The more you have, the more costly and time consuming it is
• You can always omit parts, as long as everyone is OK with it (for realz!)
• Cost can be capped – but not without consequence (higher RTO)
Backup Strategies
• Replication
• Copy of data and modifications of one cluster to another
• Basically like the venerable rsync problem
• Some components in Hadoop support this (partially?)
• What do you do with deleted data?
• Snapshots
• Few tools have a built-in snapshot feature
• HDFS and HBase
• Special access to frozen-in-time data
• Using special paths or system tools
• Data is local and needs to be moved
• How do you do this incrementally?
Bakup Strategies (cont.)
• Backup
• Store of data to a cold media
• Not supplied with Hadoop
• A few tools have system tools
• But… Versioned? Complete? Consistent?
• HA and Rack-Awareness
• Does neither cover backup nor DR
• Unless calling the HDFS trash functionality a backup... NOPE!
• Only valid within the cluster, within the same data center
Data Types
 There are two main types of data: persisted data and metadata
 There is also transient data
• Data concerns all user data, stored in HDFS, HBase, Solr, and so on
• Can be accessed using an interface
• Metadata are auxiliary information, helping to make sense of or being to
access the user data
• Hive Schemas
• Cluster Information
• Transient data often is stored in temporary files, logs, or streams
Data Consistency
• An often missed (or ignored?) topic, describing what actually is inside a backup
• Is the contained data consistent in itself?
• Some components (NoSQL, including HDFS) cannot mark data across system
boundaries in a reliable and predictable manner
• Snapshots may also be of no help as they are taken asynchronously
• Per regions server in HBase
• Open blocks are added in HDFS
• Move the task towards the application
• Which application was design to do that?
• When restoring data, gaps or bulges can form!
• Question is: Who is responsible to handle that?
• You could be tempted to add transactions...
Validation
• After taking a backup, its integrity needs to be checked
• Should consistency also be verified?
• HDFS has typical checks like CRCs
• Database could be restored and checked
• Special test scripts?
• Applications should ideally supply their own verification tools or rule sets
• Make this part of the software engineering task
• Use Jenkins CI as a backup und restore pipeline?
So far…
• Backup is a combination of already available techniques, or a special
implementation for systems that have no native support
• Snapshots alone only offer local versioning
• Replication is either a hot mirror, or a set of raw data structures that do not
allow an instantaneous restoration
• Consistency has to be handled on the application side
• The required RTO und RPO is crucial for how cluster environments have to
be built, and should be considered from the get go
• There does not seem to be a complete solution, requiring special
implementations!
Data Backup Strategies
Practical scenarios
Approaches
Cluster A Export
Scenario 1: Cold Export
StorageAnwendungAnwendungApplication
Cluster A Replication
Scenario 2: Replication
AnwendungAnwendungApplication Cluster B
Approaches (cont.)
• Applications write into two
(or more) clusters at the
same time
• Typically using Kafka
• ACK requires for both
clusters to confirm the write
• Could be controlled per
application
• See Google Spanner and
TrueTime
Cluster A
Scenario 3: Simultaneous Writes
AnwendungAnwendungApplication
Cluster B
Impact on Business
• The basic scenario
are polar opposites
• Depending on
complexity extra
layers can be added
• Kafka as a buffer
• Cost varies greatly,
with #3 requiring
two same size
clusters
RTO
RPO
HochNiedrig
Niedrig Hoch
1
23
Summary
Where to go from here?
Summary
Backup and DR must be part of planning and procurement from the start
Many systems handle data differently, requiring special treatment
Data backup and restoration has to be handled by the applications
Commercial offerings are few and not fully featured
Thank You!
@larsgeorge

More Related Content

What's hot

Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata StorageDataWorks Summit/Hadoop Summit
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Kai Wähner
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's EvolutionDataWorks Summit
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For WorkflowTimothy Spann
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistentconfluent
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...
Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...
Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...HostedbyConfluent
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisCapacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisHostedbyConfluent
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
 

What's hot (20)

Hive tuning
Hive tuningHive tuning
Hive tuning
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For Workflow
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...
Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...
Improving fault tolerance and scaling out in Kafka Streams with Bill Bejeck |...
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisCapacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 

Similar to Backup and Disaster Recovery in Hadoop

Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoopch adnan
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Global Business Events
 
Planning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMPlanning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMWASdev Community
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
DataIntensiveComputing.pdf
DataIntensiveComputing.pdfDataIntensiveComputing.pdf
DataIntensiveComputing.pdfBrahmam8
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop clusterFurqan Haider
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 

Similar to Backup and Disaster Recovery in Hadoop (20)

Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
 
Planning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMPlanning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPM
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
DataIntensiveComputing.pdf
DataIntensiveComputing.pdfDataIntensiveComputing.pdf
DataIntensiveComputing.pdf
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Recently uploaded

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Backup and Disaster Recovery in Hadoop

  • 1. Backup and DR in Hadoop Lars George – Partner and Co-Founder @ OpenCore DataWorks Summit Munich 2017 Distributed Problems
  • 2. About Me • Partner & Co-Founder at OpenCore • Before that • Lars: EMEA Chief Architect at Cloudera (5+ years) • Hadoop since 2007 • Apache Committer & Apache Member • HBase (also in PMC) • Lars: O’Reilly Author: HBase – The Definitive Guide • Contact • lars.george@opencore.com • @larsgeorge Website: www.opencore.com
  • 3. Agenda • Context • Data Backup Strategies • Summary
  • 4. Context What do you have to look out for?
  • 5. What is What? • Backup • Ability to restore data using previously taken, frozen in time data snapshots • Allows to recover deleted, or erroneously modified data • Usually backups are not current, as the most recent is not included • Disaster Recovery • Restore business and operations after a complete system failure • Includes rebuilding the environment and restoring the data from the last (good) backup • Minimize the impact on the business (financial loss)
  • 6. Many Systems • Hadoop is a platform of many distributed sytems • Simple tools only cover simple topics • Every system has data and/or meta data • Amount of data ranges from a few terabytes to multiple petabytes in practice • A cluster contains few to hundreds of servers  What do you back up, how often, and how?
  • 7. 2006 2008 2009 2010 2011 2012 2013 Core Hadoop (HDFS, MapReduce) HBase ZooKeeper Solr Pig Core Hadoop Hive Mahout HBase ZooKeeper Solr Pig Core Hadoop Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig Core Hadoop Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop Parquet Sentry Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop The stack evolves and grows continuously! 2007 Solr Pig Core Hadoop Knox Flink Parquet Sentry Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop 2014 2015 Kudu RecordService Ibis Falcon Knox Flink Parquet Sentry Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop Evolution of the Hadoop Platform
  • 8. Why is backing up data difficult? • Data at scale is difficult to move around! • You cannot cheat physics • The sheer inertia of data requires new approaches • Do not or only minimally move data as necessary • If duplicated data, use it for other purposes as well • Multiple clusters with different workloads • Traditional backup tools often require standardized APIs • Hadoop does not supply those necessarily, or they are inefficient here • Included backup tools in Hadoop are often rudimentary • Not all scenarios are covered, or are only partially covered
  • 9. Databases in Hadoop • Many components use databases to store their state and metadata for persistency • The selection of RDBMS may have a substantial impact on that functionality  Never use the ”developer option” (e.g. Derby)!  The RDBMS should be highly available (HA) • Databases should be backed up and archived on a regular basis • But the question often remains: Is this a task of the Hadoop team or the (often central) IT department? • This also applies to other, external Hadoop stack systems (e.g. Storm)
  • 10. Goals and Objectives Usually backup and DR is grounded into conditions: RTO – Recovery Time Objective • Time to recover a service • The hotter backup data is kept, the shorter the RTO • At scale, the RTO is foremost a factor of infrastructure RPO – Recovery Point Objective • Measures how much data is lost in case of a disastrous failure • The more often data is backed up, the shorter the RPO  The RPO and RTO are driving cost factors and are multiplied by each other
  • 11. Failure Scenarios • Node Degradation • One or more nodes are slowing down or produce an increasing number of errors (and with it fewer results) – coined “The John Wayne” • Mayb cause byzantine errors, which are difficult to identify  Reasons: Failures or bugs in disks, NICs, device drivers, software  Hadoop can handle many such errors, but not all • Partial Node Failure • Single (redundant) components are failing completely • Example: A disk stops working • Operators can swap component at runtime  Hadoop is built to handle failures like this  Impact is restricted to the share of component on total capacity
  • 12. Failure Scenarios (cont.) • Node Failure • Assumes preparation, like enabling HA everywhere or configure „Rack Awareness“  Reasons: Power or network outage  Hadoop can handle this just fine • Network Partitioning • The cluster is split into two or more parts at random points • Causes the so-called „split brain“ problem, where each now autonomous part has to decide if it must fail, or can continue to serve request • Applications need to switch to one of the working parts of the cluster  Hadoop has some support for that, but there are external dependencies  What happens when the parts join the cluster again?
  • 13. Failure Scenarios (cont.) • Loss of an entire data center • Complete loss of a data copy • Either switch to a warm/hot standby cluster (blue-green deployment) • Or, rebuild cluster and restore data  Reasons: Power or network outage  Has to be done outside of Hadoop
  • 14. Data Sources • Not all Hadoop components have persistent data (or metadata) • Transient data can (should) be recomputed as needed • The number of used Hadoop components varies a lot • „Onboarding“ checklist can help to capture that • Given a set of requirements the RTO and RPO can be different • Question: How long does re-computing derived data take? • Basic Rule: The more you have, the more costly and time consuming it is • You can always omit parts, as long as everyone is OK with it (for realz!) • Cost can be capped – but not without consequence (higher RTO)
  • 15. Backup Strategies • Replication • Copy of data and modifications of one cluster to another • Basically like the venerable rsync problem • Some components in Hadoop support this (partially?) • What do you do with deleted data? • Snapshots • Few tools have a built-in snapshot feature • HDFS and HBase • Special access to frozen-in-time data • Using special paths or system tools • Data is local and needs to be moved • How do you do this incrementally?
  • 16. Bakup Strategies (cont.) • Backup • Store of data to a cold media • Not supplied with Hadoop • A few tools have system tools • But… Versioned? Complete? Consistent? • HA and Rack-Awareness • Does neither cover backup nor DR • Unless calling the HDFS trash functionality a backup... NOPE! • Only valid within the cluster, within the same data center
  • 17. Data Types  There are two main types of data: persisted data and metadata  There is also transient data • Data concerns all user data, stored in HDFS, HBase, Solr, and so on • Can be accessed using an interface • Metadata are auxiliary information, helping to make sense of or being to access the user data • Hive Schemas • Cluster Information • Transient data often is stored in temporary files, logs, or streams
  • 18. Data Consistency • An often missed (or ignored?) topic, describing what actually is inside a backup • Is the contained data consistent in itself? • Some components (NoSQL, including HDFS) cannot mark data across system boundaries in a reliable and predictable manner • Snapshots may also be of no help as they are taken asynchronously • Per regions server in HBase • Open blocks are added in HDFS • Move the task towards the application • Which application was design to do that? • When restoring data, gaps or bulges can form! • Question is: Who is responsible to handle that? • You could be tempted to add transactions...
  • 19. Validation • After taking a backup, its integrity needs to be checked • Should consistency also be verified? • HDFS has typical checks like CRCs • Database could be restored and checked • Special test scripts? • Applications should ideally supply their own verification tools or rule sets • Make this part of the software engineering task • Use Jenkins CI as a backup und restore pipeline?
  • 20. So far… • Backup is a combination of already available techniques, or a special implementation for systems that have no native support • Snapshots alone only offer local versioning • Replication is either a hot mirror, or a set of raw data structures that do not allow an instantaneous restoration • Consistency has to be handled on the application side • The required RTO und RPO is crucial for how cluster environments have to be built, and should be considered from the get go • There does not seem to be a complete solution, requiring special implementations!
  • 22. Approaches Cluster A Export Scenario 1: Cold Export StorageAnwendungAnwendungApplication Cluster A Replication Scenario 2: Replication AnwendungAnwendungApplication Cluster B
  • 23. Approaches (cont.) • Applications write into two (or more) clusters at the same time • Typically using Kafka • ACK requires for both clusters to confirm the write • Could be controlled per application • See Google Spanner and TrueTime Cluster A Scenario 3: Simultaneous Writes AnwendungAnwendungApplication Cluster B
  • 24. Impact on Business • The basic scenario are polar opposites • Depending on complexity extra layers can be added • Kafka as a buffer • Cost varies greatly, with #3 requiring two same size clusters RTO RPO HochNiedrig Niedrig Hoch 1 23
  • 25. Summary Where to go from here?
  • 26. Summary Backup and DR must be part of planning and procurement from the start Many systems handle data differently, requiring special treatment Data backup and restoration has to be handled by the applications Commercial offerings are few and not fully featured

Editor's Notes

  1. The rapid expansion of the Hadoop ecosystem is further evidence of its meteoric adoption.