SlideShare a Scribd company logo
Hadoop Backup and Disaster
        Recovery
       Jai Ranganathan
         Cloudera Inc
What makes Hadoop different?

            Not much

            EXCEPT
    • Tera- to Peta-bytes of data
      • Commodity hardware
        • Highly distributed
     • Many different services
What needs protection?

  Data Sets:       Applications:       Configuration:
                        System            Knobs and
                   applications (JT,    configurations
Data & Meta-data
                     NN, Region        necessary to run
 about your data
                   Servers, etc) and     applications
     (Hive)
                   User applications
We will focus on….


              Data Sets
but not because the others aren’t important..

  Existing systems & processes can help
  manage Apps & Configuration (to some
                 extent)
Classes of Problems to Plan For
Hardware Failures
 • Data corruption on disk
 • Disk/Node crash
 • Rack failure


User/Application Error
 • Accidental or malicious data deletion
 • Corrupted data writes


Site Failures
 • Permanent site loss – fire, ice, etc
 • Temporary site loss – Network, Power, etc (more common)
Business goals must drive solutions
        RPOs and RTOs are awesome…
But plan for what you care about – how much is
               this data worth?
Failure mode          Risk           Cost

Disk failure          High           Low

Node failure          High           Low

Rack failure         Medium         Medium

Accidental deletes   Medium         Medium

Site loss             Low            High
Basics of HDFS*




          * From Hadoop documentation
Hardware failures – Data Corruption
  Data corruption on disk


 • Checksums metadata for each block stored
   with file
 • If checksums do not match, name node
   discards block and replaces with fresh copy
 • Name node can write metadata to multiple
   copies for safety – write to different file
   systems and make backups
Hardware Failures - Crashes
Disk/Node crash


• Synchronous replicationon disk day- first
     Data corruption saves the
  two replicas always on different hosts
• Hardware failure detected by heartbeat loss
• Name node HA for meta-data
• HDFS automatically re-replicates blocks
  without enough replicas through periodic
  process
Hardware Failures – Rack failure
 Rack failure


 • Configure corruption on diskprovide rack
       Data at least 3 replicas and
   information (
   topology.node.switch.mapping.impl or
   topology.script.file.name)
 • 3rd replica always in a different rack
 • 3rd is important – allows for time window
   between failure and detection to safely exist
Don’t forget metadata


   • Your data is defined by Hive metadata
• But this is easy! SQL backups as per usual for
                     Hive safety
Cool.. Basic hardware is under control
                   Not quite
      • Employ Monitoring to track node health
     • Examine data node block scanner reports
    (http://datanode:50075/blockScannerReport)
             • Hadoop fsck is your friend


Of course, your friendly neighborhood Hadoop vendor
  has tools – Cloudera Manager health checks FTW!
Phew.. Past the easy stuff
              One more small detail…

   Upgrades for HDFS should be treated with care
         On-disk layout changes are risky!

        • Save name node meta-data offsite
• Test upgrade on smaller cluster before pushing out
• Data layout upgrades support roll-back but be safe
• Making backups of all or important data to remote
               location before upgrade!
Application or user errors

                     Permissions scope
                  Users only have access to data they
                         must have access to
  Apply the
principle of
   least            Quota management
 privilege            Name quota: Limits number of
                             files rooted at dir
                      Space quota: Limit bytes of files
                                rooted at dir
Protecting against accidental deletes

                         Trash server
             When enabled, files are deleted into
                            trash
             Enable using fs.trash.interval to set
                        trash interval

                    Keep in mind:
• Trash deletion only works through fs shell –
  programmatic deletes will not employ Trash
• .Trash is a per user directory for restores
Accidental deletes – don’t forget
           metadata



  • Again, regular SQL backups is key
HDFS Snapshots
             What are snapshots?
Snapshots represent state of the system at a point
                    in time
Often implemented using copy-on-write semantics



• In HDFS, append-only fs means only deletes have
                  to be managed
   • Many of the problems with COW are gone!
HDFS Snapshots – coming to a distro
            near you

 Community is hard at work on HDFS snapshots
Expect availability in major distros within the year


    Some implementation details – NameNode
                  snapshotting:
         • Very fast snapping capability
            • Consistency guarantees
      • Restores need to perform data copy
• .snapshot directories for access to individual files
What can HDFS Snapshots do for you?


  • Handles user/application data corruption
         • Handles accidental deletes
   • Can also be used for Test/Dev purposes!
HBase snapshots

            Oh hello, HBase!
Very similar construct to HDFS snapshots
               COW model

               • Fast snaps
        • Consistent snapshots
      • Restores still need a copy
    (hey, at least we are consistent)
Hive metadata
   The recurring theme of data + meta-data

Ideally, metadata backed up in the same flow as the
                      core data
     Consistency of data and metadata is really
                     important
Management of snapshots
Space considerations:

• % of cluster for snapshots
• Number of snapshots
• Alerting on space issues

Scheduling backups:

• Time based
• Workflow based
Great… Are we done?

        Don’t forget Roger Duronio!

Principle of least privilege still matters…
Disaster Recovery


  Datacenter A              Datacenter B




HDFS   Hive   HBase
Teeing vs Copying
         Teeing                     Copying

                               Data is copied from
 Send data during ingest
                            production to replica as a
 phase to production and
                               separate step after
     replica clusters
                                   processing
• Time delay is minimal
                           • Consistent data
  between clusters
                             between both sites
• Bandwidth required
                           • Process once only
  could be larger
                           • Time delay for RPO
• Requires re-processing
                             objectives to do
  data on both sides
                             incremental copy
• No consistency between
                           • More bandwidth
  sites
                             needed
Recommendations?


       Scenario dependent
                But
Generally prefer copying over teeing
How to replicate – per service


HDFS                   HBase                                 Hive
       Teeing:
                               Teeing:
       Flume and                                                            Teeing:
                               Application
       Sqoop support                                                        NA
                               level teeing
       teeing


       Copying:
                               Copying:                                     Copying:
       DistCP for
       copying                 HBase                                        Database
                               replication                                  import/export*




                                              * Database import/export isn’t the full story
Hive metadata
   The recurring theme of data + meta-data

Ideally, metadata backed up in the same flow as the
                      core data
     Consistency of data and metadata is really
                     important
Key considerations for large data
                   movement
•   Is your data compressed?
     – None of the systems support compression on the wire natively
     – WAN accelerators can help but cost $$

•   Do you know your bandwidth needs?
     – Initial data load
     – Daily ingest rate – Maintain historical information

•   Do you know your network security setup?
     – Data nodes & Region Servers talk to each other – they need to be able to have network connectivity

•   Have you configured security appropriately?
     – Kerberos support for cross-realm trust is challenging

•   What about cross-version copying?
     – Can’t always have both clusters be same version – but this is not trivial
Management of replications
Scheduling replication jobs

• Time based
• Workflow based – Kicked off from Oozie script?

Prioritization

• Keep replications in a separate scheduler group and
  dedicate capacity to replication jobs
• Don’t schedule more map tasks than can handle
  available network bandwidth between sites
Secondary configuration and usage
Hardware considerations
• Denser disk configurations acceptable on remote site
  depending on workload goals – 4 TB disks vs 2 TB disks, etc
• Fewer nodes are typical – consider replicating only critical
  data. Be careful playing with replication factors

Usage considerations
• Physical partitioning means a great place for ad-hoc
  analytics
• Production workloads continue to run on core cluster but
  ad-hoc analytics on replica cluster
• For HBase, all clusters can be used for data serving!
What about external systems?

• Backing up to external systems is a 1 way
  street with large data volumes

• Can’t do useful processing on the other side

• Cost of hadoop storage is fairly low, especially
  if you can drive work on it
Summary
• It can be done!

• Lots of gotchas and details to track in the process

• We haven’t even talked about applications and
  configuration!

• Failure workflows are important too – testing,
  testing, testing
Cloudera Enterprise BDR

CLOUDERA ENTERPRISE
CLOUDERA MANAGER

         SELECT                   CONFIGURE                  SYNCHRONIZE                   MONITOR


                                          DISASTER RECOVERY MODULE



CDH



                   HDFS DISTRIBUTED REPLICATION                            HIVE METASTORE REPLICATION
                      HIGH PERFORMANCE REPLICATION                         THE ONLY DISASTER RECOVERY SOLUTION
                            USING MAPREDUCE                                           FOR METADATA

         HDFS                                                   HIVE




                                                                                                                 34

More Related Content

What's hot

Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
Cassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ NetflixCassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ Netflix
nkorla1share
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
David Groozman
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
Isheeta Sanghi
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
DataWorks Summit
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
Ceph Community
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Databricks
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 

What's hot (20)

Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Cassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ NetflixCassandra Data Modeling - Practical Considerations @ Netflix
Cassandra Data Modeling - Practical Considerations @ Netflix
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 

Similar to Hadoop Backup and Disaster Recovery

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
clairvoyantllc
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
Derek Chen
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
Ovidiu Dimulescu
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Chapter2.pdf
Chapter2.pdfChapter2.pdf
Chapter2.pdf
WasyihunSema2
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
AmirReza Mohammadi
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
Kelly Technologies
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
Joe Alex
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
Data Con LA
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
VMware Tanzu
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataCyanny LIANG
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
Michael Stack
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfsNAVER D2
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 

Similar to Hadoop Backup and Disaster Recovery (20)

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Chapter2.pdf
Chapter2.pdfChapter2.pdf
Chapter2.pdf
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 

Recently uploaded (20)

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 

Hadoop Backup and Disaster Recovery

  • 1. Hadoop Backup and Disaster Recovery Jai Ranganathan Cloudera Inc
  • 2. What makes Hadoop different? Not much EXCEPT • Tera- to Peta-bytes of data • Commodity hardware • Highly distributed • Many different services
  • 3. What needs protection? Data Sets: Applications: Configuration: System Knobs and applications (JT, configurations Data & Meta-data NN, Region necessary to run about your data Servers, etc) and applications (Hive) User applications
  • 4. We will focus on…. Data Sets but not because the others aren’t important.. Existing systems & processes can help manage Apps & Configuration (to some extent)
  • 5. Classes of Problems to Plan For Hardware Failures • Data corruption on disk • Disk/Node crash • Rack failure User/Application Error • Accidental or malicious data deletion • Corrupted data writes Site Failures • Permanent site loss – fire, ice, etc • Temporary site loss – Network, Power, etc (more common)
  • 6. Business goals must drive solutions RPOs and RTOs are awesome… But plan for what you care about – how much is this data worth? Failure mode Risk Cost Disk failure High Low Node failure High Low Rack failure Medium Medium Accidental deletes Medium Medium Site loss Low High
  • 7. Basics of HDFS* * From Hadoop documentation
  • 8. Hardware failures – Data Corruption Data corruption on disk • Checksums metadata for each block stored with file • If checksums do not match, name node discards block and replaces with fresh copy • Name node can write metadata to multiple copies for safety – write to different file systems and make backups
  • 9. Hardware Failures - Crashes Disk/Node crash • Synchronous replicationon disk day- first Data corruption saves the two replicas always on different hosts • Hardware failure detected by heartbeat loss • Name node HA for meta-data • HDFS automatically re-replicates blocks without enough replicas through periodic process
  • 10. Hardware Failures – Rack failure Rack failure • Configure corruption on diskprovide rack Data at least 3 replicas and information ( topology.node.switch.mapping.impl or topology.script.file.name) • 3rd replica always in a different rack • 3rd is important – allows for time window between failure and detection to safely exist
  • 11. Don’t forget metadata • Your data is defined by Hive metadata • But this is easy! SQL backups as per usual for Hive safety
  • 12. Cool.. Basic hardware is under control Not quite • Employ Monitoring to track node health • Examine data node block scanner reports (http://datanode:50075/blockScannerReport) • Hadoop fsck is your friend Of course, your friendly neighborhood Hadoop vendor has tools – Cloudera Manager health checks FTW!
  • 13. Phew.. Past the easy stuff One more small detail… Upgrades for HDFS should be treated with care On-disk layout changes are risky! • Save name node meta-data offsite • Test upgrade on smaller cluster before pushing out • Data layout upgrades support roll-back but be safe • Making backups of all or important data to remote location before upgrade!
  • 14. Application or user errors Permissions scope Users only have access to data they must have access to Apply the principle of least Quota management privilege Name quota: Limits number of files rooted at dir Space quota: Limit bytes of files rooted at dir
  • 15. Protecting against accidental deletes Trash server When enabled, files are deleted into trash Enable using fs.trash.interval to set trash interval Keep in mind: • Trash deletion only works through fs shell – programmatic deletes will not employ Trash • .Trash is a per user directory for restores
  • 16. Accidental deletes – don’t forget metadata • Again, regular SQL backups is key
  • 17. HDFS Snapshots What are snapshots? Snapshots represent state of the system at a point in time Often implemented using copy-on-write semantics • In HDFS, append-only fs means only deletes have to be managed • Many of the problems with COW are gone!
  • 18. HDFS Snapshots – coming to a distro near you Community is hard at work on HDFS snapshots Expect availability in major distros within the year Some implementation details – NameNode snapshotting: • Very fast snapping capability • Consistency guarantees • Restores need to perform data copy • .snapshot directories for access to individual files
  • 19. What can HDFS Snapshots do for you? • Handles user/application data corruption • Handles accidental deletes • Can also be used for Test/Dev purposes!
  • 20. HBase snapshots Oh hello, HBase! Very similar construct to HDFS snapshots COW model • Fast snaps • Consistent snapshots • Restores still need a copy (hey, at least we are consistent)
  • 21. Hive metadata The recurring theme of data + meta-data Ideally, metadata backed up in the same flow as the core data Consistency of data and metadata is really important
  • 22. Management of snapshots Space considerations: • % of cluster for snapshots • Number of snapshots • Alerting on space issues Scheduling backups: • Time based • Workflow based
  • 23. Great… Are we done? Don’t forget Roger Duronio! Principle of least privilege still matters…
  • 24. Disaster Recovery Datacenter A Datacenter B HDFS Hive HBase
  • 25. Teeing vs Copying Teeing Copying Data is copied from Send data during ingest production to replica as a phase to production and separate step after replica clusters processing • Time delay is minimal • Consistent data between clusters between both sites • Bandwidth required • Process once only could be larger • Time delay for RPO • Requires re-processing objectives to do data on both sides incremental copy • No consistency between • More bandwidth sites needed
  • 26. Recommendations? Scenario dependent But Generally prefer copying over teeing
  • 27. How to replicate – per service HDFS HBase Hive Teeing: Teeing: Flume and Teeing: Application Sqoop support NA level teeing teeing Copying: Copying: Copying: DistCP for copying HBase Database replication import/export* * Database import/export isn’t the full story
  • 28. Hive metadata The recurring theme of data + meta-data Ideally, metadata backed up in the same flow as the core data Consistency of data and metadata is really important
  • 29. Key considerations for large data movement • Is your data compressed? – None of the systems support compression on the wire natively – WAN accelerators can help but cost $$ • Do you know your bandwidth needs? – Initial data load – Daily ingest rate – Maintain historical information • Do you know your network security setup? – Data nodes & Region Servers talk to each other – they need to be able to have network connectivity • Have you configured security appropriately? – Kerberos support for cross-realm trust is challenging • What about cross-version copying? – Can’t always have both clusters be same version – but this is not trivial
  • 30. Management of replications Scheduling replication jobs • Time based • Workflow based – Kicked off from Oozie script? Prioritization • Keep replications in a separate scheduler group and dedicate capacity to replication jobs • Don’t schedule more map tasks than can handle available network bandwidth between sites
  • 31. Secondary configuration and usage Hardware considerations • Denser disk configurations acceptable on remote site depending on workload goals – 4 TB disks vs 2 TB disks, etc • Fewer nodes are typical – consider replicating only critical data. Be careful playing with replication factors Usage considerations • Physical partitioning means a great place for ad-hoc analytics • Production workloads continue to run on core cluster but ad-hoc analytics on replica cluster • For HBase, all clusters can be used for data serving!
  • 32. What about external systems? • Backing up to external systems is a 1 way street with large data volumes • Can’t do useful processing on the other side • Cost of hadoop storage is fairly low, especially if you can drive work on it
  • 33. Summary • It can be done! • Lots of gotchas and details to track in the process • We haven’t even talked about applications and configuration! • Failure workflows are important too – testing, testing, testing
  • 34. Cloudera Enterprise BDR CLOUDERA ENTERPRISE CLOUDERA MANAGER SELECT CONFIGURE SYNCHRONIZE MONITOR DISASTER RECOVERY MODULE CDH HDFS DISTRIBUTED REPLICATION HIVE METASTORE REPLICATION HIGH PERFORMANCE REPLICATION THE ONLY DISASTER RECOVERY SOLUTION USING MAPREDUCE FOR METADATA HDFS HIVE 34

Editor's Notes

  1. Data movement is expensiveHardware more likely to failMore complex interactions in distributed environmentEach service requires different hand-holding
  2. Keep in mind that configuration may not even make sense to replicate – remote side may have different configuration options
  3. Data is split into blocks: Default 128 MBBlocks are replicated: Default: 3 timesHDFS is rack aware
  4. Cloudera Manager helps with replication by managing versions as well
  5. Cross-version managementImproveddistcpHive export/import with updatesSimple UI