Hadoop Backup and Disaster Recovery
Jai Ranganathan
Cloudera Inc
What makes Hadoop different?

Not much

EXCEPT
• Tera- to peta-bytes of data
• Commodity hardware
• Highly distributed
• Many different services
What needs protection?

Data Sets: data & meta-data about your data (Hive)
Applications: system applications (JT, NN, Region Servers, etc.) and user applications
Configuration: knobs and configurations necessary to run applications
We will focus on…

Data Sets

…but not because the others aren’t important:
existing systems & processes can help manage apps & configuration (to some extent)
Classes of Problems to Plan For
Hardware Failures
 • Data corruption on disk
 • Disk/Node crash
 • Rack failure


User/Application Error
 • Accidental or malicious data deletion
 • Corrupted data writes


Site Failures
 • Permanent site loss – fire, ice, etc
 • Temporary site loss – Network, Power, etc (more common)
Business goals must drive solutions

RPOs and RTOs are awesome…
But plan for what you care about – how much is this data worth?
Failure mode         Risk     Cost
Disk failure         High     Low
Node failure         High     Low
Rack failure         Medium   Medium
Accidental deletes   Medium   Medium
Site loss            Low      High
Basics of HDFS*

[Diagram of HDFS architecture: data is split into blocks (default 128 MB), each block replicated (default 3 times) across rack-aware data nodes]

* From Hadoop documentation
Hardware failures – Data Corruption
Data corruption on disk

• Checksum metadata for each block is stored with the file
• If checksums do not match, the name node discards the block and replaces it with a fresh copy
• The name node can write its metadata to multiple copies for safety – write to different file systems and make backups
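
A minimal sketch of that multi-copy metadata setup, assuming a Hadoop 1.x hdfs-site.xml (the key was renamed dfs.namenode.name.dir in Hadoop 2); the local and NFS paths are illustrative:

    <!-- hdfs-site.xml: write name node metadata to two directories,
         one on local disk and one on a remote NFS mount -->
    <property>
      <name>dfs.name.dir</name>
      <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
    </property>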
Hardware Failures – Crashes
Disk/Node crash

• Synchronous replication saves the day – the first two replicas are always on different hosts
• Hardware failure detected by heartbeat loss
• Name node HA for meta-data
• HDFS automatically re-replicates blocks without enough replicas through a periodic process
Hardware Failures – Rack failure
Rack failure

• Configure at least 3 replicas and provide rack information (topology.node.switch.mapping.impl or topology.script.file.name)
• 3rd replica always in a different rack
• The 3rd replica is important – it allows a safe time window between failure and detection
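
A minimal sketch of a rack topology script of the kind wired in via topology.script.file.name; the IP-to-rack mapping is illustrative:

    #!/bin/bash
    # HDFS passes one or more IPs/hostnames as arguments and expects
    # one rack path printed per argument, in order.
    for host in "$@"; do
      case "$host" in
        10.1.1.*) echo "/dc1/rack1" ;;
        10.1.2.*) echo "/dc1/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done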
Don’t forget metadata

• Your data is defined by Hive metadata
• But this is easy! SQL backups as per usual for Hive safety
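
For a MySQL-backed metastore, the usual SQL backup is a plain dump; a minimal sketch, where the host, user, and database name "metastore" are assumptions:

    # Dump the Hive metastore database to a timestamped file
    mysqldump --single-transaction -h metastore-db -u hive -p metastore \
      > hive-metastore-$(date +%Y%m%d).sql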
Cool… Basic hardware is under control
Not quite

• Employ monitoring to track node health
• Examine data node block scanner reports (http://datanode:50075/blockScannerReport)
• Hadoop fsck is your friend

Of course, your friendly neighborhood Hadoop vendor has tools – Cloudera Manager health checks FTW!
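
A quick sketch of the fsck check mentioned above (the path is illustrative):

    # Summary health report – look for "Status: HEALTHY" and
    # the counts of under-replicated and corrupt blocks
    hadoop fsck /
    # Per-file detail, including block IDs and their locations
    hadoop fsck /important/dataset -files -blocks -locations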
Phew… Past the easy stuff
One more small detail…

Upgrades for HDFS should be treated with care – on-disk layout changes are risky!

• Save name node meta-data offsite
• Test the upgrade on a smaller cluster before pushing it out
• Data layout upgrades support roll-back, but be safe
• Make backups of all (or at least the important) data to a remote location before upgrading!
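
One way to save the name node metadata before an upgrade is to force a checkpoint and copy the image directory out; a sketch, assuming Hadoop 1.x commands (hdfs dfsadmin on Hadoop 2) and illustrative paths:

    # Enter safe mode and checkpoint the namespace to disk
    hadoop dfsadmin -safemode enter
    hadoop dfsadmin -saveNamespace
    # Copy the metadata directory (fsimage + edits) offsite
    cp -r /data/1/dfs/nn /mnt/nfs/nn-backup-$(date +%Y%m%d)
    hadoop dfsadmin -safemode leave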
Application or user errors

Apply the principle of least privilege:

Permissions scope
  Users only have access to data they must have access to

Quota management
  Name quota: limits the number of files rooted at a dir
  Space quota: limits the bytes of files rooted at a dir
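
Setting each quota type is a one-liner; a sketch with illustrative limits and path (note the space quota counts bytes across all replicas):

    # At most 10,000 names (files + directories) under /user/projectA
    hadoop dfsadmin -setQuota 10000 /user/projectA
    # At most 1 TB of raw (replicated) space under /user/projectA
    hadoop dfsadmin -setSpaceQuota 1t /user/projectA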
Protecting against accidental deletes

Trash
  When enabled, files are deleted into trash
  Enable using fs.trash.interval to set the trash interval

Keep in mind:
• Trash deletion only works through the fs shell – programmatic deletes will not employ Trash
• .Trash is a per-user directory for restores
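
A sketch of enabling and using trash; the 1440-minute interval and paths are illustrative:

    <!-- core-site.xml: keep deleted files in trash for 24 hours -->
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>

    # A delete through the fs shell moves the file into the user's trash…
    hadoop fs -rm /user/alice/important.csv
    # …from where it can be restored until the interval expires
    hadoop fs -mv /user/alice/.Trash/Current/user/alice/important.csv \
      /user/alice/important.csv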
Accidental deletes – don’t forget metadata

• Again, regular SQL backups are key
HDFS Snapshots

What are snapshots?
Snapshots represent the state of the system at a point in time, often implemented using copy-on-write (COW) semantics.

• In HDFS, an append-only fs means only deletes have to be managed
• Many of the problems with COW are gone!
HDFS Snapshots – coming to a distro near you

The community is hard at work on HDFS snapshots; expect availability in major distros within the year.

Some implementation details – NameNode snapshotting:
• Very fast snapping capability
• Consistency guarantees
• Restores need to perform a data copy
• .snapshot directories for access to individual files
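
For reference, the interface that eventually shipped upstream looks like this (directory path and snapshot name are illustrative):

    # An administrator marks a directory as snapshottable
    hdfs dfsadmin -allowSnapshot /user/warehouse
    # Users snapshot it at will
    hdfs dfs -createSnapshot /user/warehouse nightly-2013-05-01
    # Files are reachable read-only under the .snapshot directory
    hdfs dfs -cat /user/warehouse/.snapshot/nightly-2013-05-01/part-00000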
What can HDFS Snapshots do for you?

• Handles user/application data corruption
• Handles accidental deletes
• Can also be used for Test/Dev purposes!
HBase snapshots

Oh hello, HBase!
Very similar construct to HDFS snapshots: COW model

• Fast snaps
• Consistent snapshots
• Restores still need a copy (hey, at least we are consistent)
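
A sketch of the HBase shell interface for snapshots (table and snapshot names are illustrative):

    hbase> snapshot 'orders', 'orders-snap-20130501'
    hbase> list_snapshots
    # Restoring to a new table is the copy step mentioned above
    hbase> clone_snapshot 'orders-snap-20130501', 'orders_restored'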
Hive metadata

The recurring theme of data + meta-data:
ideally, metadata is backed up in the same flow as the core data, since consistency of data and metadata is really important.
Management of snapshots
Space considerations:

• % of cluster for snapshots
• Number of snapshots
• Alerting on space issues

Scheduling backups:

• Time based
• Workflow based
Great… Are we done?

        Don’t forget Roger Duronio!

Principle of least privilege still matters…
Disaster Recovery

[Diagram: Datacenter A, running HDFS, Hive, and HBase, replicating to Datacenter B]
Teeing vs Copying

Teeing: send data during the ingest phase to both production and replica clusters
• Time delay is minimal between clusters
• Bandwidth required could be larger
• Requires re-processing data on both sides
• No consistency between sites

Copying: data is copied from production to replica as a separate step after processing
• Consistent data between both sites
• Process once only
• Time delay for RPO objectives to do incremental copy
• More bandwidth needed
Recommendations?

Scenario dependent, but generally prefer copying over teeing
How to replicate – per service

HDFS
  Teeing: Flume and Sqoop support teeing
  Copying: DistCp for copying

HBase
  Teeing: application-level teeing
  Copying: HBase replication

Hive
  Teeing: N/A
  Copying: database import/export*

* Database import/export isn’t the full story
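
A minimal DistCp sketch for the HDFS copying path (cluster host names and paths are illustrative):

    # Copy a dataset from production to the DR cluster;
    # -update makes repeat runs incremental, copying only files that differ
    hadoop distcp -update \
      hdfs://prod-nn:8020/data/events \
      hdfs://dr-nn:8020/data/events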
Hive metadata

The recurring theme of data + meta-data:
ideally, metadata is backed up in the same flow as the core data, since consistency of data and metadata is really important.
Key considerations for large data
                   movement
•   Is your data compressed?
     – None of the systems support compression on the wire natively
     – WAN accelerators can help but cost $$

•   Do you know your bandwidth needs?
     – Initial data load
     – Daily ingest rate – Maintain historical information

•   Do you know your network security setup?
     – Data nodes & Region Servers talk to each other across sites – they need network connectivity
•   Have you configured security appropriately?
     – Kerberos support for cross-realm trust is challenging

•   What about cross-version copying?
     – Both clusters can’t always be on the same version – and copying across versions is not trivial
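
For cross-version copies, one common approach is to run DistCp on the destination cluster and read from the source over the read-only, version-independent HFTP interface; a sketch with illustrative host names:

    # hftp:// uses the source name node's HTTP port (default 50070)
    hadoop distcp hftp://old-nn:50070/data/events hdfs://new-nn:8020/data/events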
Management of replications
Scheduling replication jobs

• Time based
• Workflow based – Kicked off from Oozie script?

Prioritization

• Keep replications in a separate scheduler group and
  dedicate capacity to replication jobs
• Don’t schedule more map tasks than the available network bandwidth between sites can handle
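
DistCp itself offers both knobs; a sketch using DistCp v2 options with illustrative numbers and host names:

    # Cap the job at 20 maps, each throttled to 10 MB/s,
    # so replication cannot saturate the inter-site link
    hadoop distcp -m 20 -bandwidth 10 \
      hdfs://prod-nn:8020/data/events hdfs://dr-nn:8020/data/events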
Secondary configuration and usage
Hardware considerations
• Denser disk configurations acceptable on remote site
  depending on workload goals – 4 TB disks vs 2 TB disks, etc
• Fewer nodes are typical – consider replicating only critical
  data. Be careful playing with replication factors

Usage considerations
• Physical partitioning means the replica is a great place for ad-hoc analytics
• Production workloads continue to run on core cluster but
  ad-hoc analytics on replica cluster
• For HBase, all clusters can be used for data serving!
What about external systems?

• Backing up to external systems is a one-way street with large data volumes

• Can’t do useful processing on the other side

• The cost of Hadoop storage is fairly low, especially if you can drive work on it
Summary
• It can be done!

• Lots of gotchas and details to track in the process

• We haven’t even talked about applications and
  configuration!

• Failure workflows are important too – testing,
  testing, testing
Cloudera Enterprise BDR

CLOUDERA ENTERPRISE
CLOUDERA MANAGER: SELECT → CONFIGURE → SYNCHRONIZE → MONITOR
DISASTER RECOVERY MODULE

CDH
• HDFS distributed replication – high-performance replication using MapReduce
• Hive metastore replication – the only disaster recovery solution for metadata


Editor's Notes

  1. Data movement is expensive. Hardware is more likely to fail. There are more complex interactions in a distributed environment. Each service requires different hand-holding.
  2. Keep in mind that configuration may not even make sense to replicate – the remote side may have different configuration options.
  3. Data is split into blocks (default 128 MB). Blocks are replicated (default: 3 times). HDFS is rack aware.
  4. Cloudera Manager helps with replication by managing versions as well.
  5. Cross-version management. Improved distcp. Hive export/import with updates. Simple UI.