Hadoop Backup and Disaster Recovery


Jai Ranganathan's presentation from Big Data TechCon 2013.


1. Hadoop Backup and Disaster Recovery - Jai Ranganathan, Cloudera Inc
2. What makes Hadoop different? Not much, EXCEPT:
• Tera- to peta-bytes of data
• Commodity hardware
• Highly distributed
• Many different services
3. What needs protection?
• Data sets: your data and the metadata about your data (Hive)
• Applications: system applications (JT, NN, Region Servers, etc.) and user applications
• Configuration: knobs and configurations necessary to run applications
4. We will focus on data sets, but not because the others aren't important. Existing systems and processes can help manage apps and configuration (to some extent).
5. Classes of problems to plan for
Hardware failures:
• Data corruption on disk
• Disk/node crash
• Rack failure
User/application error:
• Accidental or malicious data deletion
• Corrupted data writes
Site failures:
• Permanent site loss – fire, ice, etc.
• Temporary site loss – network, power, etc. (more common)
6. Business goals must drive solutions
RPOs and RTOs are awesome... but plan for what you care about – how much is this data worth?

Failure mode         Risk     Cost
Disk failure         High     Low
Node failure         High     Low
Rack failure         Medium   Medium
Accidental deletes   Medium   Medium
Site loss            Low      High
7. Basics of HDFS (architecture diagram, from the Hadoop documentation)
8. Hardware failures – data corruption
Data corruption on disk:
• Checksum metadata is stored with each block of a file
• If checksums do not match, the name node discards the block and replaces it with a fresh copy
• The name node can write its metadata to multiple copies for safety – write to different file systems and make backups
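A minimal sketch of the last point, assuming a Hadoop 2 era cluster; the directory paths are illustrative only:

    # Sketch only: the name node writes its fsimage/edits to every directory listed in
    # dfs.name.dir (Hadoop 1) / dfs.namenode.name.dir (Hadoop 2), e.g. two local disks
    # plus an NFS mount (example paths):
    #   /data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn
    # On Hadoop 2 you can confirm what the running configuration resolves to:
    hdfs getconf -confKey dfs.namenode.name.dir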
9. Hardware failures – crashes
Disk/node crash:
• Synchronous replication saves the day – the first two replicas are always on different hosts
• Hardware failure is detected by heartbeat loss
• Name node HA for metadata
• HDFS automatically re-replicates blocks without enough replicas through a periodic process
10. Hardware failures – rack failure
Rack failure:
• Configure at least 3 replicas and provide rack information (topology.node.switch.mapping.impl or topology.script.file.name)
• The 3rd replica is always placed in a different rack
• The 3rd replica is important – it allows a safe time window between failure and detection
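A toy example of the topology script referenced above; the subnet-to-rack mapping is an assumption for illustration:

    #!/bin/bash
    # Illustrative rack topology script, referenced from topology.script.file.name.
    # Hadoop invokes it with one or more IPs/hostnames and expects one rack path
    # per argument on stdout. The subnet-to-rack mapping here is a made-up example.
    for host in "$@"; do
      case "$host" in
        10.1.1.*) echo "/dc1/rack1" ;;
        10.1.2.*) echo "/dc1/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done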
11. Don't forget metadata
• Your data is defined by Hive metadata
• But this is easy! SQL backups as per usual for Hive safety
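For example, if the metastore lives in MySQL (database name and credentials below are assumptions), a regular dump is enough:

    # Back up a MySQL-backed Hive metastore named "metastore" (names are illustrative)
    mysqldump --single-transaction -u hive -p metastore > metastore-$(date +%Y%m%d).sql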
12. Cool... basic hardware is under control? Not quite
• Employ monitoring to track node health
• Examine data node block scanner reports (http://datanode:50075/blockScannerReport)
• Hadoop fsck is your friend
Of course, your friendly neighborhood Hadoop vendor has tools – Cloudera Manager health checks FTW!
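For instance (the datanode host is a placeholder, as on the slide, and fsck can be heavy on very large namespaces):

    # Namespace health check; run off-peak on big clusters
    hadoop fsck / -files -blocks -locations
    # Per-datanode block scanner report; add ?listblocks for the full block listing
    curl 'http://datanode:50075/blockScannerReport?listblocks'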
13. Phew... past the easy stuff. One more small detail...
Upgrades for HDFS should be treated with care – on-disk layout changes are risky!
• Save name node metadata offsite
• Test the upgrade on a smaller cluster before pushing it out
• Data layout upgrades support roll-back, but be safe
• Make backups of all or important data to a remote location before the upgrade!
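A rough sketch of the "save name node metadata offsite" step; the host and paths are assumptions:

    # Pre-upgrade name node metadata backup (paths and backup host are examples)
    hadoop dfsadmin -safemode enter      # freeze namespace changes
    hadoop dfsadmin -saveNamespace       # write a fresh, consistent fsimage/edits
    rsync -a /data/1/dfs/nn/ backuphost:/backups/namenode-$(date +%Y%m%d)/
    hadoop dfsadmin -safemode leave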
14. Application or user errors
Permissions scope: users only have access to data they must have access to – apply the principle of least privilege.
Quota management:
• Name quota: limits the number of files rooted at a directory
• Space quota: limits the bytes of files rooted at a directory
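For example (the project directory is an assumption):

    # Name quota: cap the number of files and directories under a project tree
    hadoop dfsadmin -setQuota 1000000 /user/projectA
    # Space quota: cap raw bytes (counts all replicas); suffixes like 1t are accepted
    hadoop dfsadmin -setSpaceQuota 1t /user/projectA
    # Inspect usage against both quotas
    hadoop fs -count -q /user/projectA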
15. Protecting against accidental deletes
Trash: when enabled, deleted files go into trash. Enable it by setting fs.trash.interval to the trash retention interval.
Keep in mind:
• Trash only works through the fs shell – programmatic deletes will not employ trash
• .Trash is a per-user directory used for restores
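A small illustration, assuming fs.trash.interval is set (in minutes, e.g. 1440 = 24 hours); the paths below are examples:

    # Shell deletes move files into the deleting user's trash
    hadoop fs -rm /user/alice/reports/q1.csv
    # The file lands under .Trash/Current, mirroring its original path
    hadoop fs -ls /user/alice/.Trash/Current/user/alice/reports/
    # Restore by moving it back before the trash interval expires
    hadoop fs -mv /user/alice/.Trash/Current/user/alice/reports/q1.csv /user/alice/reports/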
16. Accidental deletes – don't forget metadata
• Again, regular SQL backups are key
17. HDFS snapshots
What are snapshots? Snapshots represent the state of the system at a point in time, often implemented using copy-on-write (COW) semantics.
• In HDFS, an append-only file system means only deletes have to be managed
• Many of the problems with COW are gone!
18. HDFS snapshots – coming to a distro near you
The community is hard at work on HDFS snapshots; expect availability in major distros within the year.
Some implementation details – NameNode snapshotting:
• Very fast snapping capability
• Consistency guarantees
• Restores need to perform a data copy
• .snapshot directories for access to individual files
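For reference, in the implementation that eventually shipped the commands look roughly like this (the path and snapshot name are examples):

    # Mark a directory snapshottable, then snapshot it
    hdfs dfsadmin -allowSnapshot /data/warehouse
    hdfs dfs -createSnapshot /data/warehouse snap-20130601
    # Read-only access to the snapshotted files
    hdfs dfs -ls /data/warehouse/.snapshot/snap-20130601
    # A restore is a copy back out of the snapshot
    hdfs dfs -cp /data/warehouse/.snapshot/snap-20130601/part-00000 /data/warehouse/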
19. What can HDFS snapshots do for you?
• Handles user/application data corruption
• Handles accidental deletes
• Can also be used for test/dev purposes!
20. HBase snapshots
Oh hello, HBase! A very similar construct to HDFS snapshots – a COW model:
• Fast snaps
• Consistent snapshots
• Restores still need a copy (hey, at least we are consistent)
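A sketch of the HBase shell workflow, assuming a snapshot-capable HBase release (0.94.6+/0.95+); table and snapshot names are examples:

    # Take and list snapshots
    echo "snapshot 'usertable', 'usertable-snap-20130601'" | hbase shell
    echo "list_snapshots" | hbase shell
    # Restore in place (table must be disabled) or clone to a new table for inspection
    echo "disable 'usertable'; restore_snapshot 'usertable-snap-20130601'; enable 'usertable'" | hbase shell
    echo "clone_snapshot 'usertable-snap-20130601', 'usertable_restored'" | hbase shell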
21. Hive metadata
The recurring theme of data + metadata: ideally, metadata is backed up in the same flow as the core data. Consistency of data and metadata is really important.
22. Management of snapshots
Space considerations:
• % of cluster for snapshots
• Number of snapshots
• Alerting on space issues
Scheduling backups:
• Time based
• Workflow based
23. Great... are we done?
Don't forget Roger Duronio! The principle of least privilege still matters...
24. Disaster recovery
(Diagram: Datacenter A replicating HDFS, Hive, and HBase to Datacenter B)
25. Teeing vs copying
Teeing: send data during the ingest phase to both the production and replica clusters
• Time delay between clusters is minimal
• Bandwidth required could be larger
• Requires re-processing data on both sides
• No consistency between sites
Copying: data is copied from production to the replica as a separate step after processing
• Consistent data between both sites
• Process once only
• Time delay for RPO objectives to do an incremental copy
• More bandwidth needed
26. Recommendations?
Scenario dependent, but generally prefer copying over teeing.
27. How to replicate – per service
HDFS – Teeing: Flume and Sqoop support teeing; Copying: DistCP
HBase – Teeing: application-level teeing; Copying: HBase replication
Hive – Teeing: NA; Copying: database import/export*
* Database import/export isn't the full story
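A typical DistCP invocation for the HDFS copying case (namenode hosts and paths are examples):

    # Incrementally copy a dataset from the production cluster to the DR cluster
    hadoop distcp -update -delete \
        hdfs://nn-prod:8020/data/warehouse \
        hdfs://nn-dr:8020/data/warehouse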
28. Hive metadata
The recurring theme of data + metadata: ideally, metadata is backed up in the same flow as the core data. Consistency of data and metadata is really important.
29. Key considerations for large data movement
• Is your data compressed?
  – None of the systems support compression on the wire natively
  – WAN accelerators can help but cost $$
• Do you know your bandwidth needs?
  – Initial data load
  – Daily ingest rate
  – Maintain historical information
• Do you know your network security setup?
  – Data nodes and Region Servers talk to each other – they need network connectivity between sites
• Have you configured security appropriately?
  – Kerberos support for cross-realm trust is challenging
• What about cross-version copying?
  – You can't always have both clusters on the same version – and this is not trivial (see the sketch below)
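One common workaround for the cross-version case: run DistCp on the destination cluster and read the source over HFTP, which is read-only and HTTP-based, so it tolerates HDFS version differences. Hosts, ports, and paths below are examples:

    # Run on the destination (newer) cluster; 50070 is the source namenode's HTTP port
    hadoop distcp hftp://nn-old:50070/data/warehouse hdfs://nn-new:8020/data/warehouse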
30. Management of replications
Scheduling replication jobs:
• Time based
• Workflow based – kicked off from an Oozie script?
Prioritization:
• Keep replications in a separate scheduler group and dedicate capacity to replication jobs
• Don't schedule more map tasks than the available network bandwidth between sites can handle (see the sketch below)
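A hedged sketch of both points, assuming the MR1 fair scheduler; the pool name, map-task cap, hosts, and paths are examples:

    # Keep the copy in a dedicated scheduler pool and cap its map tasks so it
    # cannot saturate the inter-site link
    hadoop distcp \
        -Dmapred.fairscheduler.pool=replication \
        -m 20 \
        hdfs://nn-prod:8020/data/warehouse hdfs://nn-dr:8020/data/warehouse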
31. Secondary configuration and usage
Hardware considerations:
• Denser disk configurations are acceptable on the remote site depending on workload goals – 4 TB disks vs 2 TB disks, etc.
• Fewer nodes are typical – consider replicating only critical data. Be careful playing with replication factors (see the sketch below)
Usage considerations:
• Physical partitioning makes it a great place for ad-hoc analytics
• Production workloads continue to run on the core cluster, but ad-hoc analytics can run on the replica cluster
• For HBase, all clusters can be used for data serving!
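For the replication-factor point above, a hedged example (the path is an assumption; think twice before lowering replication on critical data):

    # Keep only 2 replicas of lower-value, already-protected data on the DR cluster
    hadoop fs -setrep -R 2 /data/archive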
32. What about external systems?
• Backing up to external systems is a one-way street with large data volumes
• You can't do useful processing on the other side
• The cost of Hadoop storage is fairly low, especially if you can drive work on it
33. Summary
• It can be done!
• Lots of gotchas and details to track in the process
• We haven't even talked about applications and configuration!
• Failure workflows are important too – testing, testing, testing
34. Cloudera Enterprise BDR
(Product slide: Cloudera Manager disaster recovery module – select, configure, synchronize, monitor – on top of CDH)
• HDFS distributed replication: high-performance replication using MapReduce
• Hive metastore replication: the only disaster recovery solution for metadata
