Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017


All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Published in: Technology

  1. Disaster Recovery for Big Data
  2. About us
       • We are nerds!
       • Started working in Big Data for international companies
       • Founded a start-up a few years ago:
           • With colleagues working in related technical areas
           • And who also knew business stuff!
       • We’ve been participating in different Big Data projects
  3. Introduction
       “I already have HDFS replication and High Availability in my services, why would I need Disaster Recovery (or backup)?”
  4. Concepts: High Availability (HA)
       • Protects from failing components: disks, servers, network
       • Is generally a “systems” issue
       • Redundant, doubles components
       • Generally has strict network requirements
       • Fully automated, immediate
  5. Concepts: Backup
       • Allows you to go back to a previous state in time: daily, monthly, etc.
       • It is a “data” issue
       • Protects from accidental deletion or modification
       • Also used to check for unwanted modifications
       • Takes some time to restore
  6. Concepts: Disaster Recovery
       • Allows you to work elsewhere
       • It is a “business” issue
       • Covers you from main site failures such as electric power or network outages, fires, floods or building damage
       • Similar to having insurance
       • Medium time to be back online
  7. The ideal Disaster Recovery
       • High Availability for datacenters
       • Exact duplicate of the main site
       • Seamless operation (no changes required)
       • Same performance
       • Same data
       • This is often very expensive and sometimes downright impossible
  8. DR considerations
       • So, can we build a cheap(ish) DR?
       • We must evaluate some tradeoffs:
           • What’s the cost of the service not being available? (Murphy’s Law: accidents will happen when you are busiest)
           • Is all information equally important? Can we lose a small amount of data?
           • Can we wait until we recover certain data from backup?
           • Can I find other uses for the DR site?
  9. DR considerations: Near or far?
       • Availability
       • Latency
       • Legal considerations
  10. DR considerations: Synchronous vs Asynchronous
       • Synchronous replication requires a FAST connection
       • Synchronous works at transaction level and is necessary for operational systems
       • Asynchronous replication converges over time
       • Asynchronous is not affected by delays nor does it create them
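
The point about synchronous replication needing a fast connection can be illustrated with simple arithmetic. The sketch below is a deliberately simplified model, not a measurement: it assumes that a synchronous commit waits for one full round trip to the remote site, that an asynchronous commit returns as soon as the local write is done, and that a single writer issues transactions one after another. The function names and the numbers in the usage note are hypothetical.

```python
def sync_commit_ms(local_commit_ms: float, round_trip_ms: float) -> float:
    """Synchronous: the commit returns only after the remote site acknowledges."""
    return local_commit_ms + round_trip_ms


def async_commit_ms(local_commit_ms: float, round_trip_ms: float) -> float:
    """Asynchronous: the commit returns locally; the copy is shipped later."""
    return local_commit_ms


def max_serial_tps(commit_ms: float) -> float:
    """Upper bound on transactions per second for one serial writer."""
    return 1000.0 / commit_ms
```

With an assumed 2 ms local commit and a 20 ms round trip between sites, the synchronous commit takes 22 ms (about 45 serial transactions per second), while the asynchronous commit stays at 2 ms (500 per second) at the price of the DR copy lagging behind.
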
  11. Big Data DR
       • Can’t generally be copied synchronously
       • No VM replication
       • Other DR rules apply:
           • Since it impacts users, someone is in charge of the “starting gun”
           • DNS and network changes to point clients to the DR site
       • Main types:
           • Storage replication
           • Dual ingestion
  12. Storage replication
       • Similar to non-Big Data solutions, where central storage is replicated
       • Generally implemented using distcp and HDFS snapshots
       • Data is ingested in source cluster and then copied
  13. Storage replication
       • Administrative overhead:
           • Copy jobs must be scheduled
           • Metadata changes must be tracked
       • Good enough for data that comes from traditional ETLs such as daily batches
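
As a rough illustration of storage replication with distcp and HDFS snapshots, the sketch below builds the commands for one incremental copy cycle. It is a minimal sketch under stated assumptions, not the speaker's procedure: the paths, the DR NameNode address and the `dr-YYYY-MM-DD` snapshot naming are hypothetical, snapshots are assumed to be already allowed on both directories (`hdfs dfsadmin -allowSnapshot`), and the destination is assumed to hold the previous snapshot and to be unmodified since it was taken, which `distcp -update -diff` requires.

```python
import subprocess
from datetime import date
from typing import List


def snapshot_name(day: date) -> str:
    """Hypothetical naming scheme: one snapshot per day, e.g. dr-2017-11-16."""
    return f"dr-{day.isoformat()}"


def replication_commands(src_dir: str, dst_dir: str,
                         prev_snap: str, new_snap: str) -> List[List[str]]:
    """Commands for one cycle: snapshot the source, copy the delta, snapshot the copy."""
    return [
        # 1. Freeze a consistent view of the source directory.
        ["hdfs", "dfs", "-createSnapshot", src_dir, new_snap],
        # 2. Copy only what changed between the two source snapshots.
        ["hadoop", "distcp", "-update", "-diff", prev_snap, new_snap,
         src_dir, dst_dir],
        # 3. Record the same state on the destination for the next cycle.
        ["hdfs", "dfs", "-createSnapshot", dst_dir, new_snap],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

A scheduler (cron, Oozie or similar) could call `run_all(replication_commands("/data/events", "hdfs://dr-nn:8020/data/events", "dr-2017-11-15", "dr-2017-11-16"))` once per day. Metadata such as Hive table definitions is not covered by this copy and still has to be tracked separately.
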
  14. Dual Ingestion
       • No files, just streams
       • Generally ingested from multiple outside sources through Kafka
       • Streams must be directed to both sites
  15. Dual Ingestion
       • Adds complexity to apps
       • NiFi can be set up as a front-end to both endpoints
       • Data consistency must be checked
           • Can be automatically set up via monitoring
       • Consolidation processes (such as a monthly re-sync) might be needed
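
The dual ingestion idea, including the consistency check and the later consolidation, can be sketched as a small fan-out writer. This is an illustrative sketch only: `DualIngester` and the `Sink` callables are hypothetical stand-ins for whatever client writes to each site (for example a Kafka producer per cluster), and a real deployment would persist the backlog rather than keep it in memory.

```python
from typing import Callable, Dict, List

Sink = Callable[[bytes], None]  # hypothetical: wraps the writer for one site


class DualIngester:
    """Sends every record to all sites and tracks what each site received."""

    def __init__(self, sinks: Dict[str, Sink]) -> None:
        self.sinks = sinks
        self.delivered: Dict[str, int] = {name: 0 for name in sinks}
        self.pending: Dict[str, List[bytes]] = {name: [] for name in sinks}

    def send(self, record: bytes) -> None:
        """Write to every site; keep the record for sites that failed."""
        for name, sink in self.sinks.items():
            try:
                sink(record)
                self.delivered[name] += 1
            except Exception:
                self.pending[name].append(record)

    def in_sync(self) -> bool:
        """Consistency check: every site has received the same record count."""
        return len(set(self.delivered.values())) <= 1

    def resync(self) -> None:
        """Consolidation: retry the backlog of each site, in original order."""
        for name, backlog in list(self.pending.items()):
            remaining: List[bytes] = []
            for record in backlog:
                try:
                    self.sinks[name](record)
                    self.delivered[name] += 1
                except Exception:
                    remaining.append(record)
            self.pending[name] = remaining
```

Monitoring could poll `in_sync()` and trigger `resync()` when the sites diverge. Comparing counts is a weak check; checksums per time window would catch more problems, at a higher cost.
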
  16. Others
       • Ingestion replication
           • Variant of the dual ingestion
           • A consumer is set up in the source Kafka that in turn writes to a destination Kafka
           • Bottleneck if the initial streams were generated by many producers
       • Mixed:
           • Previous solutions are not mutually exclusive
           • Storage replication for batch processes’ results
           • Dual ingestion for streams
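
The ingestion replication variant boils down to a consume-then-produce loop. The sketch below shows that loop with in-memory stand-ins instead of a real Kafka client: `Record`, `produce` and `commit` are hypothetical names, and the only design choice illustrated is committing the source offset after the destination write, which gives at-least-once delivery (a crash between the two steps leads to a duplicate, never to a lost record).

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Record(NamedTuple):
    topic: str
    key: Optional[bytes]
    value: bytes
    offset: int


def replicate(source: Iterable[Record],
              produce: Callable[[str, Optional[bytes], bytes], None],
              commit: Callable[[int], None]) -> int:
    """Copy records from the source site to the destination site.

    The offset is committed only after the destination write succeeded,
    so an interrupted run resumes from the last record known to be copied.
    """
    copied = 0
    for record in source:
        produce(record.topic, record.key, record.value)
        commit(record.offset)
        copied += 1
    return copied
```

Because a single loop like this serializes what many producers originally wrote in parallel, it also shows where the bottleneck comes from; running one replicator per partition is the usual way to spread the load. Apache Kafka ships MirrorMaker for this purpose.
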
  17. Commercial offerings
       • Solutions that ease DR setup
       • Cloudera BDR
           • Coordinates HDFS snapshots and copy
       • WANdisco Fusion
           • Continuous storage replication
       • Confluent Multi-site
           • Allows multi-site Kafka data replication
  18. Tips
       • Big Data clusters have many nodes
           • Costly to replicate
       • Performance / Capacity tradeoff
           • We can use cheaper servers in DR, since we don’t expect to use them often
  19. Tips
       • Document and test procedures
           • DR is rarely fully automated, so responsibilities and actions should be clearly defined
           • Plan for (at least) a yearly DR run
           • Track changes in software and configuration
  20. Tips
       • Once you have a DR solution, other uses will surface
       • DR site can be used for backup
           • Maintain HDFS snapshots
       • DR data can be used for testing / reporting
           • Warning: it may alter stored data
  21. Conclusions
       • Balance HA / Backup / DR as needed; they are not mutually exclusive:
           • Different costs
           • Different impact
       • Big Data DR is different:
           • Dedicated hardware
           • No VMs, no storage array
       • Plan for DATA CENTRIC solutions
  22. Questions