Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017

Disaster Recovery for Big Data

About us

We are nerds!

Started working in Big Data for international companies

Founded a start-up a few years ago:
 - With colleagues working in related technical areas
 - And who also knew business stuff!

We’ve been participating in different Big Data projects

Introduction
“I already have HDFS replication and High Availability in my services, why would I need Disaster Recovery (or backup)?”

Concepts

High Availability (HA)
 - Protects against failing components: disks, servers, network
 - Is generally a “systems” issue
 - Redundant: doubles components
 - Generally has strict network requirements
 - Fully automated, immediate

Concepts

Backup
 - Allows you to go back to a previous state in time: daily, monthly, etc.
 - It is a “data” issue
 - Protects from accidental deletion or modification
 - Also used to check for unwanted modifications
 - Takes some time to restore

Concepts

Disaster Recovery
 - Allows you to work elsewhere
 - It is a “business” issue
 - Covers you against main-site failures such as electric power or network outages, fires, floods or building damage
 - Similar to having insurance
 - Medium time to be back online

The ideal Disaster Recovery

High Availability for datacenters

Exact duplicate of the main site
 - Seamless operation (no changes required)
 - Same performance
 - Same data

This is often very expensive and sometimes downright impossible

DR considerations

So, can we build a cheap(ish) DR?

We must evaluate some tradeoffs:
 - What’s the cost of the service not being available? (Murphy’s Law: accidents will happen when you are busiest)
 - Is all information equally important? Can we lose a small amount of data?
 - Can we wait until we recover certain data from backup?
 - Can we find other uses for the DR site?

DR considerations

Near or far?
 - Availability
 - Latency
 - Legal considerations

DR considerations

Synchronous vs Asynchronous
 - Synchronous replication requires a FAST connection
 - Synchronous works at transaction level and is necessary for operational systems
 - Asynchronous replication converges over time
 - Asynchronous is not affected by delays, nor does it create them
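
To see why synchronous replication needs a fast (and in practice short) link, a rough lower bound on commit latency can be estimated from distance alone. The sketch below assumes signals travel through optical fiber at about 200,000 km/s and ignores switching, queuing and disk time, so real figures will be worse. The function names and sample distances are illustrative and not taken from the talk.

```python
# Rough lower bound on synchronous commit latency from link distance.
# Assumption: signal speed in optical fiber is ~200,000 km/s (about 2/3 of c).
FIBER_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round trip, in milliseconds, over a fiber path of this length."""
    return 2 * distance_km / FIBER_KM_PER_S * 1000


def max_serial_commits_per_s(distance_km: float) -> float:
    """Upper bound on commits per second when each commit waits for the remote ack."""
    return 1000 / min_round_trip_ms(distance_km)


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(f"{km:>5} km: >= {min_round_trip_ms(km):.2f} ms per commit, "
              f"<= {max_serial_commits_per_s(km):.0f} serial commits/s")
```

Asynchronous replication does not pay this per-transaction cost, which is why it tolerates distant sites at the price of the copy lagging behind.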

Big Data DR

Generally can’t be copied synchronously

No VM replication

Other DR rules apply:
 - Since it impacts users, someone is in charge of the “starting gun”
 - DNS and network changes to point clients to the DR site

Main types:
 - Storage replication
 - Dual ingestion

Storage replication

Similar to non-Big Data solutions, where central storage is replicated

Generally implemented using distcp and HDFS snapshots

Data is ingested into the source cluster and then copied to the DR cluster
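
One way to combine distcp and HDFS snapshots is sketched below: take a snapshot on the source, copy only the difference since the previous snapshot, then take a matching snapshot on the destination. The code only builds the command lines and runs nothing. It assumes both directories are already snapshottable (`hdfs dfsadmin -allowSnapshot`), that the destination has not been modified since the previous cycle, and that the paths, host name and snapshot names are hypothetical placeholders.

```python
# Sketch: build the commands for one snapshot-based replication cycle.
# Nothing is executed here; a scheduler could pass each list to subprocess.run.
from typing import List, Optional


def replication_commands(src_dir: str, dst_dir: str, new_snap: str,
                         prev_snap: Optional[str] = None) -> List[List[str]]:
    """Return the commands for one cycle, in the order they should run."""
    cmds = [["hdfs", "dfs", "-createSnapshot", src_dir, new_snap]]
    if prev_snap is None:
        # First run: full copy, read from the immutable snapshot view.
        cmds.append(["hadoop", "distcp", "-update",
                     f"{src_dir}/.snapshot/{new_snap}", dst_dir])
    else:
        # Later runs: copy only what changed between the two source snapshots.
        cmds.append(["hadoop", "distcp", "-update", "-diff", prev_snap, new_snap,
                     src_dir, dst_dir])
    # Snapshot the destination under the same name, as a baseline for the next -diff.
    cmds.append(["hdfs", "dfs", "-createSnapshot", dst_dir, new_snap])
    return cmds


if __name__ == "__main__":
    # Hypothetical paths and snapshot names, for illustration only.
    for cmd in replication_commands("/data/events", "hdfs://dr-nn:8020/data/events",
                                    "s2", prev_snap="s1"):
        print(" ".join(cmd))
```

Reading from a snapshot rather than the live directory keeps the copy consistent even if ingestion continues while the job runs.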

Storage replication

Administrative overhead:
 - Copy jobs must be scheduled
 - Metadata changes must be tracked

Good enough for data that comes from traditional ETLs such as daily batches

Dual Ingestion

No files, just streams

Generally ingested from multiple outside sources through Kafka

Streams must be directed to both sites
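
Directing a stream to both sites could look roughly like the sketch below. Each site is represented by a plain callable that stands in for something like a Kafka producer's send method, so no Kafka API is used here. The class name `DualWriter` and the `missed` bookkeeping are hypothetical, added only to show that a failure on one site should not stop ingestion into the other.

```python
# Sketch: send every record to both sites and remember what a site missed.
from typing import Callable, Dict, List

Sender = Callable[[bytes], None]  # stand-in for a real producer's send function


class DualWriter:
    def __init__(self, sites: Dict[str, Sender]) -> None:
        self.sites = sites
        # Records a site failed to accept, kept so they can be re-sent later.
        self.missed: Dict[str, List[bytes]] = {name: [] for name in sites}

    def send(self, record: bytes) -> None:
        for name, sender in self.sites.items():
            try:
                sender(record)
            except Exception:
                # One site being down must not block ingestion into the other.
                self.missed[name].append(record)


if __name__ == "__main__":
    main_site: List[bytes] = []

    def dr_down(record: bytes) -> None:
        raise ConnectionError("DR site unreachable")

    writer = DualWriter({"main": main_site.append, "dr": dr_down})
    writer.send(b"event-1")
    print(main_site, writer.missed["dr"])
```

The records collected in `missed` are exactly the gap that a later consistency check or re-sync has to close.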

Dual Ingestion

Adds complexity to apps
 - NiFi can be set up as a front-end to both endpoints

Data consistency must be checked
 - Checks can be set up to run automatically via monitoring
 - Consolidation processes (such as a monthly re-sync) might be needed
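
A consistency check between the two sites can be as simple as comparing one digest per partition (for example per day or per topic partition) and flagging the ones that differ. The sketch below is one possible approach, not the method used in the talk. The digest combines a record count with a sum of SHA-256 hashes so that it does not depend on arrival order, and the partition names are hypothetical.

```python
# Sketch: compare per-partition digests from both sites and list what to re-sync.
import hashlib
from typing import Dict, Iterable, List

_MOD = 2 ** 256


def partition_digest(records: Iterable[bytes]) -> str:
    """Order-independent digest: record count plus sum of SHA-256 values mod 2**256."""
    count, acc = 0, 0
    for rec in records:
        acc = (acc + int.from_bytes(hashlib.sha256(rec).digest(), "big")) % _MOD
        count += 1
    return f"{count}:{acc:064x}"


def partitions_to_resync(main: Dict[str, str], dr: Dict[str, str]) -> List[str]:
    """Partitions whose digests differ, or that exist on only one site."""
    return sorted(p for p in main.keys() | dr.keys() if main.get(p) != dr.get(p))


if __name__ == "__main__":
    main = {"2017-11-16": partition_digest([b"a", b"b"]),
            "2017-11-17": partition_digest([b"c"])}
    dr = {"2017-11-16": partition_digest([b"b", b"a"]),  # same data, other order
          "2017-11-17": partition_digest([])}            # DR missed a record
    print(partitions_to_resync(main, dr))
```

Run from a monitoring job, the returned list gives the consolidation process a narrow set of partitions to copy instead of a full re-sync.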

Others

Ingestion replication
 - Variant of dual ingestion
 - A consumer is set up on the source Kafka that in turn writes to a destination Kafka
 - Becomes a bottleneck if the initial streams were generated by many producers
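
The consume-then-produce relay described above can be sketched with in-memory lists standing in for the source and destination Kafka logs. The function name `relay`, the batch size and the offset handling are assumptions made for illustration, and a real deployment would use an existing replication tool rather than hand-written code.

```python
# Sketch: a single relay that reads from the source log and writes to the destination.
from typing import List


def relay(source: List[bytes], destination: List[bytes],
          committed: int, batch: int = 100) -> int:
    """Copy up to `batch` records starting at offset `committed`; return the new offset."""
    chunk = source[committed:committed + batch]
    destination.extend(chunk)
    # The offset is advanced only after the write, so a crash between the two
    # steps re-sends records (duplicates) rather than losing them.
    return committed + len(chunk)


if __name__ == "__main__":
    src = [b"e1", b"e2", b"e3"]
    dst: List[bytes] = []
    offset = 0
    while offset < len(src):
        offset = relay(src, dst, offset, batch=2)
    print(offset, dst)
```

Because every record written by many producers has to pass through this one loop, its throughput caps the replication rate, which is the bottleneck the slide warns about.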

Mixed:
 - Previous solutions are not mutually exclusive
 - Storage replication for batch processes’ results
 - Dual ingestion for streams

Commercial offerings

Solutions that ease DR setup

Cloudera BDR
 - Coordinates HDFS snapshots and copy jobs

WANdisco Fusion
 - Continuous storage replication

Confluent Multi-site
 - Allows multi-site Kafka data replication

Tips

Big Data clusters have many nodes
 - Costly to replicate
 - Performance / Capacity tradeoff
 - We can use cheaper servers in DR, since we don’t expect to use them often

Tips

Document and test procedures
 - DR is rarely fully automated, so responsibilities and actions should be clearly defined
 - Plan for (at least) a yearly DR run
 - Track changes in software and configuration

Tips

Once you have a DR solution, other uses will surface

DR site can be used for backup
 - Maintain HDFS snapshots

DR data can be used for testing / reporting
 - Warning: testing may alter the stored data

Conclusions

Balance HA / Backup / DR as needed; they are not mutually exclusive:
 - Different costs
 - Different impact

Big Data DR is different:
 - Dedicated hardware
 - No VMs, no storage array

Plan for DATA CENTRIC solutions
