Disaster recovery with OpenNebula
Carlo Daffara
First, let me get
some coffee.
“Disaster recovery (DR) involves a set of policies and
procedures to enable the recovery or continuation of vital
technology infrastructure and systems following a natural
or human-induced disaster. Disaster recovery focuses on
the IT or technology systems supporting critical business
functions, as opposed to business continuity, which
involves keeping all essential aspects of a business
functioning despite significant disruptive events. Disaster
recovery is therefore a subset of business continuity.”
80% of businesses affected by a major
incident either never re-open or close
within 18 months (Source: Axa)
From “Understanding the Cost of Data Center Downtime: An Analysis of the Financial Impact on Infrastructure Vulnerability”, Ponemon Research
“Let’s begin with one very interesting fact. According to a
survey completed in 2010, human error is responsible for
40% of all data loss, as compared to just 29% for hardware
or system failures. An earlier IBM study determined data
loss due to human error was as high as 80%” (From:
“Business continuity and disaster recovery planning for IT
professionals”, Elsevier press, 2014)
The recovery time objective (RTO) is the targeted duration of
time and a service level within which a business process must
be restored after a disaster (or disruption) in order to avoid
unacceptable consequences associated with a break in
business continuity.
The recovery point objective (RPO) is the maximum tolerable
period in which data might be lost from an IT service due to a
major incident.
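As a rough illustration of how RPO constrains a snapshot schedule, the sketch below uses a simplified model that is an assumption, not something stated in the slides: snapshots are taken at a fixed interval and each one takes a fixed time to reach the DR site. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta,
                         replication_delay: timedelta) -> timedelta:
    """Age of the newest snapshot at the DR site in the worst case.

    Worst case: the disaster hits just before the next snapshot finishes
    replicating, so the newest usable copy is one full interval plus one
    replication delay old.
    """
    return snapshot_interval + replication_delay


def meets_rpo(snapshot_interval: timedelta,
              replication_delay: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(snapshot_interval, replication_delay) <= rpo


if __name__ == "__main__":
    # Snapshots every 12 hours, 1 hour to replicate, RPO of 24 hours.
    print(meets_rpo(timedelta(hours=12), timedelta(hours=1), timedelta(hours=24)))
```

In this model a 12-hour snapshot cycle cannot satisfy a 12-hour RPO once any replication delay is added, so the interval has to be shorter than the RPO.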
“Alternative storage-based replication solutions cost a
minimum of $10,000 per terabyte of data covered plus
ongoing maintenance. For the composite organization’s
225 protected VMs with an average size of 100 gigabytes
(GB), the three year costs for licenses and maintenance are
estimated at $328,500” (Forrester research, “The Total
Economic Impact of VMware vCenter Site Recovery
Manager”, 2013)
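The figures in the Forrester quote can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB) and treats the quoted $10,000 per TB minimum as a license floor; the variable names are illustrative.

```python
# Figures quoted on the slide (Forrester, 2013).
PROTECTED_VMS = 225
AVG_VM_SIZE_GB = 100
MIN_COST_PER_TB_USD = 10_000
THREE_YEAR_COST_USD = 328_500

# Assumption: decimal units, 1 TB = 1000 GB.
protected_gb = PROTECTED_VMS * AVG_VM_SIZE_GB  # 22,500 GB, i.e. 22.5 TB
license_floor_usd = protected_gb * MIN_COST_PER_TB_USD // 1000
cost_per_vm_usd = THREE_YEAR_COST_USD // PROTECTED_VMS

print(license_floor_usd)
print(cost_per_vm_usd)
```

Under these assumptions the license floor is $225,000, and the quoted three-year total works out to $1,460 per protected VM.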
3 simple rules to make a working DR:
Rule 1: never put all your eggs in one
basket (be it hardware, software or cloud)
Customer buys full DR and snapshot capability from local
data center; data center updates SAN firmware and loses
everything. Customer discovers that snapshots and
backups were kept in the same SAN with everything else.
In electronics, an opto-isolator, also called an optocoupler,
photocoupler, or optical isolator, is a component that transfers
electrical signals between two isolated circuits by using light.
Opto-isolators prevent high voltages from affecting the system
receiving the signal.
Rule 2: RTO and RPO are usually
different from VM to VM
Needs to be
replicated
constantly
No one cares
if this dies
Rule 3: design a reliable oracle
Oracle of
Delphi
How the others do it:
How we do it:
Our approach takes advantage of three
individual factors:
● LizardFS’ thinly-provisioned snapshots
● online replication of chunks & tiering
● OpenNebula’s datastores
# An example of configuration of goals. It contains the default values.
1 1 : _
2 2 : _ _
3 3 : _ _ _
4 4 : _ _ _ _
5 5 : _ _ _ _ _
# (...)
20 20 : _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# But you don't have to specify all of them -- defaults will be assumed.
# You can define your own custom goals using labels if you use them, e.g.:
# 14 min_two_locations: _ locationA locationB # one copy in A, one in B, third anywhere
# 15 fast_access : ssd _ _ # one copy on ssd, two additional on any drives
# 16 two_manufacturers: WD HT # one on WD disk, one on HT disk
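To make the goal format easier to read, here is a minimal sketch of a parser for lines of the shape shown above (id, name, colon, one label per copy). It only covers the syntax visible on the slide and is not LizardFS code; parse_goals is a hypothetical helper.

```python
def parse_goals(text: str) -> dict[int, tuple[str, list[str]]]:
    """Map goal id -> (goal name, list of chunkserver labels).

    Each label stands for one copy of a chunk; "_" means any chunkserver.
    """
    goals = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        head, _, labels = line.partition(":")
        goal_id, name = head.split()
        goals[int(goal_id)] = (name, labels.split())
    return goals


if __name__ == "__main__":
    sample = """
    3 3 : _ _ _
    14 min_two_locations: _ locationA locationB  # one copy in A, one in B
    15 fast_access : ssd _ _
    """
    for goal_id, (name, labels) in parse_goals(sample).items():
        print(goal_id, name, "copies:", len(labels), labels)
```

The number of labels is the number of copies, so a goal such as min_two_locations keeps three copies with at least one in each named location.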
● Most disasters are “local”, for example a fire
in the server room or a flood
● Two different DR sites, one near (e.g. next
building/other side of the building) and one
far (external datacenter)
● The near DR site receives a copy of the chunks that
are part of the marked datastores
● Remote snapshots are handled in the same
way: we take a full snapshot of the
datastore, and differentially replicate it
● We use the “snapshot of snapshot” approach
to avoid the cost of deduplication
● This way we can prioritize sync queues, and
on the receiving end we get a complete,
decoupled and working OpenNebula
For example, average dedup cost for ZFS: 5 to 30 GB of dedup table data for every TB of pool data, assuming an average block size of 64K.
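The lower end of that range can be reproduced with a back-of-the-envelope calculation. The sketch below assumes roughly 320 bytes of dedup table per unique block, a commonly cited rule of thumb that does not come from the slides, and binary units (1 TB taken as 2^40 bytes).

```python
KIB = 2 ** 10
GIB = 2 ** 30
TIB = 2 ** 40


def dedup_table_bytes(pool_bytes: int, avg_block_bytes: int,
                      entry_bytes: int = 320) -> int:
    """Estimated dedup table size: one entry per block in the pool.

    entry_bytes=320 is an assumed rule-of-thumb value, not a measured one.
    """
    return (pool_bytes // avg_block_bytes) * entry_bytes


if __name__ == "__main__":
    size = dedup_table_bytes(TIB, 64 * KIB)
    print(size // GIB, "GiB of dedup table per TiB of pool data")
```

Smaller average block sizes mean more blocks per TB and therefore a larger table, which is the cost the snapshot-based approach avoids.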
/var/lib/one/datastore
↓
DRSNAP12H
/var/lib/one/snapshots
↓
<yyyymmddhh>
↓
DRSNAP12H
Local: VM changes only in snapshots
/var/lib/one/datastore
↓
DRSNAP12H
/var/lib/one/snapshots
↓
<yyyymmddhh>
↓
DRSNAP12H
Remote: no chunk changes in snapshots
inplace rsync (25x speedup)
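A minimal sketch of the layout above: it builds the timestamped snapshot path and an rsync argument list that updates files in place. The slides name "inplace rsync" but do not give a command line, so the flag selection, the host name dr-remote and both helper functions are assumptions. The command is only constructed here, not executed.

```python
from datetime import datetime
from pathlib import PurePosixPath

SNAPSHOT_ROOT = PurePosixPath("/var/lib/one/snapshots")


def snapshot_dir(when: datetime, name: str = "DRSNAP12H") -> PurePosixPath:
    """Path of the form /var/lib/one/snapshots/<yyyymmddhh>/DRSNAP12H."""
    return SNAPSHOT_ROOT / when.strftime("%Y%m%d%H") / name


def rsync_command(src: PurePosixPath, remote_host: str) -> list[str]:
    """Argument list for an in-place, delta-based rsync to the remote site.

    --inplace rewrites only the changed parts of existing files instead of
    creating a new copy, which presumably keeps unchanged chunks shared
    between snapshots on the receiving side.
    """
    return ["rsync", "-a", "--inplace", "--no-whole-file",
            f"{src}/", f"{remote_host}:{src}/"]


if __name__ == "__main__":
    path = snapshot_dir(datetime(2016, 10, 25, 12))
    print(path)
    print(" ".join(rsync_command(path, "dr-remote")))
```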
virsh# domblkstat instance-0012 --device vda
vda rd_req 128
vda rd_bytes 2344448
vda wr_req 234
vda wr_bytes 618496
vda flush_operations 2
vda rd_total_times 106512819
vda wr_total_times 960359872
vda flush_total_times 1741727
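Counters like these can be sampled twice to estimate how much each VM writes, which is one plausible input for prioritizing sync queues. The slides do not show how the numbers are consumed, so this is only a sketch: parse_domblkstat and write_rate are hypothetical helpers, and the parser handles just the three-column layout shown above.

```python
def parse_domblkstat(output: str) -> dict[str, int]:
    """Turn 'device counter value' lines into a {counter: value} dict."""
    stats = {}
    for line in output.splitlines():
        parts = line.split()
        # Skip the prompt line and anything that is not 'dev name number'.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[1]] = int(parts[2])
    return stats


def write_rate(before: dict[str, int], after: dict[str, int],
               seconds: float) -> float:
    """Average bytes written per second between two samples."""
    return (after["wr_bytes"] - before["wr_bytes"]) / seconds


if __name__ == "__main__":
    sample = """virsh# domblkstat instance-0012 --device vda
vda rd_req 128
vda rd_bytes 2344448
vda wr_req 234
vda wr_bytes 618496
"""
    first = parse_domblkstat(sample)
    # Second sample is made up for illustration: 6000 more bytes written.
    second = dict(first, wr_bytes=first["wr_bytes"] + 6000)
    print(write_rate(first, second, seconds=60))
```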
Our “pilot light” approach: a running OpenNebula on two
nodes, with its own LizardFS store. Running only two VMs: the
Oracle and the Tester
The Oracle checks if DR is needed, and may need human
confirmation for execution of the DR failover. If confirmation
is given, it takes the latest valid snapshotted datastore,
softlinks it and imports the VMs (through snapshots, so it’s
instantaneous)
The Tester makes a snapshot of the current stable snapshot,
imports the VMs and runs them in a separate, non-routed
vnet, then executes a test to see if everything works (workload
dependent), then deletes the intermediate snapshots
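The Oracle's decision logic could look roughly like the sketch below. The slides do not describe how the Oracle decides, so everything here is an assumption: the rule that all independent probes must fail several times in a row, the threshold of three, and the names Oracle and should_fail_over.

```python
from dataclasses import dataclass


@dataclass
class Oracle:
    """Declares DR only after several consecutive all-probe failures."""
    required_failures: int = 3
    consecutive_failures: int = 0

    def observe(self, probe_results: list[bool]) -> bool:
        """Record one round of probes; return True if DR looks necessary.

        A single successful probe means the primary site is still
        reachable, so the failure counter resets.
        """
        if any(probe_results):
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.required_failures


def should_fail_over(dr_needed: bool, confirmed: bool,
                     require_confirmation: bool = True,
                     stop_requested: bool = False) -> bool:
    """Gate the failover on human confirmation and a manual stop switch."""
    if stop_requested or not dr_needed:
        return False
    return confirmed or not require_confirmation


if __name__ == "__main__":
    oracle = Oracle()
    needed = False
    for _ in range(3):
        needed = oracle.observe([False, False, False])
    print(should_fail_over(needed, confirmed=True))
```

Requiring agreement between independent probes over several rounds is one way to reduce false alarms from a single flaky link.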
Only critical VMs (those with RTO < 30 mins) are executed this way.
For the VMs with higher RTO, buy one week of hardware on
demand, auto-install a node with Puppet or Ansible, and make
it join the OpenNebula cloud
Usually deployed in 30 mins. Other vendors guarantee <15 minutes.
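The split between the two recovery paths can be expressed as a simple rule on each VM's RTO. This is a sketch using the 30-minute threshold from the slide; the tier names and the example VM names are made up for illustration.

```python
from datetime import timedelta

PILOT_LIGHT_RTO = timedelta(minutes=30)


def recovery_tier(rto: timedelta) -> str:
    """Pick a recovery path for a VM based on its RTO."""
    if rto < PILOT_LIGHT_RTO:
        return "pilot-light"   # restarted on the always-on DR nodes
    return "on-demand"         # restarted on hardware rented after the event


if __name__ == "__main__":
    # Hypothetical inventory: VM name -> RTO.
    inventory = {
        "erp-db": timedelta(minutes=15),
        "intranet-wiki": timedelta(hours=8),
    }
    for vm, rto in inventory.items():
        print(vm, recovery_tier(rto))
```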
Ideal for harsh indoor environments that
require protection from falling dirt or liquid,
dust, light splashing, oil or coolant seepage.
Its NEMA Zone 4 rating also makes it perfect
for facilities located in earthquake-prone
seismic zones or any environment prone to
extreme vibration such as factories, power
stations, construction areas, shipping
facilities, warehouses, processing plants,
railroads, airports and military installations.
● Have a “big red button” to stop DR if
needed. Sometimes you are already fighting
a fire here, and you know it’s better not to
move everything in flight.
● Have two people that are competent as DR
firefighters, and give them a second phone
with a rechargeable card. And make sure
both don’t go on vacation together. (Hint:
don’t choose two married people)
● Use a gateway machine to provide a
consistent internal IP scheme, and two
different configurations for the gateway
router to provide unmodified routing for the
remaining VMs
● Aggregate functionality in a single VM (for
example, one that manages logs) to
optimize writes
● I favor consistency, so I tend to avoid
application-level replication, unless it’s
native to the app (eg. NoSQL). Otherwise
you have different solutions for different
machines (eg. quorum group in MS
replication with same UUID…)
● Try to reduce write amplification for
databases, especially MySQL. Eg. TokuDB
and its fractal tree
Thank you!
Carlo Daffara
@cdaffara
linkedin.com/in/cdaffara
