This document provides an overview of architecting data protection with Rubrik, presented by Andrew Miller and Rebecca Fitzhugh. It discusses key considerations for disaster recovery planning, such as business impact analyses, service level agreements, and recovery point and recovery time objectives. It introduces Rubrik's approach to data management, which aims to simplify architectures using a software-defined fabric. The presentation demonstrates Rubrik's capabilities for rapid data ingestion, intelligent SLA policies, instant recovery of VMs and files, and integration with public clouds.
2. Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
3. Rebecca Fitzhugh
Tweet @ rebeccafitzhugh
Blogger @ technicloud.com
Co-Host @ vbrownbag.com
I have a job! @ Rubrik.com
Author: vSphere Virtual Machine Management, Learning VMware vSphere
VMware VCDX #243
5. Agenda? Nah…
Share Data Protection Architecture Knowledge (more than half)
Show Where Rubrik Fits Technically + Demo (less than half)
Fair?
(Q&A Too)
6. Why bother? One big reason…
Business Expectations of Disaster Recovery / Data Protection
!=
IT Capabilities for Disaster Recovery / Data Protection
7. What Are You Really Protecting Yourself Against?
• Lost or postponed sales and income
• Regulatory fines
• Delay of new business plans
• Loss of contractual bonuses
• Customer dissatisfaction
• Timing and duration of disruption
• Increased expenses such as overtime labor and outsourcing
• Employee Burnout
8. What is a Disaster?
Disaster: An event that affects a service or system such that significant effort is required to restore
the original performance level.
• But what does that look like IN OUR ENVIRONMENT?
• What disaster and recovery scenarios should we plan for?
17. What is the most common scenario for disaster?
18. What is a Disaster?
Disaster: An event that affects a service or system such that significant effort is required to restore
the original performance level.
• But what does that look like IN OUR ENVIRONMENT?
• What disaster and recovery scenarios should we plan for?
• Where do we begin?
• How do we do it?
19. What is a Business Impact Analysis (BIA)?
• A process to understand:
– What is the monetary impact of a disaster or failure?
– What are the most time-critical and information-critical business processes?
– How does the business REALLY rely upon IT Service and Application availability?
– What availability or recoverability capabilities are justifiable based on these requirements, potential impact, and costs?
• Composed of two components
– Technical Discovery – Data Gathering
– Human Conversation – Talk to People!
20. Example Output – Priority Tiers
Priority Tier: Description
Priority 1 (High Availability / Immediate Recovery): Services whose unavailability for more than a brief period can have a severe impact on customers or time-critical business operations.
Priority 2 (1-2 day recovery): Services whose unavailability significantly impacts customers or business operations.
Priority 3 (3-5 day recovery): Services which can tolerate up to five days of disruption in a disaster.
Priority 4 (6-10 day recovery): Services which can tolerate up to ten days of disruption in a disaster. Priority 3 and 4 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first.
Priority 5 (“Best effort” recovery): Non-critical services which can tolerate two weeks or more of disruption in a disaster. These systems will be restored on a best-effort basis, after other more critical systems have been restored and ongoing operations have resumed. Priority 5 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first. In some cases, systems deemed to not be required for continued operations may not be restored.
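As a rough illustration of how the tiers above could drive restore ordering, here is a minimal Python sketch. The function names are hypothetical, and the one-day cutoff for Priority 1 is an assumption (the table only says "more than a brief period"); the other thresholds follow the table.

```python
def priority_tier(tolerable_days: float) -> int:
    """Map a service's tolerable disruption (in days) to a priority tier 1-5."""
    if tolerable_days < 0:
        raise ValueError("tolerable_days must be non-negative")
    if tolerable_days < 1:    # assumption: "brief period" means under one day
        return 1
    if tolerable_days <= 2:   # Priority 2: 1-2 day recovery
        return 2
    if tolerable_days <= 5:   # Priority 3: 3-5 day recovery
        return 3
    if tolerable_days <= 10:  # Priority 4: 6-10 day recovery
        return 4
    return 5                  # Priority 5: best-effort recovery


def recovery_order(services: dict[str, float]) -> list[str]:
    """Return service names sorted so higher-priority tiers are restored first."""
    return sorted(
        services,
        key=lambda name: (priority_tier(services[name]), services[name]),
    )


if __name__ == "__main__":
    # Illustrative services and tolerances, not taken from the presentation.
    example = {"wiki": 14, "billing": 0.1, "email": 2}
    for name in recovery_order(example):
        print(name, "-> Priority", priority_tier(example[name]))
```

In practice the tolerances would come out of the BIA conversations rather than being hard-coded.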
21. What is an SLA?
• A contract between an external service provider and its customers or between an IT department and the internal business units it serves.
22. What is an SLA?
• Two 9’s – 99% = 3.65 days of downtime per year (easy to achieve, less expensive)
• Three 9’s – 99.9% = 8.76 hours of downtime per year
• Four 9’s – 99.99% = 52.6 minutes of downtime per year
• Five 9’s – 99.999% = 5.26 minutes of downtime per year (difficult to achieve, expensive!)
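The downtime figures above follow directly from the availability percentage. A small Python sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(target)
        print(f"{target}%: {hours / 24:.2f} days = {hours:.2f} hours = {hours * 60:.2f} minutes")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the cost of achieving it rises so quickly.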
23. Disaster Recovery: Key Measures
Recovery Point Objectives (RPO)
RPO: Amount of data lost from failure, measured as the amount of time from a disaster event
Recovery Time Objectives (RTO)
RTO: Targeted amount of time to restart a business service after a disaster event
[Timeline: hourly marks from 5 a.m. to 7 p.m., with DECLARE DISASTER at 10 a.m.]
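To make the two measures concrete, the following Python sketch computes the achieved recovery point and recovery time for a single incident and compares them with their objectives. Only the 10 a.m. disaster declaration comes from the slide; the backup time, restore time, date, and objectives are illustrative assumptions.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: time between the last usable copy and the disaster."""
    return disaster - last_recovery_point


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Outage window: time between the disaster and the service restart."""
    return service_restored - disaster


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved window does not exceed it."""
    return achieved <= objective


if __name__ == "__main__":
    last_backup = datetime(2017, 1, 1, 6, 0)  # assumed: last backup at 6 a.m.
    disaster = datetime(2017, 1, 1, 10, 0)    # disaster declared at 10 a.m.
    restored = datetime(2017, 1, 1, 15, 0)    # assumed: service back at 3 p.m.

    rpo = achieved_rpo(last_backup, disaster)
    rto = achieved_rto(disaster, restored)
    print("Achieved RPO:", rpo, "meets 4h objective:", meets_objective(rpo, timedelta(hours=4)))
    print("Achieved RTO:", rto, "meets 4h objective:", meets_objective(rto, timedelta(hours=4)))
```

With these sample times, four hours of data is lost and the service is down for five hours, so a four-hour RPO is met but a four-hour RTO is not.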
24. Disaster Recovery: Key Measures
[Figure: Cost plotted against Recovery Point and Recovery Time. Each axis is scaled in Weeks, Days, Hours, Minutes, Seconds, with Real Time between the two.]
25. BC vs DR vs OR – Say What?
• Business Continuity
– All goes on as normal despite an incident
– Could lose a site and have no impact on business operations (active/active sites)
• Disaster Recovery
– To cope with & recover from an IT crisis that moves work to an alternative system in a non-routine way.
– A real “disaster” is large in scope and impact
– DR typically implies failure of the primary data center and recovery to an alternate site
• Operational Recovery
– Addresses more “routine” types of failures (server, network, storage, etc.)
– Events are smaller in scope and impact than a full disaster
– Typically implies recovering to alternate equipment within the primary data center
• Each should have its own clearly defined objectives – at minimum know the difference.
27. Complexity is the Enemy
Whatever you do. Whatever you buy.
Simplify your Architecture & Expect More
28. Key Evaluation Criteria
What we’ve seen that makes a difference…
1. Reliability of Data Recovery
a. Simplicity of Setup and Day 2 Operations – SLA Policies!
29. Data Management: 1990s to Present
[Diagram, labeled 1990s – Present and 2000s – Present. Component labels: Backup & Replication Software, Backup Storage, Backup Software, Backup Servers, Backup Proxies, Replication, Catalog Database, Tape, Off-site Archive, Backup Storage, Dedupe, Metadata]
Data Management: 2000s to Present
31. Meet Rubrik Cloud Data Management
[Diagram labels: Backup Software, Backup Servers, Backup Proxies, Replication, Catalog Database, Tape, Off-site Archive, Backup Storage, Dedupe, Metadata, Private, Public]
Software fabric for orchestrating apps and data across clouds. No forklift upgrades.
32. How It Works
Quick Start: Rack and go. Auto-discovery.
Rapid Ingest: Flash-optimized, parallel ingest accelerates snapshots and eliminates stun. Content-aware dedupe. One global namespace.
Automate: Intelligent SLA policy engine for effortless management.
Instant Recovery: Live Mount VMs & SQL. Instant search and file restore.
Secure: End-to-end encryption. Immutability to fight Ransomware.
Cloud: “CloudOut” instantly accessible with global search. Launch apps with “CloudOn” for DR or test/dev. Run apps in cloud.
[Diagram labels: Primary Environment, SLA Policy Engine, Log Management, Private, Public, NAS, AHV, Hyper-V, VMware]
33. Your Data Center Today
Backup Proxy
SAN
Production Servers
Backup Server
Search Server
Disk-Based
Backup
Tape Archive Offsite
Tape Vault
34. Rubrik Simplifies Your Data Center
SAN
Production Servers
Scale Out
Scale Out Rubrik
Replication + Long-Term
Retention + Search
Private
35. Data Management in the Cloud
On-Premises
Applications & Data
Storage
Azure Instance
Blob
Storage
Backup
Replication
Archival
Analytics
Rubrik
Cloud-Native
Applications & Data
EC2 Instance
Rubrik
36. SLA
Recovery Point Objective (RPO)
Availability Duration (Retention)
When to Archive (RTO)
Replication Schedule (DR)
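One way to picture the four settings grouped under an SLA is as a single declarative policy object. The Python sketch below is a generic illustration; the class and field names are hypothetical and do not reflect Rubrik's actual data model or API.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SlaPolicy:
    """Hypothetical policy record; field names are illustrative only."""
    name: str
    snapshot_every: timedelta             # Recovery Point Objective (RPO)
    retain_for: timedelta                 # Availability Duration (Retention)
    archive_after: timedelta              # When to Archive
    replicate_every: Optional[timedelta]  # Replication Schedule (DR); None = no replication

    def validate(self) -> None:
        """Reject settings that contradict each other."""
        if self.snapshot_every <= timedelta(0):
            raise ValueError("snapshot interval must be positive")
        if self.retain_for < self.snapshot_every:
            raise ValueError("retention must cover at least one snapshot interval")
        if self.archive_after > self.retain_for:
            raise ValueError("cannot archive after retention has expired")

    def snapshots_retained(self) -> int:
        """Number of snapshot intervals that fit in the retention window."""
        return self.retain_for // self.snapshot_every


if __name__ == "__main__":
    gold = SlaPolicy(
        name="gold",
        snapshot_every=timedelta(hours=4),
        retain_for=timedelta(days=30),
        archive_after=timedelta(days=7),
        replicate_every=timedelta(hours=4),
    )
    gold.validate()
    print(gold.name, "keeps", gold.snapshots_retained(), "snapshots")
```

The appeal of the policy approach is that a workload is assigned to a named policy once, instead of having separate backup, archive, and replication jobs scheduled for it.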
38. Key Evaluation Criteria
What we’ve seen that makes a difference…
1. Reliability of Data Recovery
a. Simplicity of Setup and Day 2 Operations – SLA Policies!
b. Immutability – is your data there when you need it?
39. Under the Hood
“The Interface”
“The Logic”
“The Core”
Distributed Task Framework
Callisto
Distributed Metadata Service
Cluster Management
Global Search
Cerebro
Data Management
Crystal
UI / API
Infinity
Ecosystem
Integration
Thor
Cloud Connect
Atlas
Cloud-Scale File System
NFS
40. Key Evaluation Criteria
What we’ve seen that makes a difference…
1. Reliability of Data Recovery
a. Simplicity of Setup and Day 2 Operations – SLA Policies!
b. Immutability – is your data there when you need it?
2. Speed of Data Recovery
a. Search + Live Mount
42. Rubrik Backup / Recovery + DR
SAN
Production Servers
Replication + Long-Term
Retention + Search
DR Servers
Rubrik
Backup S/W + Dedupe Storage
Rubrik
Replication & DR
Private
43. Key Evaluation Criteria
What we’ve seen that makes a difference…
1. Reliability of Data Recovery
a. Simplicity of Setup and Day 2 Operations – SLA Policies!
b. Immutability – is your data there when you need it?
2. Speed of Data Recovery
a. Search + Live Mount
b. API Usage / Automation to enhance restore capabilities
44. Oh… By the Way
Your App
Use an API-first platform to create powerful automation workflows that can be integrated with any service that supports outbound REST.
Now OpenAPI
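As a generic sketch of what calling a REST API from your own app involves, the Python below builds (but does not send) an authenticated JSON POST request using only the standard library. The host name, endpoint path, payload fields, and bearer-token scheme are all placeholders and assumptions, not Rubrik's documented API; consult the platform's published API reference for the real endpoints.

```python
import json
from urllib.request import Request


def build_api_request(base_url: str, token: str, path: str, payload: dict) -> Request:
    """Build a JSON POST request for a REST endpoint without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder host, path, and payload for illustration only.
    request = build_api_request(
        "https://backup.example.com/",
        "TOKEN",
        "/api/hypothetical/restore",
        {"vm": "app-server-01"},
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the workflow easy to test and to drop into whatever orchestration tool calls it.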