1
DATA PROTECTION FOR
HADOOP ENVIRONMENTS
PETER MARELAS
PRINCIPAL SYSTEMS ENGINEER
DATA PROTECTION SOLUTIONS
EMC
2
• How to protect Data in Hadoop environments?
• Do we need Data Protection for Hadoop?
• What motivates people to question whether they need
to protect Hadoop?
HOW DID I GET HERE?
3
• Major backup vendors don’t have solutions
• Hadoop size and scale is a challenge
• Hadoop has inbuilt Data Protection properties
WHAT I FOUND
4
Are Hadoop’s inbuilt Data Protection
properties good enough?
QUESTION TO EXPLORE
5
ARCHITECTURE CONSTRAINTS
Traditional Enterprise Application Infrastructure
6
ARCHITECTURE CONSTRAINTS
Enterprise Hadoop Infrastructure
7
Efficient
Server-Centric
Data Protection
for
traditional
Hadoop architecture
8
Are Hadoop’s inbuilt
Data Protection
properties
good enough?
9
• Onboard Data Protection methods
– Built into HDFS
– Captive
• Offboard Data Protection methods
– Getting copies of data out of Hadoop
HADOOP INBUILT DATA PROTECTION
10
ONBOARD DATA PROTECTION
Access Layer Redundancy: NameNode HA (analogous to redundant storage controllers)
Persistence Layer Redundancy: N-way Replication (analogous to RAID/EC schemes)
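
For reference, a minimal sketch of how the persistence-layer redundancy above is exposed to applications: the replication factor can be set cluster-wide via dfs.replication or raised per file through the Java FileSystem API. The path and replication factor below are hypothetical examples, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);        // default copy count for new files
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor for a higher-value file (hypothetical path)
            Path important = new Path("/data/warehouse/critical.parquet");
            fs.setReplication(important, (short) 4);
            fs.close();
        }
    }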
11
• Proactive Data Protection
• HDFS does not assume data stays correct
• Protects against data corruption
• Verify integrity and repair from replica copies
ONBOARD DATA PROTECTION
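
As an illustration of the same integrity machinery from the client side, HDFS exposes a file-level checksum that can be compared between a source and a copy. A minimal sketch, assuming two clusters with identical block size and bytes-per-checksum settings; the NameNode addresses and path are hypothetical.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumCompare {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem source = FileSystem.get(new URI("hdfs://prod-nn:8020"), conf);
            FileSystem copy   = FileSystem.get(new URI("hdfs://dr-nn:8020"), conf);

            Path file = new Path("/data/events/part-00000");
            FileChecksum a = source.getFileChecksum(file);
            FileChecksum b = copy.getFileChecksum(file);

            // Checksums are only comparable when both clusters use the same
            // block size and checksum parameters
            System.out.println(a.equals(b) ? "copies match" : "copies differ");
        }
    }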
12
• HDFS Snapshots
• Read only
• Directory level
• Not consistent at the time of the snapshot for open files
• Consistency is preserved only for files closed before the snapshot (beware open files!)
• Data owner controls the snapshot
ONBOARD DATA PROTECTION
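
A minimal sketch of the snapshot workflow through the Java FileSystem API, illustrating the caveat above that the data owner controls the snapshot and can also remove it. It assumes fs.defaultFS points at HDFS; the directory and snapshot name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class SnapshotSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());  // assumes HDFS default FS
            Path dir = new Path("/data/warehouse");                // hypothetical directory

            // An administrator first marks the directory snapshottable
            ((DistributedFileSystem) fs).allowSnapshot(dir);

            // The data owner creates a read-only snapshot (close files first!)
            fs.createSnapshot(dir, "s20150630");

            // ... and the same owner can remove it again, deliberately or not
            fs.deleteSnapshot(dir, "s20150630");
        }
    }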
13
• HDFS Trash (recycle bin)
• Moves deleted files to the user's trash bin
• Deleted after a predefined time
• Implemented in the HDFS client
• Can be overridden or emptied by the user
• Files in the trash can be accessed or moved back
ONBOARD DATA PROTECTION
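
A minimal sketch of the trash behaviour described above: trash only applies when the client goes through it, and a plain delete (or -skipTrash on the CLI) bypasses it entirely. fs.trash.interval is in minutes; the path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class TrashSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("fs.trash.interval", 1440);   // keep trashed files for one day
            FileSystem fs = FileSystem.get(conf);

            Path victim = new Path("/data/staging/old-run");   // hypothetical path

            // Moves the file into the owner's trash bin instead of removing it
            Trash.moveToAppropriateTrash(fs, victim, conf);

            // By contrast, a direct delete bypasses trash entirely:
            // fs.delete(victim, true);
        }
    }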
14
• Distributed Copy
• HDFS, S3, OpenStack Swift, FTP, Azure (2.7.0)
• Single-file copy performance is bound to one data node
• 10 TB file @ 1 GbE ≈ 22 hours (worked through below)
OFFBOARD DATA PROTECTION
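
The 22-hour figure is simple arithmetic: distributed copy assigns each file to a single map task, so one very large file is limited by one data node's network link. A back-of-the-envelope check, assuming an effective 1 Gbit/s of throughput:

    public class SingleFileCopyTime {
        public static void main(String[] args) {
            double fileTB = 10.0;                 // file size from the slide
            double linkGbps = 1.0;                // one data node on 1 GbE

            double bits = fileTB * 1e12 * 8;      // TB -> bits (decimal units)
            double hours = bits / (linkGbps * 1e9) / 3600;
            System.out.printf("~%.1f hours%n", hours);   // prints ~22.2 hours
        }
    }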
15
To answer the question..
Is Hadoop's inbuilt data
protection good enough?
..we need to understand
what we are protecting
against…
16
DATA LOSS EVENT MATRIX
17
There is no such thing as software
that does not unexpectedly fail
18
In 2009 Hortonworks examined
HDFS’s data integrity at Yahoo!
HDFS lost 650 blocks out of
329 million blocks on 10 clusters
with 20,000 nodes
85% due to software bugs
15% due to blocks kept with a single replica
19
Condition that causes
blocks to be lost
HDFS-5042
20
HDFS now supports truncate()
No longer immutable
or write-once
HDFS-3107
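
A minimal sketch of the new behaviour, assuming Hadoop 2.7 or later; the path and length are hypothetical. The point is simply that existing file contents can now be cut back in place, so write-once can no longer be treated as a protection property.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TruncateSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/events/part-00000");   // hypothetical path

            // HDFS-3107: shrink an existing file to 1 KB in place
            boolean done = fs.truncate(file, 1024L);

            // false means block recovery completes the truncate asynchronously
            System.out.println(done ? "truncated immediately" : "truncate in progress");
        }
    }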
21
Plan for software failures..
THE MORAL OF THE STORY
Plan for human failures..
22
Not all data is equal..
Protect what is valuable..
Protect what can’t be derived
in a reasonable timeframe..
THE MORAL OF THE STORY
23
DATA PROTECTION GUIDING PRINCIPLES
24
Diversify
Loose Coupling
DATA PROTECTION GUIDING PRINCIPLES
25
Logical Isolation
Physical Isolation
Separation of Concerns
DATA PROTECTION GUIDING PRINCIPLES
26
Frequently Verified
DATA PROTECTION GUIDING PRINCIPLES
27
DATA PROTECTION STRATEGY
Data Value   Protection Method
Critical     HDFS Trash + HDFS Snapshot + Copy/Repl (HDFS->HDFS) + Versioned Copies (HDFS->Other)
Essential    HDFS Trash + HDFS Snapshot + Copy/Repl (HDFS->HDFS)
Necessary    HDFS Trash + HDFS Snapshot
Desirable    HDFS Trash
28
LIVE DEMO
Hadoop Data Protection
~
Scalable Versioned Copies
~
Data Domain Protection Storage
29
(b) www.beebotech.com.au
(t) @pmarelas
(e) peter.marelas@emc.com
THANK YOU


Editor's Notes

  • #2 Welcome. I'm Peter Marelas, Principal Systems Engineer for EMC's Data Protection Solutions Division. Today we will be learning about Data Protection for Hadoop Environments.
  • #3 Share the story of how I got here; I didn't know Hadoop 12 months ago. My day job involves architecting solutions for Enterprise customers. Most of the time I'm architecting solutions for mission-critical workloads, like EDW, CRM, ERP. But more recently customers have been asking us how to protect data in Hadoop environments, and some customers even asked us whether they need to protect Hadoop environments at all. I didn't have all the answers.
  • #4 I spent about a month researching Hadoop Data Protection. Here is what I found: none of the major backup vendors had solutions for Hadoop; Hadoop's size and scale is so daunting most customers don't even know where to start; and Hadoop has some interesting inbuilt data protection properties. I figured the first two points we could investigate and probably solve, but I wanted to understand the last point before I did anything else.
  • #5 So as part of my research I wanted to answer the question: are Hadoop's inbuilt data protection properties good enough? Before we explore that question I want to take you through some of the constraints of traditional Hadoop architectures relative to Enterprise architectures, in the context of data protection.
  • #6 This is a typical Enterprise application architecture. Blue boxes are the servers; green boxes are the application storage. There are two ways to create data protection copies. One is to stream data via the app servers to heterogeneous storage (grey boxes); that's what most backup solutions do today for Enterprise apps, assuming sufficient time and resources. We call this a server-centric protection strategy. The other option is to use versioned storage replication to create our copies and recovery points. We call this a storage-centric protection strategy.
  • #7 Contrast this with a standard Hadoop architecture, where storage and compute are combined. We cannot use storage-centric methods to protect the data (plain disk, no intelligence). So the constraint is that we have to drive the process using a server-centric approach.
  • #8 Given this constraint, another goal was to find an efficient method to protect Hadoop. I am going to demo this approach at the end of this presentation.
  • #9 So let's go back and answer this question.
  • #10 Hadoop has two types of Data Protection properties, which I have classified into onboard and offboard methods. Onboard is concerned with protecting data without leaving the cluster. Offboard is about getting copies of data out of the cluster.
  • #11 If we look at onboard protection first, Hadoop provides redundancy at the data access layer using a Highly Available NameNode. This is like having redundant storage controllers in a storage system. For the persistence layer, the Hadoop file system implements N-way replication across nodes and racks. This is equivalent to a RAID scheme for storage systems.
  • #12 Hadoop also provides proactive Data Protection. HDFS does not trust disk storage; it assumes disks will degrade and return the wrong data. To protect against this it generates checksums, regularly verifies them, and repairs corruption from replica copies.
  • #13 HDFS also supports read-only snapshots. There are two caveats with them. First, they do not behave like storage system snapshots: storage snapshots are consistent for open and closed files, while HDFS snapshots are consistent for closed files only, so if you want consistent recovery points you need to ensure files are closed before taking a snapshot. Second, keep in mind that snapshots can be deleted by data owners.
  • #14 HDFS has a trash feature that operates like a recycle bin. Files move into trash once deleted and are then removed after a predetermined time. Keep in mind it is implemented in the HDFS client, can be emptied at any time by the file owner, and can be overridden by the file owner. If you're deleting files some other way there is no trash.
  • #15 So those were the onboard data protection properties that come with Hadoop. Offboard data protection is provided by Hadoop distributed copy, which lets you create copies of files to various targets: HDFS, S3, OpenStack Swift, FTP, Azure, etc. Distributed copy is great as it distributes the work amongst nodes, so it can scale with your cluster. However, each file copy is mapped to one node, so single-file copy performance is bound by the network performance of one data node. Keep this in mind.
  • #16 So now we know what Hadoop provides out of the box with respect to Data Protection. We need to ask the question: what are we protecting against? And how do Hadoop's inbuilt methods fare?
  • #17 This is a Data Loss Event Matrix I use to assess Data Protection strategies. On the left we have the events that can lead to data loss. To the right we have the rating. Then to the right again we have the features and properties applicable to the event. And to the far right, concerns about those features relative to the event. My conclusion: Hadoop fares well when it comes to data corruption, component failures, and infrastructure software failures (firmware). It carries risk when it comes to operational failures, site failure, user accidents, application software failures, malicious user events, and malware.
  • #18 I am a big believer that software is not immune to failure. Some examples follow.
  • #19 Data integrity study @ Yahoo!: 650 blocks lost out of 329 million. That's a phenomenal achievement, but look at the causes: 85% due to software bugs, 15% due to single block replicas (operator error). The last one is interesting. What I found is that it's difficult to enforce data protection standards in Hadoop. You can set a default, but data owners can define their own and change them retrospectively.
  • #20 Although very rare, I did find one known open condition in the Apache codebase that can cause blocks to be lost.
  • #21 A new thing to keep in mind: as of the 2.7 release, HDFS supports truncate operations. In the past we assumed immutability = protection. That assumption is no longer valid.
  • #22 Moral of the story: plan for software failures, and plan for human failures.
  • #23 But be sensible in your approach. Not all data is equal. Only protect the data that is valuable, and only protect what can't be derived again in a reasonable timeframe.
  • #24 Here are my three Guiding Principles for Data Protection Strategies
  • #25 Diversify your protection copies. The analogy here is investments: we hedge our risk by spreading investments across asset classes, and we should do the same with data copies. Message: keep your protection copies on a system that is diverse and different from the source system.
  • #26 We want to maintain logical and physical isolation so that problems that impact the source system do not propagate to the target system. For this to be successful we need separation of concerns. Rule: the system protecting the source data should not trust the source system.
  • #27 We want our protection copies to be frequently verified. We don't want to assume data is written correctly or stays correct; we need to verify regularly, in an automated way.
  • #28 So here is a sample strategy you can use that aligns the protection method to the value of the data. Data that is desirable: protect it with HDFS trash only. Data that is necessary: protect it with HDFS trash and snapshots. Data that is essential: protect it with trash, snapshots and distributed copy to another HDFS target. Data that is critical: protect it with all of the above plus versioned copies to a diverse and different storage target.
  • #29 Demo of how you can use Hadoop distributed copy to create versioned copies to Data Domain, which is diverse and different from a Hadoop cluster. Data Domain is our protection storage platform. It has a few unique properties: it does inline deduplication, has strong data integrity properties, and is really fast at ingesting streaming data, which is good for distributed copy. What's unique about this approach is that we are going to use distributed copy with an efficient incremental-forever technique to maintain versioned copies (a sketch of one possible shape of such a job follows these notes).
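
For orientation only, here is one possible shape of such a job, not the demo's actual implementation: a scheduled driver that runs distributed copy from a consistent HDFS snapshot to protection storage, using -update so only new or changed files are transferred. It assumes the target is reachable from every node through an HDFS-compatible or NFS path and that the protection storage versions the target directory after each run; every name and path below is hypothetical.

    public class VersionedCopyJob {
        public static void main(String[] args) throws Exception {
            // Hypothetical source (a read-only HDFS snapshot) and NFS-mounted target
            String source = "hdfs:///data/warehouse/.snapshot/daily";
            String target = "file:///mnt/protection/warehouse";

            // -update transfers only new or changed files (incremental forever);
            // the protection storage is assumed to version the target after each run
            ProcessBuilder pb = new ProcessBuilder(
                    "hadoop", "distcp", "-update", source, target);
            pb.inheritIO();
            System.exit(pb.start().waitFor());
        }
    }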