1
DATA PROTECTION FOR
HADOOP ENVIRONMENTS
PETER MARELAS
PRINCIPAL SYSTEMS ENGINEER
DATA PROTECTION SOLUTIONS
EMC
2
• How to protect Data in Hadoop environments?
• Do we need Data Protection for Hadoop?
• What motivates people to question whether they need
to protect Hadoop?
HOW DID I GET HERE?
3
• Major backup vendors don’t have solutions
• Hadoop size and scale is a challenge
• Hadoop has inbuilt Data Protection properties
WHAT I FOUND
4
Are Hadoop’s inbuilt Data Protection
properties good enough?
QUESTION TO EXPLORE
5
ARCHITECTURE CONSTRAINTS
Traditional Enterprise Application Infrastructure
6
ARCHITECTURE CONSTRAINTS
Enterprise Hadoop Infrastructure
7
Efficient
Server-Centric
Data Protection
for
traditional
Hadoop architecture
8
Are Hadoop’s inbuilt
Data Protection
properties
good enough?
9
• Onboard Data Protection methods
– Built into HDFS
– Captive
• Offboard Data Protection methods
– Getting copies of data out of Hadoop
HADOOP INBUILT DATA PROTECTION
10
ONBOARD DATA PROTECTION
Access Layer Redundancy: NameNode HA (analogous to redundant storage controllers)
Persistence Layer Redundancy: N-way Replication (analogous to RAID/EC schemes)
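
For reference, a minimal sketch of how the persistence-layer redundancy above is exposed to applications: the replication factor can be set cluster-wide via dfs.replication or raised per file through the Java FileSystem API. The path and replication factor below are hypothetical examples, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);        // default copy count for new files
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor for a higher-value file (hypothetical path)
            Path important = new Path("/data/warehouse/critical.parquet");
            fs.setReplication(important, (short) 4);
            fs.close();
        }
    }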
11
• Proactive Data Protection
• HDFS does not assume data stays correct
• Protects against data corruption
• Verify integrity and repair from replica copies
ONBOARD DATA PROTECTION
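
As an illustration of the same integrity machinery from the client side, HDFS exposes a file-level checksum that can be compared between a source and a copy. A minimal sketch, assuming two clusters with identical block size and bytes-per-checksum settings; the NameNode addresses and path are hypothetical.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumCompare {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem source = FileSystem.get(new URI("hdfs://prod-nn:8020"), conf);
            FileSystem copy   = FileSystem.get(new URI("hdfs://dr-nn:8020"), conf);

            Path file = new Path("/data/events/part-00000");
            FileChecksum a = source.getFileChecksum(file);
            FileChecksum b = copy.getFileChecksum(file);

            // Checksums are only comparable when both clusters use the same
            // block size and checksum parameters
            System.out.println(a.equals(b) ? "copies match" : "copies differ");
        }
    }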
12
• HDFS Snapshots
• Read only
• Directory level
• Not consistent at the time of the snapshot for open files
• Consistency is preserved only for files closed before the snapshot (beware open files!)
• Data owner controls the snapshot
ONBOARD DATA PROTECTION
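
A minimal sketch of the snapshot workflow through the Java FileSystem API, illustrating the caveat above that the data owner controls the snapshot and can also remove it. It assumes fs.defaultFS points at HDFS; the directory and snapshot name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class SnapshotSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());  // assumes HDFS default FS
            Path dir = new Path("/data/warehouse");                // hypothetical directory

            // An administrator first marks the directory snapshottable
            ((DistributedFileSystem) fs).allowSnapshot(dir);

            // The data owner creates a read-only snapshot (close files first!)
            fs.createSnapshot(dir, "s20150630");

            // ... and the same owner can remove it again, deliberately or not
            fs.deleteSnapshot(dir, "s20150630");
        }
    }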
13
• HDFS Trash (recycle bin)
• Moves deleted files to the user's trash bin
• Deleted after a predefined time
• Implemented in the HDFS client
• Can be overridden or emptied by the user
• Files in the trash can be accessed or moved back
ONBOARD DATA PROTECTION
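
A minimal sketch of the trash behaviour described above: trash only applies when the client goes through it, and a plain delete (or -skipTrash on the CLI) bypasses it entirely. fs.trash.interval is in minutes; the path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class TrashSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("fs.trash.interval", 1440);   // keep trashed files for one day
            FileSystem fs = FileSystem.get(conf);

            Path victim = new Path("/data/staging/old-run");   // hypothetical path

            // Moves the file into the owner's trash bin instead of removing it
            Trash.moveToAppropriateTrash(fs, victim, conf);

            // By contrast, a direct delete bypasses trash entirely:
            // fs.delete(victim, true);
        }
    }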
14
• Distributed Copy
• HDFS, S3, OpenStack Swift, FTP, Azure (2.7.0)
• Single-file copy performance is bound to one data node
• 10 TB file @ 1 GbE ≈ 22 hours (worked through below)
OFFBOARD DATA PROTECTION
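
The 22-hour figure is simple arithmetic: distributed copy assigns each file to a single map task, so one very large file is limited by one data node's network link. A back-of-the-envelope check, assuming an effective 1 Gbit/s of throughput:

    public class SingleFileCopyTime {
        public static void main(String[] args) {
            double fileTB = 10.0;                 // file size from the slide
            double linkGbps = 1.0;                // one data node on 1 GbE

            double bits = fileTB * 1e12 * 8;      // TB -> bits (decimal units)
            double hours = bits / (linkGbps * 1e9) / 3600;
            System.out.printf("~%.1f hours%n", hours);   // prints ~22.2 hours
        }
    }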
15
To answer the question..
Is Hadoop's inbuilt data
protection good enough?
..we need to understand
what we are protecting
against…
16
DATA LOSS EVENT MATRIX
17
There is no such thing as software
that does not unexpectedly fail
18
In 2009 Hortonworks examined
HDFS’s data integrity at Yahoo!
HDFS lost 650 blocks out of
329 million blocks on 10 clusters
with 20,000 nodes
85% due to software bugs
15% due to blocks kept with a single replica
19
Condition that causes
blocks to be lost
HDFS-5042
20
HDFS now supports truncate()
No longer immutable
or write-once
HDFS-3107
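
A minimal sketch of the new behaviour, assuming Hadoop 2.7 or later; the path and length are hypothetical. The point is simply that existing file contents can now be cut back in place, so write-once can no longer be treated as a protection property.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TruncateSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/events/part-00000");   // hypothetical path

            // HDFS-3107: shrink an existing file to 1 KB in place
            boolean done = fs.truncate(file, 1024L);

            // false means block recovery completes the truncate asynchronously
            System.out.println(done ? "truncated immediately" : "truncate in progress");
        }
    }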
21
Plan for software failures..
THE MORAL OF THE STORY
Plan for human failures..
22
Not all data is equal..
Protect what is valuable..
Protect what can’t be derived
in a reasonable timeframe..
THE MORAL OF THE STORY
23
DATA PROTECTION GUIDING PRINCIPLES
24
Diversify
Loose Coupling
DATA PROTECTION GUIDING PRINCIPLES
25
Logical Isolation
Physical Isolation
Separation of Concerns
DATA PROTECTION GUIDING PRINCIPLES
26
Frequently Verified
DATA PROTECTION GUIDING PRINCIPLES
27
DATA PROTECTION STRATEGY
Data Value   Protection Method
Critical     HDFS Trash + HDFS Snapshot + Copy/Repl (HDFS->HDFS) + Versioned Copies (HDFS->Other)
Essential    HDFS Trash + HDFS Snapshot + Copy/Repl (HDFS->HDFS)
Necessary    HDFS Trash + HDFS Snapshot
Desirable    HDFS Trash
28
LIVE DEMO
Hadoop Data Protection
~
Scalable Versioned Copies
~
Data Domain Protection Storage
29
(b) www.beebotech.com.au
(t) @pmarelas
(e) peter.marelas@emc.com
THANK YOU


Editor's Notes

  • #2 Welcome. I'm Peter Marelas, Principal Systems Engineer for EMC's Data Protection Solutions Division. Today we will be learning about Data Protection for Hadoop Environments.
  • #3 Share the story of how I got here; I didn't know Hadoop 12 months ago. My day job involves architecting solutions for Enterprise customers. Most of the time I'm architecting solutions for mission-critical workloads, like EDW, CRM, ERP. But more recently customers have been asking us how to protect data in Hadoop environments, and some customers even asked us whether they need to protect Hadoop environments at all. I didn't have all the answers.
  • #4 I spent about a month researching Hadoop Data Protection. Here is what I found: none of the major backup vendors had solutions for Hadoop; Hadoop's size and scale is so daunting most customers don't even know where to start; and Hadoop has some interesting inbuilt data protection properties. I figured the first two points we could investigate and probably solve, but I wanted to understand the last point before I did anything else.
  • #5 So as part of my research I wanted to answer the question: are Hadoop's inbuilt data protection properties good enough? Before we explore that question I want to take you through some of the constraints of traditional Hadoop architectures relative to Enterprise architectures, in the context of data protection.
  • #6 This is a typical Enterprise application architecture. Blue boxes are the servers; green boxes are the application storage. There are two ways to create data protection copies. One is to stream data via the app servers to heterogeneous storage (grey boxes); that's what most backup solutions do today for Enterprise apps, assuming sufficient time and resources. We call this a server-centric protection strategy. The other option is to use versioned storage replication to create our copies and recovery points. We call this a storage-centric protection strategy.
  • #7 Contrast this with a standard Hadoop architecture, where storage and compute are combined. We cannot use storage-centric methods to protect the data (plain disk, no intelligence). So the constraint is that we have to drive the process using a server-centric approach.
  • #8 Given this constraint, another goal was to find an efficient method to protect Hadoop. I am going to demo this approach at the end of this presentation.
  • #9 So let's go back and answer this question.
  • #10 Hadoop has two types of Data Protection properties, which I have classified into onboard and offboard methods. Onboard is concerned with protecting data without leaving the cluster. Offboard is about getting copies of data out of the cluster.
  • #11 If we look at onboard protection first, Hadoop provides redundancy at the data access layer using a Highly Available NameNode. This is like having redundant storage controllers in a storage system. For the persistence layer, the Hadoop file system implements N-way replication across nodes and racks. This is equivalent to a RAID scheme for storage systems.
  • #12 Hadoop also provides proactive Data Protection. HDFS does not trust disk storage; it assumes disks will degrade and return the wrong data. To protect against this it generates checksums, regularly verifies them, and repairs corruption from replica copies.
  • #13 HDFS also supports read-only snapshots. There are two caveats with them. First, they do not behave like storage system snapshots: storage snapshots are consistent for open and closed files, while HDFS snapshots are consistent for closed files only, so if you want consistent recovery points you need to ensure files are closed before taking a snapshot. Second, keep in mind that snapshots can be deleted by data owners.
  • #14 HDFS has a trash feature that operates like a recycle bin. Files move into trash once deleted and are then removed after a predetermined time. Keep in mind it is implemented in the HDFS client, can be emptied at any time by the file owner, and can be overridden by the file owner. If you're deleting files some other way there is no trash.
  • #15 So those were the onboard data protection properties that come with Hadoop. Offboard data protection is provided by Hadoop distributed copy, which lets you create copies of files to various targets: HDFS, S3, OpenStack Swift, FTP, Azure, etc. Distributed copy is great as it distributes the work amongst nodes, so it can scale with your cluster. However, each file copy is mapped to one node, so single-file copy performance is bound by the network performance of one data node. Keep this in mind.
  • #16 So now we know what Hadoop provides out of the box with respect to Data Protection. We need to ask the question: what are we protecting against? And how do Hadoop's inbuilt methods fare?
  • #17 This is a Data Loss Event Matrix I use to assess Data Protection strategies. On the left we have the events that can lead to data loss. To the right we have the rating. Then to the right again we have the features and properties applicable to the event. And to the far right, concerns about those features relative to the event. My conclusion: Hadoop fares well when it comes to data corruption, component failures, and infrastructure software failures (firmware). It carries risk when it comes to operational failures, site failure, user accidents, application software failures, malicious user events, and malware.
  • #18 I am a big believer that software is not immune to failure. Some examples follow.
  • #19 Data integrity study @ Yahoo!: 650 blocks lost out of 329 million. That's a phenomenal achievement, but look at the causes: 85% due to software bugs, 15% due to single block replicas (operator error). The last one is interesting. What I found is that it's difficult to enforce data protection standards in Hadoop. You can set a default, but data owners can define their own and change them retrospectively.
  • #20 Although very rare, I did find one known open condition in the Apache codebase that can cause blocks to be lost.
  • #21 A new thing to keep in mind: as of the 2.7 release, HDFS supports truncate operations. In the past we assumed immutability = protection. That assumption is no longer valid.
  • #22 Moral of the story: plan for software failures, and plan for human failures.
  • #23 But be sensible in your approach. Not all data is equal. Only protect the data that is valuable, and only protect what can't be derived again in a reasonable timeframe.
  • #24 Here are my three Guiding Principles for Data Protection Strategies
  • #25 Diversify your protection copies. The analogy here is investments: we hedge our risk by spreading investments across asset classes, and we should do the same with data copies. Message: keep your protection copies on a system that is diverse and different from the source system.
  • #26 We want to maintain logical and physical isolation so that problems that impact the source system do not propagate to the target system. For this to be successful we need separation of concerns. Rule: the system protecting the source data should not trust the source system.
  • #27 We want our protection copies to be frequently verified. We don't want to assume data is written correctly or stays correct; we need to verify regularly, in an automated way.
  • #28 So here is a sample strategy you can use that aligns the protection method to the value of the data. Data that is desirable: protect it with HDFS trash only. Data that is necessary: protect it with HDFS trash and snapshots. Data that is essential: protect it with trash, snapshots and distributed copy to another HDFS target. Data that is critical: protect it with all of the above plus versioned copies to a diverse and different storage target.
  • #29 Demo of how you can use Hadoop distributed copy to create versioned copies to Data Domain, which is diverse and different from a Hadoop cluster. Data Domain is our protection storage platform. It has a few unique properties: it does inline deduplication, has strong data integrity properties, and is really fast at ingesting streaming data, which is good for distributed copy. What's unique about this approach is that we are going to use distributed copy with an efficient incremental-forever technique to maintain versioned copies (a sketch of one possible shape of such a job follows these notes).
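
For orientation only, here is one possible shape of such a job, not the demo's actual implementation: a scheduled driver that runs distributed copy from a consistent HDFS snapshot to protection storage, using -update so only new or changed files are transferred. It assumes the target is reachable from every node through an HDFS-compatible or NFS path and that the protection storage versions the target directory after each run; every name and path below is hypothetical.

    public class VersionedCopyJob {
        public static void main(String[] args) throws Exception {
            // Hypothetical source (a read-only HDFS snapshot) and NFS-mounted target
            String source = "hdfs:///data/warehouse/.snapshot/daily";
            String target = "file:///mnt/protection/warehouse";

            // -update transfers only new or changed files (incremental forever);
            // the protection storage is assumed to version the target after each run
            ProcessBuilder pb = new ProcessBuilder(
                    "hadoop", "distcp", "-update", source, target);
            pb.inheritIO();
            System.exit(pb.start().waitFor());
        }
    }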