Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDFS Erasure Code Storage:
Same Reliability at Better Storage Efficiency
June 10, 2015
Tsz Wo Nicholas Sze, Jing Zhao
Page 2
About Speakers
• Tsz-Wo Nicholas Sze, Ph.D.
– Software Engineer at Hortonworks
– PMC Member at Apache Hadoop
– Active contributor/committer of HDFS
– Started in 2007
– Used Hadoop to compute Pi at the two-quadrillionth (2×10^15th) bit
– It was a World Record.
• Jing Zhao, Ph.D.
– Software Engineer at Hortonworks
– PMC Member at Apache Hadoop
– Active contributor/committer of HDFS
Page 3
Current HDFS Replication Strategy
• Three replicas by default
– 1st replica on local node, local rack or random node
– 2nd and 3rd replicas on the same remote rack
– 3x storage overhead
• Reliability: tolerate 2 failures
• Good data locality
• Fast block recovery
• Expensive for
– Massive data size
– Geo-distributed disaster recovery
[Diagram: replica r1 on a DataNode in Rack I; replicas r2 and r3 on DataNodes in Rack II]
Page 4
Erasure Coding
• k data blocks + m parity blocks (k + m)
– Example: Reed-Solomon 6+3
• Reliability: tolerate m failures
• Save disk space
• Save I/O bandwidth on the write path
[Diagram: 6 data blocks b1–b6 and 3 parity blocks P1–P3]
• 1.5x storage overhead
• Tolerate any 3 failures
Borthakur, “HDFS and Erasure Codes (HDFS-RAID)”
Fan, Tantisiriroj, Xiao and Gibson, “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW’09
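For reference, a worked version of the overhead figures above (the general formula is not spelled out on the slide): with k data blocks and m parity blocks, the raw storage used per unit of logical data is

    overhead = (k + m) / k
    (6,3)-Reed-Solomon: (6 + 3) / 6 = 1.5x
    3-way replication:  (1 + 2) / 1 = 3x

while failure tolerance grows from 2 lost copies to any m = 3 lost blocks.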
Page 5
Block Reconstruction
• Block reconstruction overhead
– Higher network bandwidth cost
– Extra CPU overhead
• Local Reconstruction Codes (LRC), Hitchhiker
[Diagram: data blocks b1–b6 and parity blocks P1–P3 each placed on a different rack]
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
Page 6
Erasure Coding on Contiguous/Striped Blocks
• EC on striped blocks
– Leverage multiple disks in parallel
– Enable online Erasure Coding
– No data locality for readers
– Suitable for large files
[Diagram, striped layout: 6 data blocks b1–b6 and 3 parity blocks P1–P3; stripe 1 holds cells C1–C6 plus parity cells PC1–PC3, stripe 2 holds C7–C12 plus PC4–PC6, and so on through stripe n]
[Diagram, contiguous layout: the existing data blocks of files f1, f2, f3 are grouped and encoded into parity blocks P1–P3]
• EC on existing contiguous blocks
– Offline scanning and encoding
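To make the striped layout above concrete, here is a minimal sketch of how a logical file offset maps to a stripe, a cell, and an internal data block. The 64 KB cell size and all class/method names are illustrative assumptions for this sketch, not the actual HDFS client code.

    // Maps a logical file offset onto the (6,3) striped layout shown above.
    public class StripeMappingSketch {
      static final int CELL_SIZE = 64 * 1024;   // assumed cell size for illustration
      static final int DATA_BLOCKS = 6;         // k in the (6,3) schema

      // Returns {internal data block index 0..5, byte offset inside that block}.
      static long[] locate(long fileOffset) {
        long cellIndex = fileOffset / CELL_SIZE;          // which cell of the file
        long stripeIndex = cellIndex / DATA_BLOCKS;       // which stripe the cell is in
        int blockIndex = (int) (cellIndex % DATA_BLOCKS); // which data block (b1..b6)
        long offsetInBlock = stripeIndex * CELL_SIZE + fileOffset % CELL_SIZE;
        return new long[] { blockIndex, offsetInBlock };
      }

      public static void main(String[] args) {
        // Cell C7 (the 7th cell) starts stripe 2 and lands on b1, one cell deep.
        long[] loc = locate(6L * CELL_SIZE);
        System.out.println("block index = " + loc[0] + ", offset in block = " + loc[1]);
      }
    }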
Page 7
Technical Approach
• Phase 1 (HDFS-7285, HDFS-8031)
– Erasure Coding + Striping
– Conversion between EC files and
non-EC files
• Phase 2 (HDFS-8030)
– Erasure Coding on contiguous blocks
Source: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
Page 8
Architecture Overview
• NameNode
– Striped block support
– Schedule block reconstruction
• DFSClient
– Striped block
– Encoding/Decoding
• DataNode
– Block reconstruction
Source: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
Page 9
Erasure Coding Zone
• Create a zone on an empty directory
– Shell command: hdfs erasurecode -createZone [-s <schemaName>] <path>
• All the files under a zone directory are automatically erasure coded
– Renaming across zones with different EC schemas is disallowed
Page 10
Striped Block Groups
• NameNode (Block Manager) manages striped block groups
– Single record for a striped block group in blocksMap
– Lower memory cost
• Each block group contains k+m blocks
• Reported blocks (from DN) → striped block group
[Diagram: the NameNode / Block Manager tracks Block Group 1 (ID: b1) and Block Group 2 (ID: b2 = b1 + 16); DataNodes DN 1 … DN 9 report internal blocks with IDs b1 + 0, b1 + 1, …, b1 + 8]
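A minimal sketch of the consecutive-ID scheme shown above: because a block group reserves a contiguous range of IDs (16 apart here) and each internal block's ID is the group ID plus its index, the NameNode can recover both the group and the position within it from a reported block ID alone. The helper names are illustrative, not the actual HDFS code.

    // Derives the block group and the internal index from a reported block ID,
    // assuming groups reserve 16 consecutive IDs as on this slide.
    public class StripedBlockIdSketch {
      static long groupId(long internalBlockId) {
        return internalBlockId & ~0xFL;          // clear the low 4 bits
      }
      static int indexInGroup(long internalBlockId) {
        return (int) (internalBlockId & 0xFL);   // 0..5 = data, 6..8 = parity for (6,3)
      }
      public static void main(String[] args) {
        long b1 = -1024L;                        // an arbitrary group ID (a multiple of 16)
        long reported = b1 + 8;                  // a DataNode reports the 9th internal block
        System.out.println(groupId(reported) == b1);  // true
        System.out.println(indexInGroup(reported));   // 8
      }
    }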
Page 11
Write Pipeline for Replicated Files
[Diagram: Writer → DN1 → DN2 → DN3; data is forwarded along the pipeline, acks flow back to the Writer]
• Write pipeline
– Write to a datanode pipeline
• Durability
– Use 3 replicas to tolerate at most 2 failures
• Visibility
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
• Consistency
– Client can start reading from any replica and fail over to any other replica to read the same data
• Appendable
– Files can be reopened for append
* DN = DataNode
Page 12
hflush & hsync
• Java flush (or C/C++ fflush)
– Forces any buffered output bytes to be written out.
• HDFS hflush
– Flush data to all the datanodes in the write pipeline
– Guarantees the data written before hflush is visible for reading
– Data may be in datanode memory
• HDFS hsync
– hflush plus a local file system sync to commit data to disk
– Option to update the file length on the NameNode
– Useful with snapshots
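A small usage sketch of the hflush/hsync contract described above, using the standard FSDataOutputStream API on a replicated file (Phase 1 EC files do not support these calls, as noted later in the deck); the path is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlushExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/tmp/flush-demo"))) {
          out.writeBytes("record 1\n");
          out.hflush();   // visible to readers, but may still sit in DataNode memory
          out.writeBytes("record 2\n");
          out.hsync();    // additionally forces the DataNodes to sync to disk
        }
      }
    }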
Page 13
Parallel Write for EC Files
• Parallel write
– Client writes to a group of 9 datanodes at the same time
• Durability
– (6,3)-Reed-Solomon can tolerate at most 3 failures
• Visibility (Same as replicated files)
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
• Consistency
– Client can start reading from any 6 of the 9 internal blocks
– When reading from a datanode fails, the client can fail over to
any other remaining block to read the same data.
• Appendable (Same as replicated files)
– Files can be reopened for append
[Diagram: the Writer streams data cells to DN1…DN6 and parity cells to DN7…DN9 in parallel; each DataNode acks directly to the Writer]
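A structural sketch of the parallel write described above: the client keeps one output target per internal block, streams data cells to the six data targets, and appends encoder-produced parity cells to the three parity targets once a stripe is complete. The Encoder interface and class names are illustrative placeholders, not the real DFSClient striped output stream; real code also handles acks and failed streamers.

    import java.io.IOException;
    import java.io.OutputStream;

    public class StripedWriterSketch {
      interface Encoder { byte[][] encode(byte[][] dataCells); }  // e.g. an RS(6,3) coder

      private final OutputStream[] targets;  // 9 targets: DN1..DN6 data, DN7..DN9 parity
      private final Encoder encoder;

      StripedWriterSketch(OutputStream[] targets, Encoder encoder) {
        this.targets = targets;
        this.encoder = encoder;
      }

      // Writes one full stripe: 6 data cells, then the 3 parity cells derived from them.
      void writeStripe(byte[][] dataCells) throws IOException {
        for (int i = 0; i < dataCells.length; i++) {
          targets[i].write(dataCells[i]);                       // data cells to DN1..DN6
        }
        byte[][] parityCells = encoder.encode(dataCells);
        for (int j = 0; j < parityCells.length; j++) {
          targets[dataCells.length + j].write(parityCells[j]);  // parity cells to DN7..DN9
        }
      }
    }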
Page 14
Write Failure Handling
• Datanode failure
– Client ignores the failed datanode and continues writing.
– Able to tolerate up to 3 failures.
– Requires at least 6 datanodes.
– Missing blocks will be reconstructed later.
[Diagram: parallel write to DN1…DN6 (data) and DN7…DN9 (parity), as on the previous slide]
Page 15
Slow Writers & Replace Datanode on Failure
• Write pipeline for replicated files
– Datanode can be replaced in case of failure.
• Slow writers
– A write pipeline may last for a long time.
– The probability of datanode failures increases over time.
– Need to replace datanode on failure.
• EC files
– Do not support replace-datanode-on-failure.
– Slow writer is NOT a use case.
[Diagram: replicated write pipeline Writer → DN1 → DN2 → DN3, where a failed DataNode is replaced by DN4]
Page 16
Reading EC Files
• Parallel read
– Read from the 6 DataNodes holding the data blocks
– Support both stateful read and pread
[Diagram: the Reader fetches Block1–Block6 in parallel from DN1–DN6]
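A small sketch of the two read styles mentioned above (stateful read vs. pread), using the standard FSDataInputStream API; the file path is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[4096];
        try (FSDataInputStream in = fs.open(new Path("/ecZone/largeFile"))) {
          int n = in.read(buf);                          // stateful read: advances the stream
          int m = in.read(1 << 20, buf, 0, buf.length);  // pread at offset 1 MB: position unchanged
          System.out.println("stateful bytes = " + n + ", pread bytes = " + m);
        }
      }
    }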
Page 17
Reading with Parity Blocks
• Block reconstruction
– Read parity blocks to reconstruct missing blocks
[Diagram: Block3 is missing, so the Reader fetches the remaining data blocks from DN1, DN2, DN4–DN6 plus Parity1 from DN7 and reconstructs Block3]
Page 18
Read Failure Handling
• Failure handling
– When a datanode fails, just continue reading
from any of the remaining datanodes.
[Diagram: the DataNode serving Block5 fails mid-read; the Reader continues with the remaining DataNodes, fetching parity blocks (Parity1–Parity3) as needed to reconstruct the missing data]
Page 19
Erasure Coding Phase 1 – Basic Features
• Erasure code schema
– (6,3)-Reed-Solomon
• Write to EC files
– Continue writing as long as there are at least 6 datanodes.
– No hflush/hsync
– No append/truncate
• Read from EC files
– from closed blocks
– reconstruct missing blocks
• EC block reconstruction
– Scheduled by NameNode
– Block reconstruction on DataNodes
• Namenode changes
– EC zone and striped block group support
– Fsck to show EC block information
• File conversion
– Use distcp to copy files to/from EC zones
Page 20
Feature Development
• Current development
– HDFS-7285: Erasure Coding Phase 1
– 168 subtasks (137 subtasks resolved)
– HADOOP-11264: Common side changes
– 34 subtasks (27 subtasks resolved)
• Open source contributors
– Gao Rui, Hui Zheng, Jing Zhao, Kai Sasaki, Kai Zheng, Li Bo, Rakesh R, Takanobu
Asanuma, Takuya Fukudome, Tsz Wo Nicholas Sze, Uma Maheswara Rao G,
Vinayakumar B, Walter Su, Yi Liu, Yong Zhang, Zhe Zhang, …
– from Hortonworks, Yahoo! Japan, Intel, Cloudera, Huawei, …
Page 21
Future Work
• Follow-on work and more features
– Support hflush/hsync
– Support append/truncate
– Read from files being written
– Support more erasure code schemas
– Support contiguous layout
– Combine small files
• Future developments
– HDFS-8031: Follow on works
– 58 subtasks
– HADOOP-11842: Common side follow on works
– 13 subtasks
– HDFS-8030: Contiguous layout
– 8 subtasks
Page 22
Thank you!
Page 23
3-Replication vs (6,3)-Reed-Solomon
• Failure toleration
• Disk space usage
                     3-Replication    (6,3)-RS
Maximum toleration   2                3
n bytes of data      3n               1.5n
Page 24
3-Replication vs (6,3)-Reed-Solomon
• Name space usage
• (6,3)-RS optimization
– Use consecutive block IDs, only store the ID of the first block.
– Share the same generation stamp, only store one copy.
– Store the total size instead of individual sizes.
           3-Replication    (6,3)-RS         (6,3)-RS optimized
1 block    1 blk + 3 loc    9 blk + 9 loc    1 bg + 9 loc
2 blocks   2 blk + 6 loc    9 blk + 9 loc    1 bg + 9 loc
3 blocks   3 blk + 9 loc    9 blk + 9 loc    1 bg + 9 loc
4 blocks   4 blk + 12 loc   9 blk + 9 loc    1 bg + 9 loc
5 blocks   5 blk + 15 loc   9 blk + 9 loc    1 bg + 9 loc
6 blocks   6 blk + 18 loc   9 blk + 9 loc    1 bg + 9 loc
Page 25
3-Replication vs (6,3)-Reed-Solomon
• Number of blocks required to read the data
• Number of client-datanode connections
           3-Replication   (6,3)-RS
1 block    1               6
2 blocks   2               6
3 blocks   3               6
4 blocks   4               6
5 blocks   5               6
6 blocks   6               6

         3-Replication   (6,3)-RS
Write    1               9
Read     1               6
Page 26
The Math Behind
• Theorem
Any n > 0 points with distinct x-coordinates determine a unique polynomial of degree d <= n - 1.
• Polynomial oversampling
1. Treat the 6 data blocks as points: Block_i => (i, <data_i>) for i = 1, …, 6.
2. Compute the unique polynomial y = p(x) of degree at most 5 passing through all 6 points.
3. Evaluate additional points on the polynomial as the parity blocks: (j, p(j)) => Parity_j for j = 7, 8, 9.
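A toy worked example of the oversampling idea, using ordinary integer arithmetic for readability (a production Reed-Solomon coder performs the same construction over a finite field such as GF(2^8)):

– Take k = 2 data values d1 = 3 and d2 = 5, stored as the points (1, 3) and (2, 5).
– The unique polynomial of degree <= 1 through them is p(x) = 2x + 1.
– One parity value is p(3) = 7, stored as the point (3, 7).
– Any 2 of the 3 points recover p: for example, if d2 is lost, the points (1, 3) and (3, 7) again give p(x) = 2x + 1, so d2 = p(2) = 5.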
Page 27
Questions?

Editor's Notes

  • #3 Jing Zhao, Ph.D. – Software Engineer at Hortonworks – Committer at Apache Hadoop – Active Hadoop contributor – Contributed ~150 patches in about a year
  • #7 First, it enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.