
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency


Published at Hadoop Summit 2015



  1. HDFS Erasure Code Storage: Same Reliability at Better Storage Efficiency
     June 10, 2015
     Tsz Wo Nicholas Sze, Jing Zhao
     © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  2. About Speakers
     • Tsz-Wo Nicholas Sze, Ph.D.
       – Software Engineer at Hortonworks
       – PMC Member at Apache Hadoop
       – Active contributor/committer of HDFS
       – Started in 2007
       – Used Hadoop to compute Pi at the two-quadrillionth (2×10^15 th) bit
       – It was a world record
     • Jing Zhao, Ph.D.
       – Software Engineer at Hortonworks
       – PMC Member at Apache Hadoop
       – Active contributor/committer of HDFS
  3. Current HDFS Replication Strategy
     • Three replicas by default
       – 1st replica on the local node, local rack, or a random node
       – 2nd and 3rd replicas on the same remote rack
       – 3x storage overhead
     • Reliability: tolerate 2 failures
     • Good data locality
     • Fast block recovery
     • Expensive for
       – Massive data size
       – Geo-distributed disaster recovery
     [Diagram: replicas r1, r2, r3 placed on DataNodes in Rack I and Rack II]
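For reference, per-file replication is controlled through the ordinary FileSystem client API; a minimal sketch, where the path and the replication value are only illustrative:

```java
// Minimal sketch: inspecting and changing the replication factor of an HDFS file
// through the standard FileSystem API. The path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");     // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("current replication = " + status.getReplication());  // 3 by default

    // Raising or lowering replication trades disk and write bandwidth against durability.
    fs.setReplication(file, (short) 3);
    fs.close();
  }
}
```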
  4. Erasure Coding
     • k data blocks + m parity blocks (k + m)
       – Example: Reed-Solomon 6+3
     • Reliability: tolerate m failures
     • Save disk space
     • Save I/O bandwidth on the write path
     [Diagram: 6 data blocks (b1–b6) and 3 parity blocks (P1–P3)]
     • 1.5x storage overhead
     • Tolerate any 3 failures
     Borthakur, "HDFS and Erasure Codes (HDFS-RAID)"
     Fan, Tantisiriroj, Xiao and Gibson, "DiskReduce: RAID for Data-Intensive Scalable Computing", PDSW'09
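A small sketch of the accounting behind these numbers, assuming a generic (k, m) code and plugging in the (6, 3) example from the slide:

```java
// Toy sketch of the storage math for a (k, m) erasure code: k data blocks plus
// m parity blocks tolerate any m failures and cost (k + m) / k times the raw data size.
public class EcOverhead {
  public static void main(String[] args) {
    int k = 6, m = 3;                      // Reed-Solomon 6+3
    double overhead = (double) (k + m) / k;
    System.out.printf("tolerates %d failures at %.1fx storage overhead%n", m, overhead);
    // prints: tolerates 3 failures at 1.5x storage overhead
    // compare with 3-way replication: tolerates 2 failures at 3.0x overhead
  }
}
```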
  5. Block Reconstruction
     • Block reconstruction overhead
       – Higher network bandwidth cost
       – Extra CPU overhead
     • Local Reconstruction Codes (LRC), Hitchhiker
     [Diagram: data blocks b1–b6 and parity blocks P1–P3 spread across racks]
     Huang et al., "Erasure Coding in Windows Azure Storage", USENIX ATC'12
     Sathiamoorthy et al., "XORing Elephants: Novel Erasure Codes for Big Data", VLDB 2013
     Rashmi et al., "A 'Hitchhiker's' Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers", SIGCOMM'14
  6. Erasure Coding on Contiguous/Striped Blocks
     • EC on striped blocks
       – Leverage multiple disks in parallel
       – Enable online erasure coding
       – No data locality for readers
       – Suitable for large files
     • EC on existing contiguous blocks
       – Offline scanning and encoding
     [Diagram: striped layout maps cells C1–C12 and parity cells PC1–PC6, stripe by stripe, onto 6 data blocks and 3 parity blocks; contiguous layout encodes the existing blocks of files f1–f3 into separate parity blocks]
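To make the striped layout more concrete, the sketch below shows how a logical file offset could map to a stripe, a cell, and an internal data block. The 64 KB cell size and the simple round-robin arithmetic are assumptions for illustration, not the exact HDFS implementation.

```java
// Sketch of a striped layout: cells are distributed round-robin across the k data
// blocks of a block group. Cell size here is a hypothetical 64 KB.
public class StripeMapping {
  static final int DATA_UNITS = 6;          // k data blocks per block group
  static final int CELL_SIZE = 64 * 1024;   // hypothetical cell size in bytes

  public static void main(String[] args) {
    long offset = 1_000_000L;                        // logical offset within the file
    long cellIndex = offset / CELL_SIZE;             // which cell in the cell sequence
    long stripeIndex = cellIndex / DATA_UNITS;       // which stripe
    int blockIndex = (int) (cellIndex % DATA_UNITS); // which internal data block (0..5)
    long offsetInCell = offset % CELL_SIZE;

    System.out.printf("stripe %d, data block %d, offset in cell %d%n",
        stripeIndex, blockIndex, offsetInCell);
  }
}
```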
  7. Technical Approach
     • Phase 1 (HDFS-7285, HDFS-8031)
       – Erasure Coding + Striping
       – Conversion between EC files and non-EC files
     • Phase 2 (HDFS-8030)
       – Erasure Coding on contiguous blocks
     Source: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  8. Architecture Overview
     • NameNode
       – Striped block support
       – Schedule block reconstruction
     • DFSClient
       – Striped block support
       – Encoding/decoding
     • DataNode
       – Block reconstruction
     Source: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  9. Erasure Coding Zone
     • Create a zone on an empty directory
       – Shell command: hdfs erasurecode -createZone [-s <schemaName>] <path>
     • All the files under a zone directory are automatically erasure coded
       – Renaming across zones with different EC schemas is disallowed
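A hedged usage sketch of the zone command shown above; the directory layout and the schema name RS-6-3 are hypothetical, and the exact flag spelling may vary between builds:

```
hdfs dfs -mkdir /data/ec
hdfs erasurecode -createZone -s RS-6-3 /data/ec   # schema name is hypothetical
hdfs dfs -put large-file.bin /data/ec/            # stored erasure coded from now on
```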
  10. Striped Block Groups
      • NameNode (Block Manager) manages striped block groups
        – A single record per striped block group in blocksMap
        – Lower memory cost
      • Each block group contains k+m internal blocks
      • Blocks reported from DataNodes are mapped back to their striped block group
      [Diagram: Block Group 1 (ID: b1) with internal blocks b1 + 0, b1 + 1, ..., b1 + 8 on DN 1 – DN 9; Block Group 2 starts at ID b2 = b1 + 16; NameNode / Block Manager]
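A minimal sketch of the ID arithmetic described on this slide; the 16-ID spacing and the +0..+8 internal offsets come from the slide, while the concrete group ID is hypothetical:

```java
// Sketch of the striped block ID scheme: a block group reserves a contiguous range
// of IDs, internal block i uses groupId + i, and the next group starts 16 IDs later.
public class BlockGroupIds {
  static final int GROUP_ID_SPACING = 16;   // IDs reserved per block group (per the slide)

  static long internalBlockId(long groupId, int blockIndex) {
    return groupId + blockIndex;            // blockIndex 0..8 for a (6,3) group
  }

  public static void main(String[] args) {
    long b1 = 1L << 60;                                  // hypothetical group ID
    System.out.println(internalBlockId(b1, 0));          // first data block:  b1 + 0
    System.out.println(internalBlockId(b1, 8));          // last parity block: b1 + 8
    System.out.println("next group starts at " + (b1 + GROUP_ID_SPACING));  // b2 = b1 + 16
  }
}
```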
  11. Write Pipeline for Replicated Files
      • Write pipeline
        – Write to a datanode pipeline
      • Durability
        – Use 3 replicas to tolerate a maximum of 2 failures
      • Visibility
        – Read is supported for files being written
        – Data can be made visible by hflush/hsync
      • Consistency
        – A client can start reading from any replica and fail over to any other replica to read the same data
      • Appendable
        – Files can be reopened for append
      [Diagram: Writer streams data through the pipeline DN1 → DN2 → DN3; acks flow back]
      * DN = DataNode
  12. hflush & hsync
      • Java flush (or C/C++ fflush)
        – Forces any buffered output bytes to be written out
      • HDFS hflush
        – Flushes data to all the datanodes in the write pipeline
        – Guarantees that data written before hflush is visible for reading
        – Data may still be in datanode memory
      • HDFS hsync
        – hflush plus a local file system sync to commit data to disk
        – Option to also update the file length in the NameNode
        – Useful with snapshots
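A minimal sketch of the difference in client code, using the ordinary FSDataOutputStream API on a replicated file; the path is hypothetical, and note that Phase 1 EC files do not support these calls (see the feature list later in the deck):

```java
// Minimal sketch of hflush vs hsync on a replicated HDFS file. The path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out = fs.create(new Path("/data/log.txt"))) {  // hypothetical path
      out.writeBytes("record 1\n");
      out.hflush();   // visible to readers; data may still sit in datanode memory
      out.writeBytes("record 2\n");
      out.hsync();    // additionally asks the datanodes to sync the data to disk
    }
    fs.close();
  }
}
```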
  13. Parallel Write for EC Files
      • Parallel write
        – The client writes to a group of 9 datanodes at the same time
      • Durability
        – (6,3)-Reed-Solomon can tolerate a maximum of 3 failures
      • Visibility (same as replicated files)
        – Read is supported for files being written
        – Data can be made visible by hflush/hsync
      • Consistency
        – The client can start reading from any 6 of the 9 internal blocks
        – When reading from a datanode fails, the client can fail over to any other remaining block to read the same data
      • Appendable (same as replicated files)
        – Files can be reopened for append
      [Diagram: Writer streams data to DN1–DN6 and parity to DN7–DN9 in parallel; acks flow back]
  14. Write Failure Handling
      • Datanode failure
        – The client ignores the failed datanode and continues writing
        – Able to tolerate 3 failures
        – Requires at least 6 datanodes
        – Missing blocks will be reconstructed later
      [Diagram: Writer keeps streaming data and parity to the remaining datanodes]
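A rough sketch of the tolerance rule stated above; the names are illustrative and this is not the actual HDFS client code:

```java
// Sketch of the failure-tolerance check for a (6, 3) parallel writer: keep going
// while at least 6 of the 9 streamers are healthy; below that the stripe can no
// longer be encoded and the write must fail.
public class EcWriteTolerance {
  static final int NUM_DATA_UNITS = 6;
  static final int NUM_PARITY_UNITS = 3;

  static void checkStreamers(int healthyStreamers) {
    int failed = NUM_DATA_UNITS + NUM_PARITY_UNITS - healthyStreamers;
    if (healthyStreamers < NUM_DATA_UNITS) {
      throw new IllegalStateException("too many failed streamers: " + failed);
    }
    // otherwise: ignore the failed streamers and keep writing; the missing
    // internal blocks are reconstructed later by the NameNode/DataNodes
  }

  public static void main(String[] args) {
    checkStreamers(7);   // two failures: still fine
    checkStreamers(5);   // four failures: throws
  }
}
```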
  15. Slow Writers & Replace Datanode on Failure
      • Write pipeline for replicated files
        – A datanode can be replaced in case of failure
      • Slow writers
        – A write pipeline may last for a long time
        – The probability of datanode failures increases over time
        – Need to replace a datanode on failure
      • EC files
        – Do not support replace-datanode-on-failure
        – Slow writers are NOT a use case
      [Diagram: replicated write pipeline (Writer → DN1 → DN2 → DN3) with DN4 available as a replacement]
  16. Reading EC Files
      • Parallel read
        – Read from the 6 datanodes holding the data blocks
        – Support both stateful read and pread
      [Diagram: Reader fetches Block1–Block6 in parallel from DN1–DN6]
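For reference, both read styles are exposed through the ordinary FSDataInputStream API; a minimal sketch with a hypothetical path:

```java
// Minimal sketch of the two read styles mentioned on this slide; they are used the
// same way for EC and replicated files. The path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EcReadStyles {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[4096];
    try (FSDataInputStream in = fs.open(new Path("/data/ec/large-file.bin"))) { // hypothetical
      // stateful read: the stream keeps its own position, so consecutive reads continue
      in.seek(0);
      int n = in.read(buf);

      // pread (positioned read): the offset is given explicitly; safe for concurrent use
      int m = in.read(1_000_000L, buf, 0, buf.length);
      System.out.println("stateful read " + n + " bytes, pread " + m + " bytes");
    }
    fs.close();
  }
}
```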
  17. Reading with Parity Blocks
      • Block reconstruction
        – Read parity blocks to reconstruct missing blocks
      [Diagram: Block3 on DN3 is unavailable; the Reader reads Parity1 from DN7 and reconstructs Block3]
  18. Read Failure Handling
      • Failure handling
        – When a datanode fails, just continue reading from any of the remaining datanodes
      [Diagram: Reader falls back to other datanodes and parity blocks (Parity1–Parity3) to reconstruct the missing data]
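A rough sketch of the fallback behaviour described on slides 16-18: the client needs any 6 of the 9 internal blocks of a stripe and simply skips failed sources. The Source interface is a hypothetical stand-in, and the actual Reed-Solomon decoding step is omitted.

```java
// Sketch of EC read failure handling: collect any 6 surviving internal blocks
// (data or parity) out of 9, then hand them to a decoder to rebuild missing data.
import java.util.ArrayList;
import java.util.List;

public class EcReadSketch {
  interface Source { byte[] read() throws Exception; }   // one internal block (data or parity)

  /** Read any 6 of the 9 internal blocks of one stripe, tolerating up to 3 failures. */
  static List<byte[]> readAnySix(List<Source> allNine) {
    List<byte[]> cells = new ArrayList<>();
    for (Source s : allNine) {
      if (cells.size() == 6) break;          // enough surviving blocks to recover the stripe
      try {
        cells.add(s.read());
      } catch (Exception failedDataNode) {
        // ignore this datanode and fall back to the next one (possibly a parity block)
      }
    }
    if (cells.size() < 6) {
      throw new IllegalStateException("more than 3 internal blocks lost");
    }
    return cells;   // a Reed-Solomon decoder then rebuilds any missing data blocks
  }
}
```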
  19. Erasure Coding Phase 1 – Basic Features
      • Erasure code schema
        – (6,3)-Reed-Solomon
      • Write to EC files
        – Continue writing as long as there are at least 6 datanodes
        – No hflush/hsync
        – No append/truncate
      • Read from EC files
        – From closed blocks
        – Reconstruct missing blocks
      • EC block reconstruction
        – Scheduled by NameNode
        – Block reconstruction on DataNodes
      • NameNode changes
        – EC zone and striped block group support
        – fsck to show EC block information
      • File conversion
        – Use distcp to copy files to/from EC zones
  20. Feature Development
      • Current development
        – HDFS-7285: Erasure Coding Phase 1 – 168 subtasks (137 resolved)
        – HADOOP-11264: Common side changes – 34 subtasks (27 resolved)
      • Open source contributors
        – Gao Rui, Hui Zheng, Jing Zhao, Kai Sasaki, Kai Zheng, Li Bo, Rakesh R, Takanobu Asanuma, Takuya Fukudome, Tsz Wo Nicholas Sze, Uma Maheswara Rao G, Vinayakumar B, Walter Su, Yi Liu, Yong Zhang, Zhe Zhang, …
        – From Hortonworks, Yahoo! Japan, Intel, Cloudera, Huawei, …
  21. Future Work
      • Follow-on work and more features
        – Support hflush/hsync
        – Support append/truncate
        – Read from files being written
        – Support more erasure code schemas
        – Support contiguous layout
        – Combine small files
      • Future developments
        – HDFS-8031: Follow-on work – 58 subtasks
        – HADOOP-11842: Common side follow-on work – 13 subtasks
        – HDFS-8030: Contiguous layout – 8 subtasks
  22. Thank you!
  23. 3-Replication vs (6,3)-Reed-Solomon
      • Failure toleration
                                3-Replication   (6,3)-RS
          Maximum toleration          2             3
      • Disk space usage
                                3-Replication   (6,3)-RS
          n bytes of data            3n           1.5n
  24. 3-Replication vs (6,3)-Reed-Solomon
      • Name space usage (a file of up to 6 data blocks occupies a single (6,3)-RS block group)
                       3-Replication    (6,3)-RS        (6,3)-RS optimized
          1 block      1 blk + 3 loc    9 blk + 9 loc   1 bg + 9 loc
          2 blocks     2 blk + 6 loc    9 blk + 9 loc   1 bg + 9 loc
          3 blocks     3 blk + 9 loc    9 blk + 9 loc   1 bg + 9 loc
          4 blocks     4 blk + 12 loc   9 blk + 9 loc   1 bg + 9 loc
          5 blocks     5 blk + 15 loc   9 blk + 9 loc   1 bg + 9 loc
          6 blocks     6 blk + 18 loc   9 blk + 9 loc   1 bg + 9 loc
      • (6,3)-RS optimization
        – Use consecutive block IDs; only store the ID of the first block
        – Share the same generation stamp; only store one copy
        – Store the total size instead of individual sizes
  25. 3-Replication vs (6,3)-Reed-Solomon
      • Number of blocks required to read the data (a striped read always touches all 6 data blocks)
                       3-Replication   (6,3)-RS
          1 block            1             6
          2 blocks           2             6
          3 blocks           3             6
          4 blocks           4             6
          5 blocks           5             6
          6 blocks           6             6
      • Number of client-datanode connections
                   3-Replication   (6,3)-RS
          Write          1             9
          Read           1             6
  26. The Math Behind
      • Theorem: any n > 0 points (with distinct x-coordinates) determine a unique polynomial of degree d <= n-1.
      • Polynomial over-sampling
        1. Treat the 6 data blocks as points Block_i => (i, <data_i>) for i = 1, ..., 6.
        2. Compute the unique degree-5 polynomial y = p(x) passing through all 6 points.
        3. Evaluate the polynomial to get the parity blocks: (j, p(j)) => Parity_j for j = 7, 8, 9.
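A toy illustration of this over-sampling idea: fit the degree-5 polynomial through 6 points with Lagrange interpolation and evaluate it at 7, 8, 9. Real Reed-Solomon performs the same arithmetic over the finite field GF(2^8) rather than over doubles, and the data values here are made up.

```java
// Toy Reed-Solomon illustration over doubles (not the finite-field math used in HDFS):
// 6 data values define a unique degree-5 polynomial; its values at x = 7, 8, 9 are parity.
public class RsToyExample {
  // Evaluate the interpolating polynomial through (xs[i], ys[i]) at x (Lagrange form).
  static double lagrangeEval(double[] xs, double[] ys, double x) {
    double sum = 0;
    for (int i = 0; i < xs.length; i++) {
      double term = ys[i];
      for (int j = 0; j < xs.length; j++) {
        if (j != i) term *= (x - xs[j]) / (xs[i] - xs[j]);
      }
      sum += term;
    }
    return sum;
  }

  public static void main(String[] args) {
    double[] xs = {1, 2, 3, 4, 5, 6};
    double[] ys = {10, 42, 7, 99, 23, 5};          // 6 "data blocks" (made-up values)
    for (int j = 7; j <= 9; j++) {                 // 3 "parity blocks"
      System.out.printf("Parity%d = p(%d) = %.1f%n", j - 6, j, lagrangeEval(xs, ys, j));
    }
    // Any 6 of the 9 points (x = 1..9) determine the same degree-5 polynomial,
    // so any 3 of them can be lost and still be recovered.
  }
}
```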
  27. Questions?
