
Native Erasure Coding Support Inside HDFS

Source: http://cdn.oreillystatic.com/en/assets/1/event/132/Native%20erasure%20coding%20support%20inside%20HDFS%20Presentation.pdf


  1. HDFS Erasure Coding. Zhe Zhang, zhezhang@cloudera.com
  2. Replication is Expensive: HDFS inherits 3-way replication from the Google File System: simple, scalable, and robust. But it carries a 200% storage overhead, and the secondary replicas are rarely accessed. (Diagram: the NameNode maps a block to replicas on DataNode0, DataNode1, and DataNode2.)
  3. Erasure Coding Saves Storage: simplified example, storing 2 bits (1 and 0). Replication keeps a full copy of each bit: 2 extra bits. XOR coding stores a single parity bit, 1 ⊕ 0 = 1: only 1 extra bit. Same data durability (either scheme can lose any 1 bit), half the storage overhead, but slower recovery. (A sketch in code follows.)
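To make the XOR trick concrete, here is a minimal sketch in Java. The class and method names are illustrative only; this is not HDFS code.

```java
public class XorDemo {
    // Parity is the XOR of all data cells.
    static byte xorParity(byte[] cells) {
        byte p = 0;
        for (byte c : cells) p ^= c;
        return p;
    }

    // Recover one lost cell by XOR-ing the parity with the survivors.
    static byte recover(byte[] survivors, byte parity) {
        byte r = parity;
        for (byte c : survivors) r ^= c;
        return r;
    }

    public static void main(String[] args) {
        byte[] data = {1, 0};                    // the two bits on the slide
        byte parity = xorParity(data);           // 1 ⊕ 0 = 1
        byte lost = recover(new byte[]{data[0]}, parity); // lose data[1]
        System.out.println(lost);                // prints 0, as stored
    }
}
```

One parity cell protects against any single loss among the data cells, which is exactly the durability = 1 case in the comparison table later in the deck.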
  4. Erasure Coding Saves Storage, in production:
     § Facebook: f4 stores 65 PB of BLOBs in EC.
     § Windows Azure Storage (WAS): a PB of new data every 1~2 days; all "sealed" data stored in EC.
     § Google File System: a large portion of data stored in EC.
  5. Roadmap
     § Background of EC
       - Redundancy Theory
       - EC in Distributed Storage Systems
     § HDFS-EC Architecture
       - Choosing Block Layout
       - NameNode — Generalizing the Block Concept
       - Client — Parallel I/O
       - DataNode — Background Reconstruction
     § Hardware-accelerated Codec Framework
  6. Durability and Efficiency. Data durability: how many simultaneous failures can be tolerated? Storage efficiency: what portion of the raw storage holds useful data?
  7. 3-way Replication: the NameNode maps each block to replicas on three DataNodes. Data durability = 2 (any two replicas can be lost); storage efficiency = 1/3 (33%): one replica is useful data, two are redundant.
  8. XOR Coding: truth table:

     X  Y  X ⊕ Y
     0  0    0
     0  1    1
     1  0    1
     1  1    0

     A lost cell is recovered by XOR-ing the survivors, e.g. Y = 0 ⊕ 1 = 1. Data durability = 1; storage efficiency = 2/3 (67%): two cells of useful data, one redundant parity cell.
  9. Reed-Solomon (RS): k data cells plus m parity cells; the slide's example uses 4 data and 2 parity cells, giving data durability = 2 and storage efficiency = 4/6 (67%). Very flexible: k and m can be tuned per workload. (The general rule follows.)
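The deck does not spell out the general rule on this slide, but it follows directly from the definitions above: an RS code with k data cells and m parity cells can reconstruct from any k of the k+m cells, so

```latex
\text{Data Durability} = m,
\qquad
\text{Storage Efficiency} = \frac{k}{k+m}
```

This gives durability 2 and efficiency 4/6 ≈ 67% for the 4+2 example, and matches the RS(6,3) and RS(10,4) rows in the comparison table that follows.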
  10. Durability and Efficiency compared (verified in the sketch below):

      Scheme                  Data Durability    Storage Efficiency
      Single Replica                 0                 100%
      3-way Replication              2                  33%
      XOR with 6 data cells          1                  86%
      RS (6,3)                       3                  67%
      RS (10,4)                      4                  71%
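A small Java check that reproduces the table from first principles. Each scheme stores some useful units plus some redundant units; the names here are illustrative only.

```java
public class EfficiencyTable {
    // efficiency = useful / (useful + redundant), durability as defined above
    static void row(String name, int durability, int data, int redundant) {
        double eff = 100.0 * data / (data + redundant);
        System.out.printf("%-22s durability=%d efficiency=%.0f%%%n",
                name, durability, eff);
    }

    public static void main(String[] args) {
        row("Single Replica", 0, 1, 0);        // 100%
        row("3-way Replication", 2, 1, 2);     // 33%
        row("XOR, 6 data cells", 1, 6, 1);     // 6/7 ~ 86%
        row("RS (6,3)", 3, 6, 3);              // 67%
        row("RS (10,4)", 4, 10, 4);            // 10/14 ~ 71%
    }
}
```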
  11. EC in Distributed Storage, contiguous block layout: the file is cut into 128 MB blocks; 0~128M goes to block0 on DataNode 0, 128~256M to block1 on DataNode 1, …, 640~768M to block5 on DataNode 5, with parity blocks on DataNode 6 onward. Data locality ✓, small files ✗.
  12. EC in Distributed Storage, striped block layout: the file is cut into small cells; cells 0~1M, 1~2M, …, 5~6M go round-robin across blocks on DataNodes 0~5, then cell 6~7M wraps back to the first block, with parity cells on DataNode 6 onward. Data locality ✗, small files ✓, parallel I/O ✓. (The offset mapping is sketched below.)
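A sketch of how a striped layout maps a logical file offset onto a storage block, assuming 1 MB cells and 6 data blocks per group as on the slide. The cell size and group width are parameters of the scheme, not fixed constants, and the class is illustrative rather than HDFS code.

```java
public class StripedLayout {
    static final long CELL = 1 << 20;   // 1 MB cell
    static final int DATA_BLOCKS = 6;   // data blocks per group

    // Which data block holds this file offset, and where inside that block.
    static long[] locate(long fileOffset) {
        long cellIndex = fileOffset / CELL;
        long blockIndex = cellIndex % DATA_BLOCKS;   // round-robin placement
        long stripe = cellIndex / DATA_BLOCKS;       // stripe number
        long offsetInBlock = stripe * CELL + fileOffset % CELL;
        return new long[]{blockIndex, offsetInBlock};
    }

    public static void main(String[] args) {
        long[] loc = locate(6L << 20);  // byte 6M wraps back to block 0
        System.out.println("block " + loc[0] + ", offset " + loc[1]);
    }
}
```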
  13. EC in Distributed Storage, the spectrum (block layout × redundancy form):

                      Replication            Erasure Coding
      Striped         Ceph, Quantcast FS     Ceph, Quantcast FS
      Contiguous      HDFS                   Facebook f4, Windows Azure
  14. Choosing Block Layout, assuming (6,3) coding: small files span < 1 block, medium files 1~6 blocks, large files > 6 blocks (more than one group). Cluster profiles from the slide's pie charts (a classification sketch follows):

      Cluster A (top 2% of files occupy ~65% of space):
        file count:  small 96.29%, medium 1.86%,  large 1.85%
        space usage: small 9.33%,  medium 26.06%, large 64.61%
      Cluster B (top 2% of files occupy ~40% of space):
        file count:  small 86.59%, medium 11.38%, large 2.03%
        space usage: small 23.89%, medium 36.03%, large 40.08%
      Cluster C (dominated by small files):
        file count:  small 99.64%, medium 0.36%,  large 0.00%
        space usage: small 76.05%, medium 20.75%, large 3.20%
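The size classes above reduce to a simple rule. This snippet assumes the default 128 MB HDFS block size for illustration; the slide only fixes the (6,3) group width.

```java
public class FileSizeClass {
    static final long BLOCK = 128L << 20;   // assumed 128 MB block size

    static String classify(long fileBytes) {
        if (fileBytes < BLOCK) return "small";        // < 1 block
        if (fileBytes <= 6 * BLOCK) return "medium";  // 1~6 blocks: one group
        return "large";                               // > 1 full group
    }

    public static void main(String[] args) {
        System.out.println(classify(700L << 20));     // ~5.5 blocks: medium
    }
}
```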
  15. Choosing Block Layout, phasing across the spectrum: current HDFS occupies contiguous layout + replication. Phase 1 (1.1 and 1.2) delivers erasure coding on the striped layout; Phase 2 (future work) brings erasure coding to the contiguous layout; Phase 3 (future work) covers the remaining combinations.
  16. NameNode: Generalizing the Block Concept. The NameNode maps each logical block to a group of storage blocks. Too many storage blocks to track? A hierarchical naming protocol derives every storage block's ID from its block group's ID, so the NameNode only tracks the group. (Sketch below.)
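A sketch of the hierarchical naming idea: allocate group IDs with a gap between them, and derive each storage block's ID from the group ID plus its index within the group, so the mapping needs no lookup table. The 16-ID spacing and positive IDs are illustrative assumptions; HDFS-7285 defines the real ID layout.

```java
public class BlockNaming {
    static final int IDS_PER_GROUP = 16;  // room for 6 data + 3 parity blocks

    // Group IDs are assumed to be allocated IDS_PER_GROUP apart.
    static long storageBlockId(long groupId, int indexInGroup) {
        return groupId + indexInGroup;
    }

    // Any storage block maps straight back to its group.
    static long groupIdOf(long storageBlockId) {
        return storageBlockId - (storageBlockId % IDS_PER_GROUP);
    }

    public static void main(String[] args) {
        long group = 1024;                             // multiple of 16
        System.out.println(storageBlockId(group, 7));  // 1031
        System.out.println(groupIdOf(1031));           // 1024
    }
}
```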
  17. Client: Parallel Writing. The client runs one streamer, each with its own queue, per target DataNode, pushing data cells to the data DataNodes and parity cells to the parity DataNodes concurrently; a coordinator allocates block groups and keeps the streamers in step. (Condensed sketch below.)
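A condensed sketch of that write path, with the per-streamer queues and the coordinator simplified away. XOR stands in for the Reed-Solomon parity computation, and the names are illustrative; per the backup slide, the real classes are DFSStripedOutputStream, DataStreamer, and a Coordinator.

```java
import java.util.ArrayList;
import java.util.List;

public class StripedWriter {
    static final int DATA = 6, PARITY = 3, CELL = 1 << 20; // (6,3), 1 MB cells

    // One buffered stripe: 6 data cells in, 9 cells out (data + parity).
    // Cell j of the result goes to streamer j, i.e. DataNode j.
    static List<byte[]> encodeStripe(List<byte[]> dataCells) {
        List<byte[]> out = new ArrayList<>(dataCells);
        for (int p = 0; p < PARITY; p++) {
            byte[] parity = new byte[CELL];
            for (byte[] cell : dataCells)          // placeholder parity:
                for (int i = 0; i < CELL; i++)     // the real coder is RS,
                    parity[i] ^= cell[i];          // not repeated XOR
            out.add(parity);
        }
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> stripe = new ArrayList<>();
        for (int i = 0; i < DATA; i++) stripe.add(new byte[CELL]);
        System.out.println(encodeStripe(stripe).size()); // 9 cells per stripe
    }
}
```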
  18. Client: Parallel Reading. The client reads the data blocks from their DataNodes in parallel; if a DataNode fails, it issues recovery reads to the parity DataNodes and decodes the missing cells. (Decode sketch below.)
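A sketch of the recovery-read fallback under the same simplification (XOR in place of RS): if exactly one cell of a stripe is missing, the XOR of all surviving cells, parity included, reproduces it. Names are illustrative, not HDFS code.

```java
public class StripedReader {
    // cells holds the fetched data and parity cells; null marks the lost one.
    static byte[] decodeMissing(byte[][] cells) {
        int cellSize = 0;
        for (byte[] c : cells) if (c != null) cellSize = c.length;
        byte[] recovered = new byte[cellSize];
        for (byte[] c : cells) {
            if (c == null) continue;
            for (int i = 0; i < cellSize; i++) recovered[i] ^= c[i];
        }
        return recovered; // XOR of survivors = the lost cell (XOR code only)
    }

    public static void main(String[] args) {
        byte[][] cells = {{1}, null, {1}};            // lost the middle cell
        System.out.println(decodeMissing(cells)[0]);  // 0 = 1 ^ 1
    }
}
```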
  19. Reconstruction on DataNode
      § Important to avoid delay on the critical path, especially if the original data is lost.
      § Integrated with the Replication Monitor: under-protected EC blocks are scheduled together with under-replicated blocks, using new priority algorithms.
      § A new ErasureCodingWorker component on the DataNode carries out the reconstruction.
  20. Acceleration with Intel ISA-L
      § 1 legacy coder, from Facebook's HDFS-RAID project.
      § 2 new coders: a pure-Java coder (a code improvement over HDFS-RAID) and a native coder built on Intel's Intelligent Storage Acceleration Library (ISA-L).
      (Interface sketch below.)
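The pluggable-coder idea in code: one encoder interface with several back ends (legacy HDFS-RAID, pure Java, native ISA-L via JNI). The interface below paraphrases that idea and is not the exact HDFS API; the XOR back end is a toy stand-in for the real RS coders.

```java
import java.nio.ByteBuffer;

// One coder contract, many implementations; selection is a config choice.
interface RawErasureEncoder {
    // Read the data buffers, fill the parity buffers.
    void encode(ByteBuffer[] data, ByteBuffer[] parity);
}

// A toy pure-Java back end (XOR parity) standing in for the RS coders.
class XorEncoder implements RawErasureEncoder {
    public void encode(ByteBuffer[] data, ByteBuffer[] parity) {
        ByteBuffer out = parity[0];
        for (int i = 0; i < out.remaining(); i++) {
            byte b = 0;
            for (ByteBuffer d : data) b ^= d.get(d.position() + i);
            out.put(out.position() + i, b);
        }
    }
}
```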
  21. Microbenchmark: Codec Calculation
  22. Microbenchmark: HDFS I/O
  23. Conclusion
      § Erasure coding expands effective storage space by ~50%!
      § HDFS-EC Phase I implements erasure coding in the striped block layout.
      § Upstream effort (HDFS-7285): design finalized Nov. 2014; development started Jan. 2015; 218 commits, ~25k LoC changed; broad collaboration across Cloudera, Intel, Hortonworks, Huawei, and Yahoo (Japan).
      § Phase II will support the contiguous block layout for better locality.
  24. Acknowledgements
      § Cloudera - Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
      § Intel - Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
      § Hortonworks - Jing Zhao, Tsz Wo Nicholas Sze
      § Huawei - Walter Su, Rakesh R, Xinwei Qin
      § Yahoo (Japan) - Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
  25. Questions? Just merged to trunk!

  26. Erasure coding: a type of error-correction coding. (Backup slides follow.)
  27. EC in Distributed Storage, contiguous layout (detail): file ranges 0~128M, 128~256M, …, 640~768M map to block0 on DataNode0, block1 on DataNode1, …, block5 on DataNode5; parity blocks live on DataNode6~DataNode8. Data locality ✓, small files ✗.
  28. EC in Distributed Storage, striped layout (detail): block0 on DataNode0 holds cells 0~1M, …; block1 on DataNode1 holds 1~2M, …; block5 on DataNode5 holds 5~6M, …, 127~128M; parity cells live on DataNode6~DataNode8. Data locality ✗, small files ✓, parallel I/O ✓.
  29. Client Parallel Writing (detail): DFSStripedOutputStream runs DataStreamer 0~4, each with its own dataQueue 0~4, writing blocks blk_1009~blk_1013 of one blockGroup; the Coordinator allocates each new blockGroup.
  30. Client Parallel Reading (detail): stripes 0, 1, and 2 are served by parallel reads. Requested cells come from the data-block DataNodes, cells known to be all zero are skipped, and recovery reads pull parity cells from the parity-block DataNodes to decode whatever is missing.
