Less is More: 2X Storage Efficiency with HDFS Erasure Coding


Replication is Expensive
 HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
 200% storage overhead
 Secondary replicas rarely accessed

Erasure Coding Saves Storage
 Simplified Example: storing 2 bits
 Same data durability
- can lose any 1 bit
 Half the storage overhead
 Slower recovery
Replication: 1 0 stored as 1 0 1 0 (2 extra bits)
XOR Coding: 1 0 stored as 1 0 1, where 1 ⊕ 0 = 1 (1 extra bit)

Erasure Coding Saves Storage
 Facebook
- f4 stores 65PB of BLOBs in EC
 Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
 Google File System
- Large portion of data stored in EC

Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
 Hardware-accelerated Codec Framework

Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What fraction of storage holds useful data?
3-way Replication:
Data Durability = 2
Storage Efficiency = 1/3 (33%)
(1 copy of useful data, 2 copies of redundant data)

XOR:
Data Durability = 1
(useful data: X, Y; redundant data: X ⊕ Y)
X   Y   X ⊕ Y
0   0   0
0   1   1
1   0   1
1   1   0
Recovery example: if Y is lost while X = 0 and X ⊕ Y = 1, then Y = 0 ⊕ 1 = 1
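
The recovery rule in the truth table carries over from single bits to whole cells. The following is a minimal sketch, not the HDFS codec: the function names `xor_parity` and `recover_missing` are made up for illustration, and cells are assumed to be equally sized byte strings.

```python
from functools import reduce


def xor_parity(cells):
    """Return the byte-wise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def recover_missing(surviving_cells, parity):
    """Rebuild the one missing data cell from the survivors plus the parity."""
    return xor_parity(list(surviving_cells) + [parity])


# The slide's example: X = 0, Y = 1, so the parity X XOR Y is 1.
x, y = b"\x00", b"\x01"
parity = xor_parity([x, y])

# Lose Y, then rebuild it from X and the parity: Y = 0 XOR 1 = 1.
recovered_y = recover_missing([x], parity)
```

XOR tolerates only one lost cell per group, which is why the deck moves on to Reed-Solomon for higher durability.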

Reed-Solomon (RS):
Data Durability = 2
Very flexible!

                        Data Durability   Storage Efficiency
Single Replica          0                 100%
3-way Replication       2                 33%
XOR with 6 data cells   1                 86%
RS (6,3)                3                 67%
RS (10,4)               4                 71%
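
The table follows from the two definitions given earlier: durability is the number of redundant units, and efficiency is the share of data units among all stored units. A small sketch that reproduces the figures is below; the helper name and the (data units, redundant units) encoding of each scheme are assumptions made for this example.

```python
def durability_and_efficiency(data_units, redundant_units):
    """Return (tolerated failures, fraction of storage holding useful data)."""
    efficiency = data_units / (data_units + redundant_units)
    return redundant_units, efficiency


# Each scheme expressed as (data units, redundant units).
schemes = {
    "Single Replica": (1, 0),
    "3-way Replication": (1, 2),
    "XOR with 6 data cells": (6, 1),
    "RS (6,3)": (6, 3),
    "RS (10,4)": (10, 4),
}

for name, (k, m) in schemes.items():
    durability, efficiency = durability_and_efficiency(k, m)
    print(f"{name:<24}{durability:>3}{efficiency:>7.0%}")
```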

EC in Distributed Storage
Block Layout: Contiguous Layout
(Figure: a file is split into contiguous 128M ranges, one per block: 0~128M in block0 on DataNode 0, 128~256M in block1 on DataNode 1, …, 640~768M in block5 on DataNode 5; parity on DataNode 6, …)
Data Locality 👍🏻
Small Files 👎🏻

Block Layout: Striped Layout
(Figure: a file is split into 1M cells spread across blocks: 0~1M in block0 on DataNode 0, 1~2M in block1 on DataNode 1, …, 5~6M in block5 on DataNode 5, then 6~7M back in block0; parity on DataNode 6, …)
Data Locality 👎🏻
Small Files 👍🏻
Parallel I/O 👍🏻
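
To make the striped layout concrete, here is a rough sketch of how a logical file offset could map to a data block and an offset inside that block. It assumes the 1M cell size and 6 data blocks from the figure; `locate` and the constants are illustrative names, not HDFS APIs.

```python
CELL_SIZE = 1024 * 1024  # 1M cells, as drawn in the figure
DATA_BLOCKS = 6          # block0 .. block5 (parity blocks not modeled here)


def locate(offset):
    """Map a logical file offset to (data block index, offset within that block)."""
    cell_index = offset // CELL_SIZE        # which 1M cell of the file
    stripe = cell_index // DATA_BLOCKS      # which row of cells across the blocks
    block = cell_index % DATA_BLOCKS        # cells go round-robin over the blocks
    offset_in_block = stripe * CELL_SIZE + offset % CELL_SIZE
    return block, offset_in_block


MB = 1024 * 1024
first_cell = locate(0)          # 0~1M lands at the start of block0
sixth_cell = locate(5 * MB)     # 5~6M lands at the start of block5
seventh_cell = locate(6 * MB)   # 6~7M wraps around to block0, second cell
```

Because consecutive cells sit on different DataNodes, a client can read or write them in parallel, and a small file still spreads over all blocks of the group.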

Spectrum:
                 Striping                      Contiguous
Replication      Ceph, Quantcast File System   HDFS
Erasure Coding   Ceph, Quantcast File System   Facebook f4, Windows Azure

Roadmap

 HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction

Choosing Block Layout
Assuming (6,3) coding
Small files: < 1 block
Medium: 1~6 blocks
Large: > 6 blocks (1 group)

Cluster A Profile
              small    medium   large
file count    96.29%   1.86%    1.85%
space usage   26.06%   9.33%    64.61%
Top 2% files occupy ~65% space

Cluster B Profile
              small    medium   large
file count    86.59%   11.38%   2.03%
space usage   23.89%   36.03%   40.08%
Top 2% files occupy ~40% space

Cluster C Profile
              small    medium   large
file count    99.64%   0.36%    0.00%
space usage   76.05%   20.75%   3.20%
Dominated by small files

Choosing Block Layout
Current HDFS

Generalizing Block (NameNode)
Mapping Logical and Storage Blocks
Too Many Storage Blocks?
Hierarchical Naming Protocol:
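
One way to picture a hierarchical naming protocol is an ID whose high bits name the logical block (the block group) and whose low bits name the storage block inside it, so the NameNode can track one entry per group rather than one per storage block. The sketch below is a hedged illustration only: the 4-bit index width and both function names are assumptions, not taken from the slides.

```python
INDEX_BITS = 4                       # assumed width of the in-group index
INDEX_MASK = (1 << INDEX_BITS) - 1   # low bits reserved for the index


def storage_block_id(group_id, index):
    """Derive a storage block ID from its block group ID and index in the group."""
    if group_id & INDEX_MASK:
        raise ValueError("group IDs must leave the index bits clear")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("index does not fit in the reserved bits")
    return group_id | index


def split_storage_block_id(storage_id):
    """Recover (block group ID, index in group) from a storage block ID."""
    return storage_id & ~INDEX_MASK, storage_id & INDEX_MASK


group = 0x1230                                  # hypothetical block group ID
parity_block = storage_block_id(group, 7)       # e.g. 8th block of an RS (6,3) group
owner, position = split_storage_block_id(parity_block)
```

With such a scheme, a report about any storage block can be resolved to its logical block by masking, without a separate lookup table per storage block.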

Client Parallel Writing
(Figure labels: queue, streamer … streamer, Coordinator)

Client Parallel Reading
(Figure labels: …, parity)

Reconstruction on DataNode
 Important to avoid delay on the critical path
- Especially if original data is lost
 Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
 New ErasureCodingWorker component on DataNode
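
The slides do not spell out the new priority algorithms. As a purely hypothetical sketch of the idea of scheduling under-protected EC blocks together with under-replicated blocks, the example below ranks both kinds by how many further failures each can still survive. The tuple layout, names, and ranking rule are all assumptions for illustration.

```python
import heapq


def remaining_tolerance(live_units, min_units_needed):
    """How many more units can be lost before the data becomes unreadable."""
    return live_units - min_units_needed


def schedule(blocks):
    """Order degraded blocks for recovery, most endangered first.

    blocks: iterable of (name, live_units, min_units_needed, expected_units).
    A replicated block needs 1 live replica; an RS (6,3) group needs 6 live blocks.
    """
    heap = []
    for name, live, needed, expected in blocks:
        if live < expected:  # healthy blocks need no recovery work
            heapq.heappush(heap, (remaining_tolerance(live, needed), name))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


order = schedule([
    ("replicated-a", 2, 1, 3),   # one replica lost, can survive 1 more failure
    ("ec-group-b", 6, 6, 9),     # three blocks lost, cannot survive another
    ("ec-group-c", 8, 6, 9),     # one block lost, can survive 2 more failures
    ("replicated-d", 3, 1, 3),   # fully replicated, skipped
])
```

A single queue keyed on remaining tolerance lets replicated and erasure-coded data compete for recovery bandwidth on equal terms.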


Acceleration with Intel ISA-L
 1 legacy coder
- From Facebook’s HDFS-RAID project
 2 new coders
- Pure Java — code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)

Microbenchmark: Codec Calculation

Hive-on-Spark — locality sensitive

Conclusion
 Erasure coding expands effective storage space by ~50%!
 HDFS-EC phase I implements erasure coding in striped block layout
 Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
 Phase II will support contiguous block layout for better locality

Acknowledgements
 Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
 Intel
- Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
 Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
 Huawei
- Walter Su, Rakesh R, Xinwei Qin
 Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng

Questions?
Zhe Zhang, LinkedIn
zhz@apache.org | @oldcap
http://zhe-thoughts.github.io/
