
Debunking the Myths of HDFS Erasure Coding Performance

  1. Debunking the Myths of HDFS Erasure Coding Performance
  2. Replication is Expensive
     • HDFS inherits 3-way replication from Google File System - simple, scalable, and robust
     • 200% storage overhead
     • Secondary replicas are rarely accessed
  3. Erasure Coding Saves Storage
     • Simplified example: storing 2 bits (1 0)
     • Replication: 1 0 1 0 - 2 extra bits; XOR coding: 1 ⊕ 0 = 1 - 1 extra bit
     • Same data durability - can lose any 1 bit
     • Half the storage overhead
     • Slower recovery
  4. Erasure Coding Saves Storage
     • Facebook - f4 stores 65 PB of BLOBs in EC
     • Windows Azure Storage (WAS) - a PB of new data every 1~2 days; all "sealed" data stored in EC
     • Google File System - a large portion of data stored in EC
  5. Roadmap
     • Background of EC - Redundancy Theory; EC in Distributed Storage Systems
     • HDFS-EC architecture
     • Hardware-accelerated Codec Framework
     • Performance Evaluation
  6. Durability and Efficiency
     • Data durability = how many simultaneous failures can be tolerated?
     • Storage efficiency = what portion of storage holds useful data?
     • 3-way replication: data durability = 2, storage efficiency = 1/3 (33%)
  7. Durability and Efficiency
     • XOR: data durability = 1, storage efficiency = 2/3 (67%)
     • XOR truth table:
       X Y X⊕Y
       0 0  0
       0 1  1
       1 0  1
       1 1  0
     • Recovery example: if Y is lost, Y = X ⊕ (X ⊕ Y), e.g. Y = 0 ⊕ 1 = 1
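A minimal sketch of the XOR scheme above (plain Java, not HDFS code): one parity cell is the byte-wise XOR of the data cells, and any single lost cell can be rebuilt from the parity plus the surviving cells.

```java
// XOR erasure coding in miniature: k data cells + 1 parity cell,
// tolerating the loss of any single cell.
public class XorParityDemo {
    // Compute the parity cell as the byte-wise XOR of all data cells.
    static byte[] encode(byte[][] dataCells) {
        byte[] parity = new byte[dataCells[0].length];
        for (byte[] cell : dataCells) {
            for (int i = 0; i < cell.length; i++) {
                parity[i] ^= cell[i];
            }
        }
        return parity;
    }

    // Rebuild a single lost data cell by XOR-ing the parity with the survivors.
    static byte[] recover(byte[][] survivingCells, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] cell : survivingCells) {
            for (int i = 0; i < cell.length; i++) {
                lost[i] ^= cell[i];
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        byte[][] data = { {0}, {1} };                                // the two "bits" from the slide
        byte[] parity = encode(data);                                // 0 XOR 1 = 1
        byte[] rebuilt = recover(new byte[][] { data[0] }, parity);  // lose the second cell
        System.out.println("recovered = " + rebuilt[0]);             // prints 1
    }
}
```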
  8. Durability and Efficiency
     • Reed-Solomon (RS), e.g. 4 data cells + 2 parity cells: data durability = 2, storage efficiency = 4/6 (67%)
     • Very flexible - any combination of data and parity cell counts
  9. Durability and Efficiency
     • Single replica: durability 0, efficiency 100%
     • 3-way replication: durability 2, efficiency 33%
     • XOR with 6 data cells: durability 1, efficiency 86%
     • RS (6,3): durability 3, efficiency 67%
     • RS (10,4): durability 4, efficiency 71%
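The table follows directly from the definitions: for a code with k data cells and m redundant cells, durability is m and efficiency is k / (k + m). A tiny sketch that reproduces the numbers above:

```java
// Reproduce the durability/efficiency table for a (k, m) redundancy scheme.
public class RedundancyMath {
    static void report(String name, int dataCells, int redundantCells) {
        double efficiency = 100.0 * dataCells / (dataCells + redundantCells);
        System.out.printf("%-22s durability=%d efficiency=%.0f%%%n",
                name, redundantCells, efficiency);
    }

    public static void main(String[] args) {
        report("3-way replication", 1, 2);   // 1 useful copy, 2 redundant -> 2, 33%
        report("XOR (6 data cells)", 6, 1);  // -> 1, 86%
        report("RS (6,3)", 6, 3);            // -> 3, 67%
        report("RS (10,4)", 10, 4);          // -> 4, 71%
    }
}
```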
  10. EC in Distributed Storage - Block Layout
     • Contiguous layout: data locality 👍🏻, small files 👎🏻
     • [Figure: a file divided into 128 MB contiguous blocks block0..block5, each on its own DataNode, plus parity blocks on additional DataNodes]
  11. EC in Distributed Storage - Block Layout
     • Striped layout: data locality 👎🏻, small files 👍🏻, parallel I/O 👍🏻
     • [Figure: the file is divided into 1 MB cells striped round-robin across block0..block5 on DataNodes 0-5, plus parity blocks on additional DataNodes]
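To make the striped layout concrete, here is a small sketch that maps a logical file offset to the internal block and offset holding it. The 1 MB cell size and 6-wide data group are taken from the slide; the code is illustrative, not the actual HDFS classes.

```java
// Offset mapping for a striped block group: cells are written round-robin
// across the data blocks, one stripe at a time.
public class StripedLayoutDemo {
    static final long CELL_SIZE = 1L << 20;  // 1 MB cells, as in the slide
    static final int DATA_BLOCKS = 6;        // RS(6,3): 6 data + 3 parity

    // For a logical offset, return which internal block holds it and where.
    static long[] locate(long logicalOffset) {
        long cellIndex = logicalOffset / CELL_SIZE;   // which 1 MB cell
        long blockIndex = cellIndex % DATA_BLOCKS;    // round-robin over data blocks
        long stripeIndex = cellIndex / DATA_BLOCKS;   // which stripe within the group
        long offsetInBlock = stripeIndex * CELL_SIZE + logicalOffset % CELL_SIZE;
        return new long[] { blockIndex, offsetInBlock };
    }

    public static void main(String[] args) {
        long[] loc = locate(7L << 20);   // the byte at logical offset 7 MB
        System.out.println("block " + loc[0] + ", offset " + loc[1]); // block 1, offset 1 MB
    }
}
```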
  12. EC in Distributed Storage - Spectrum
     • Two axes: replication vs. erasure coding, contiguous vs. striped layout
     • Contiguous + replication: HDFS; contiguous + EC: Facebook f4, Windows Azure
     • Striped layouts: Ceph, Quantcast File System (both with replication and EC)
  13. Roadmap
     • Background of EC (covered)
     • HDFS-EC architecture
     • Hardware-accelerated Codec Framework
     • Performance Evaluation
  14. Choosing Block Layout
     • Assuming (6,3) coding: small files < 1 block, medium files 1~6 blocks, large files > 6 blocks (1 group)
     • Cluster A profile: file count 96.29% small / 1.86% medium / 1.85% large; space usage 26.06% / 9.33% / 64.61% - top 2% of files occupy ~65% of space
     • Cluster B profile: file count 86.59% small / 11.38% medium / 2.03% large; space usage 23.89% / 36.03% / 40.08% - top 2% of files occupy ~40% of space
     • Cluster C profile: file count 99.64% small / 0.36% medium / 0.00% large; space usage 76.05% / 20.75% / 3.20% - dominated by small files
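For reference, the size classes used in these profiles can be written down directly. The thresholds below are an illustrative sketch, assuming the 128 MB block size from slide 10 and the (6,3) group width from this slide.

```java
// Classify a file as small / medium / large relative to the EC block group.
public class FileSizeClasses {
    static final long BLOCK = 128L << 20;    // 128 MB block size (slide 10)
    static final int DATA_BLOCKS = 6;        // (6,3) coding: 6 data blocks per group

    static String classify(long fileSize) {
        if (fileSize < BLOCK) return "small";                 // fits in less than one block
        if (fileSize <= DATA_BLOCKS * BLOCK) return "medium"; // spans 1~6 blocks (one group)
        return "large";                                       // needs more than one full group
    }

    public static void main(String[] args) {
        System.out.println(classify(10L << 20));    // small
        System.out.println(classify(500L << 20));   // medium
        System.out.println(classify(1L << 30));     // large (> 768 MB)
    }
}
```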
  15. Choosing Block Layout
     • [Figure: current HDFS]
  16. Generalizing the Block Concept
     • NameNode maps logical blocks to storage blocks
     • Too many storage blocks? Hierarchical naming protocol
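One way to read "hierarchical naming": the NameNode tracks a single logical block group, and each internal storage block's ID is derived from the group ID plus its index, so the group can be recovered from any member ID. The bit layout below is an illustrative assumption, not the exact HDFS encoding.

```java
// Derive internal (storage) block IDs from one block group ID, so the
// NameNode only needs to track the group.
public class BlockGroupNaming {
    static final int INDEX_BITS = 4;                 // low bits reserved for the index (assumed)
    static final long INDEX_MASK = (1L << INDEX_BITS) - 1;

    // Derive the i-th internal block ID from the block group ID.
    static long internalBlockId(long groupId, int indexInGroup) {
        return groupId | (indexInGroup & INDEX_MASK);
    }

    // Recover the group ID from any internal block ID.
    static long groupId(long internalBlockId) {
        return internalBlockId & ~INDEX_MASK;
    }

    public static void main(String[] args) {
        long group = 0x1000;                          // group IDs allocated in steps of 16
        long blk3 = internalBlockId(group, 3);
        System.out.println(groupId(blk3) == group);   // true: one NameNode entry per group
    }
}
```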
  17. Client Parallel Writing
     • [Figure: the client runs multiple streamers, each with its own packet queue, managed by a Coordinator]
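A simplified sketch of the fan-out idea (not the real HDFS output stream): the client buffers one stripe, hands each data cell to a per-block streamer queue, and enqueues parity cells for the parity streamers. Plain XOR stands in here for the real Reed-Solomon parity calculation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One queue per streamer; the "coordinator" role here is just writeStripe().
public class ParallelWriterSketch {
    static final int DATA = 6, PARITY = 3;
    final List<BlockingQueue<byte[]>> streamerQueues = new ArrayList<>();

    ParallelWriterSketch() {
        for (int i = 0; i < DATA + PARITY; i++) {
            streamerQueues.add(new LinkedBlockingQueue<>());
        }
    }

    // Push one full stripe: 6 data cells go to data streamers; XOR below is a
    // placeholder for the Reed-Solomon parity cells.
    void writeStripe(byte[][] dataCells) throws InterruptedException {
        byte[] parity = new byte[dataCells[0].length];
        for (int i = 0; i < DATA; i++) {
            streamerQueues.get(i).put(dataCells[i]);
            for (int b = 0; b < parity.length; b++) parity[b] ^= dataCells[i][b];
        }
        for (int i = DATA; i < DATA + PARITY; i++) {
            streamerQueues.get(i).put(parity.clone()); // placeholder parity cells
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ParallelWriterSketch writer = new ParallelWriterSketch();
        writer.writeStripe(new byte[DATA][1 << 20]);              // one stripe of 1 MB cells
        System.out.println(writer.streamerQueues.get(8).size());  // last parity queue holds 1 cell
    }
}
```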
  18. Client Parallel Reading
     • [Figure: the client reads data blocks in parallel; parity blocks are used to reconstruct missing data on the fly]
  19. Reconstruction on DataNode
     • Important to avoid delay on the critical path - especially if original data is lost
     • Integrated with the Replication Monitor - under-protected EC blocks are scheduled together with under-replicated blocks; new priority algorithms
     • New ErasureCodingWorker component on the DataNode
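One hedged reading of how striped and replicated blocks can share a single priority scheme: both reduce to "how many more failures can this block survive right now?". The real HDFS priority levels are more detailed; this is only a sketch.

```java
// Common measure of urgency for replicated blocks and striped block groups.
public class ReconstructionPriority {
    // Replicated block: live replicas beyond the single copy we must keep.
    static int remainingRedundancy(int liveReplicas) {
        return liveReplicas - 1;
    }

    // Striped block group: any k of the k+m internal blocks rebuild the data.
    static int remainingRedundancy(int liveInternalBlocks, int dataBlocks) {
        return liveInternalBlocks - dataBlocks;
    }

    public static void main(String[] args) {
        // A (6,3) group with 7 internal blocks left ranks with a 2-replica block:
        // both can survive exactly one more failure.
        System.out.println(remainingRedundancy(2));      // 1
        System.out.println(remainingRedundancy(7, 6));   // 1
    }
}
```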
  20. Data Checksum Support
     • Supports getFileChecksum for EC striped-mode files - checksums are comparable between striped files with the same content
     • Checksums of a contiguous file and a striped file cannot be compared
     • Can reconstruct missing blocks on the fly while computing the checksum
     • Planning a new version of getFileChecksum to make contiguous and striped checksums comparable
  21. Roadmap
     • Background of EC (covered)
     • HDFS-EC architecture (covered)
     • Hardware-accelerated Codec Framework
     • Performance Evaluation
  22. Acceleration with Intel ISA-L
     • 1 legacy coder - from Facebook's HDFS-RAID project
     • 2 new coders - pure Java (a code improvement over HDFS-RAID) and a native coder using Intel's Intelligent Storage Acceleration Library (ISA-L)
  23. Why is ISA-L Fast?
     • Coding tables are pre-computed and reused
     • Parallel operation via vectorized instructions
     • Direct ByteBuffer avoids extra data copies
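To illustrate "pre-computed and reused": Reed-Solomon arithmetic runs in GF(2^8), where multiplication reduces to lookups in tables built once up front; ISA-L additionally vectorizes these operations. A plain-Java sketch of the table idea (not ISA-L code):

```java
// GF(2^8) exp/log tables built once; multiplication becomes two lookups.
public class GfTables {
    static final int[] EXP = new int[512];
    static final int[] LOG = new int[256];

    static {
        int x = 1;
        for (int i = 0; i < 255; i++) {          // generator 0x02, field polynomial 0x11d
            EXP[i] = x;
            LOG[x] = i;
            x <<= 1;
            if ((x & 0x100) != 0) x ^= 0x11d;
        }
        for (int i = 255; i < 512; i++) EXP[i] = EXP[i - 255];  // wrap for easy addition of logs
    }

    // Multiply two field elements with two lookups and one add - no loops.
    static int mul(int a, int b) {
        if (a == 0 || b == 0) return 0;
        return EXP[LOG[a] + LOG[b]];
    }

    public static void main(String[] args) {
        System.out.println(mul(3, 7));   // field product, reusing the precomputed tables
    }
}
```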
  24. Microbenchmark: Codec Calculation
  25. Microbenchmark: Codec Calculation
  26. Microbenchmark: HDFS I/O
  27. Microbenchmark: HDFS I/O
  28. Microbenchmark: HDFS I/O
  29. DFSIO / MapReduce
  30. Hive-on-MR - locality sensitive
  31. Hive-on-Spark - locality sensitive
  32. Conclusion
     • Erasure coding expands effective storage space by ~50%!
     • HDFS-EC phase I implements erasure coding in the striped block layout
     • Upstream effort (HDFS-7285): design finalized Nov. 2014; development started Jan. 2015; 218 commits, ~25k LoC changed; broad collaboration across Cloudera, Intel, Hortonworks, Huawei, Yahoo, and LinkedIn
     • Phase II will support the contiguous block layout for better locality
  33. Acknowledgements
     • Cloudera - Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
     • Intel - Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
     • Hortonworks - Jing Zhao, Tsz Wo Nicholas Sze
     • Huawei - Vinayakumar B, Walter Su, Xinwei Qin
     • Yahoo (Japan) - Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
  34. Questions?
     • Zhe Zhang, LinkedIn - zhz@apache.org | @oldcap | http://zhe-thoughts.github.io/
     • Uma Gangumalla, Intel - umamahesh@apache.org | @UmaMaheswaraG
     • http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/
  35. Come See Us at Intel - Booth 305, "Amazing Analytics from Silicon to Software"
     • Intel powers analytics solutions that are optimized for performance and security from silicon to software
     • Intel unleashes the potential of Big Data to enable advancement in healthcare/life sciences, retail, manufacturing, telecom, and financial services
     • Intel accelerates advanced analytics and machine learning solutions
     • Twitter: #HS16SJ
  36. LinkedIn Hadoop
     • Dali: LinkedIn's logical data access layer for Hadoop
     • Meetup Thu 6/30, 6~9 PM @ LinkedIn, 2nd floor, Unite room, 2025 Stierlin Ct, Mountain View
     • Dr. Elephant: performance monitoring and tuning - SFHUG in Aug
  37. Backup
