SlideShare a Scribd company logo
Debunking the Myths of
HDFS Erasure Coding Performance
 HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
 200% storage overhead
 Secondary replicas rarely accessed
Replication is Expensive
Erasure Coding Saves Storage
 Simplified Example: storing 2 bits
 Same data durability
- can lose any 1 bit
 Half the storage overhead
 Slower recovery
1 01 0Replication:
XOR Coding: 1 0⊕ 1=
2 extra bits
1 extra bit
Erasure Coding Saves Storage
 Facebook
- f4 stores 65PB of BLOBs in EC
 Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
 Google File System
- Large portion of data stored in EC
Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
useful data
3-way Replication: Data Durability = 2
Storage Efficiency = 1/3 (33%)
redundant data
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
XOR:
Data Durability = 1
Storage Efficiency = 2/3 (67%)
useful data redundant data
X Y X ⊕ Y
0 0 0
0 1 1
1 0 1
1 1 0
Y = 0 ⊕ 1 = 1
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
Very flexible!
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = How much portion of storage is for useful data?
Data Durability Storage Efficiency
Single Replica 0 100%
3-way Replication 2 33%
XOR with 6 data cells 1 86%
RS (6,3) 3 67%
RS (10,4) 4 71%
EC in Distributed Storage
Block Layout:
Data Locality 👍🏻
Small Files 👎🏻
128~256MFile 0~128M … 640~768M
0~128
M
block0
DataNode 0
128~
256M
block1
DataNode 1
0~128M 128~256M
…
640~
768M
block5
DataNode 5 DataNode 6
…
parity
Contiguous Layout:
EC in Distributed Storage
Block Layout:
File
block0
DataNode 0
block1
DataNode 1
…
block5
DataNode 5 DataNode 6
…
parity
Striped Layout:
0~1M 1~2M 5~6M
6~7M
Data Locality 👎🏻
Small Files 👍🏻
Parallel I/O 👍🏻
0~128M 128~256M
EC in Distributed Storage
Spectrum:
Replication
Erasure
Coding
Striping
Contiguous
Ceph
Ceph
Quancast File System
Quancast File System
HDFS Facebook f4
Windows Azure
Roadmap

-
-
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Choosing Block Layout
Medium: 1~6 blocksSmall files: < 1 blockAssuming (6,3) coding Large: > 6 blocks (1 group)
96.29%
1.86% 1.85%
26.06%
9.33%
64.61%
small medium large
file count
space usage
Top 2% files occupy ~65% space
Cluster A Profile
86.59%
11.38% 2.03%
23.89%
36.03%
40.08%
file count
space
usage
Top 2% files occupy ~40% space
small medium large
Cluster B Profile
99.64%
0.36% 0.00%
76.05%
20.75%
3.20%
file count
space usage
Dominated by small files
small medium large
Cluster C Profile
Choosing Block Layout
Current
HDFS
Generalizing Block NameNode
Mapping Logical and Storage Blocks Too Many Storage Blocks?
Hierarchical Naming Protocol:
Client Parallel Writing
streamer
queue
streamer … streamer
Coordinator
Client Parallel Reading
…
parity
Reconstruction on DataNode
 Important to avoid delay on the critical path
- Especially if original data is lost
 Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
 New ErasureCodingWorker component on DataNode
Data Checksum Support
 Supports getFileChecksum for EC striped mode files
- Comparable checksums for same content striped files
- Can’t compare the checksums for contiguous file and striped file
- Can reconstruct on the fly if found block misses while computing
 Planning to introduce new version of getFileChecksum
- To achieve comparable checksums between contiguous and striped file
Roadmap

-
-

 Hardware-accelerated Codec Framework
 Performance Evaluation
Acceleration with Intel ISA-L
 1 legacy coder
- From Facebook’s HDFS-RAID project
 2 new coders
- Pure Java — code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
Why is ISA-L Fast?
pre-computed and reused
parallel
operation
Direct ByteBuffer
Microbenchmark: Codec Calculation
Microbenchmark: Codec Calculation
Microbenchmark: HDFS I/O
Microbenchmark: HDFS I/O
Microbenchmark: HDFS I/O
DFSIO / MapReduce
Hive-on-MR — locality sensitive
Hive-on-Spark — locality sensitive
Conclusion
 Erasure coding expands effective storage space by ~50%!
 HDFS-EC phase I implements erasure coding in striped block layout
 Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn
 Phase II will support contiguous block layout for better locality
Acknowledgements
 Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
 Intel
- Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
 Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
 Huawei
- Vinayakumar B, Walter Su, Xinwei Qin
 Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Questions?
Zhe Zhang, LinkedIn
zhz@apache.org | @oldcap
http://zhe-thoughts.github.io/
Uma Gangumalla, Intel
umamahesh@apache.org
@UmaMaheswaraG
http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/
Come See us at Intel - Booth 305
“Amazing Analytics from Silicon to Software”
• Intel powers analytics solutions that are optimized for
performance and security from silicon to software
• Intel unleashes the potential of Big Data to enable
advancement in healthcare/ life sciences, retail,
manufacturing, telecom and financial services
• Intel accelerates advanced analytics and machine learning
solutions
Twitter #HS16SJ
LinkedIn Hadoop
Dali: LinkedIn’s Logical
Data Access Layer for
Hadoop
Meetup Thu 6/30
6~9PM @LinkedIn
2nd floor, Unite room
2025 Stierlin Ct
Mountain View
Dr. Elephant: performance
monitoring and tuning.
SFHUG in Aug
Backup
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance

More Related Content

What's hot

Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
DataWorks Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 

What's hot (20)

The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby NodeHadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...
 
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxConAnatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 

Viewers also liked

HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
DataWorks Summit
 
Big Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics PlatformBig Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics Platform
Sudhir Tonse
 

Viewers also liked (20)

Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFS
 
HDFS Deep Dive
HDFS Deep DiveHDFS Deep Dive
HDFS Deep Dive
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on gluster
 
Hadoop 3.0 features
Hadoop 3.0 featuresHadoop 3.0 features
Hadoop 3.0 features
 
Hadoop cluster os_tuning_v1.0_20170106_mobile
Hadoop cluster os_tuning_v1.0_20170106_mobileHadoop cluster os_tuning_v1.0_20170106_mobile
Hadoop cluster os_tuning_v1.0_20170106_mobile
 
Big Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics PlatformBig Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics Platform
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Reed Solomon
Reed SolomonReed Solomon
Reed Solomon
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
빅데이터 분석을 위한 스파크 2 프로그래밍 : 대용량 데이터 처리부터 머신러닝까지
빅데이터 분석을 위한 스파크 2 프로그래밍 : 대용량 데이터 처리부터 머신러닝까지빅데이터 분석을 위한 스파크 2 프로그래밍 : 대용량 데이터 처리부터 머신러닝까지
빅데이터 분석을 위한 스파크 2 프로그래밍 : 대용량 데이터 처리부터 머신러닝까지
 
Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2
 
HDFS Analysis for Small Files
HDFS Analysis for Small FilesHDFS Analysis for Small Files
HDFS Analysis for Small Files
 

Similar to Debunking the Myths of HDFS Erasure Coding Performance

SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
Chester Chen
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma
 
Storing data in windows server 2012 ss
Storing data in windows server 2012 ssStoring data in windows server 2012 ss
Storing data in windows server 2012 ss
Kamil Bączyk
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 

Similar to Debunking the Myths of HDFS Erasure Coding Performance (20)

Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
 
Hadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesHadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologies
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hug syncsort etl hadoop big data
Hug syncsort etl hadoop big dataHug syncsort etl hadoop big data
Hug syncsort etl hadoop big data
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
 
Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
 
Storing data in windows server 2012 ss
Storing data in windows server 2012 ssStoring data in windows server 2012 ss
Storing data in windows server 2012 ss
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
 

More from DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 

Debunking the Myths of HDFS Erasure Coding Performance

  • 1. Debunking the Myths of HDFS Erasure Coding Performance
  • 2.  HDFS inherits 3-way replication from Google File System - Simple, scalable and robust  200% storage overhead  Secondary replicas rarely accessed Replication is Expensive
  • 3. Erasure Coding Saves Storage  Simplified Example: storing 2 bits  Same data durability - can lose any 1 bit  Half the storage overhead  Slower recovery 1 01 0Replication: XOR Coding: 1 0⊕ 1= 2 extra bits 1 extra bit
  • 4. Erasure Coding Saves Storage  Facebook - f4 stores 65PB of BLOBs in EC  Windows Azure Storage (WAS) - A PB of new data every 1~2 days - All “sealed” data stored in EC  Google File System - Large portion of data stored in EC
  • 5. Roadmap  Background of EC - Redundancy Theory - EC in Distributed Storage Systems  HDFS-EC architecture  Hardware-accelerated Codec Framework  Performance Evaluation
  • 6. Durability and Efficiency Data Durability = How many simultaneous failures can be tolerated? Storage Efficiency = How much portion of storage is for useful data? useful data 3-way Replication: Data Durability = 2 Storage Efficiency = 1/3 (33%) redundant data
  • 7. Durability and Efficiency Data Durability = How many simultaneous failures can be tolerated? Storage Efficiency = How much portion of storage is for useful data? XOR: Data Durability = 1 Storage Efficiency = 2/3 (67%) useful data redundant data X Y X ⊕ Y 0 0 0 0 1 1 1 0 1 1 1 0 Y = 0 ⊕ 1 = 1
  • 8. Durability and Efficiency Data Durability = How many simultaneous failures can be tolerated? Storage Efficiency = How much portion of storage is for useful data? Reed-Solomon (RS): Data Durability = 2 Storage Efficiency = 4/6 (67%) Very flexible!
  • 9. Durability and Efficiency Data Durability = How many simultaneous failures can be tolerated? Storage Efficiency = How much portion of storage is for useful data? Data Durability Storage Efficiency Single Replica 0 100% 3-way Replication 2 33% XOR with 6 data cells 1 86% RS (6,3) 3 67% RS (10,4) 4 71%
  • 10. EC in Distributed Storage Block Layout: Data Locality 👍🏻 Small Files 👎🏻 128~256MFile 0~128M … 640~768M 0~128 M block0 DataNode 0 128~ 256M block1 DataNode 1 0~128M 128~256M … 640~ 768M block5 DataNode 5 DataNode 6 … parity Contiguous Layout:
  • 11. EC in Distributed Storage Block Layout: File block0 DataNode 0 block1 DataNode 1 … block5 DataNode 5 DataNode 6 … parity Striped Layout: 0~1M 1~2M 5~6M 6~7M Data Locality 👎🏻 Small Files 👍🏻 Parallel I/O 👍🏻 0~128M 128~256M
  • 12. EC in Distributed Storage Spectrum: Replication Erasure Coding Striping Contiguous Ceph Ceph Quancast File System Quancast File System HDFS Facebook f4 Windows Azure
  • 13. Roadmap  - -  HDFS-EC architecture  Hardware-accelerated Codec Framework  Performance Evaluation
  • 14. Choosing Block Layout Medium: 1~6 blocksSmall files: < 1 blockAssuming (6,3) coding Large: > 6 blocks (1 group) 96.29% 1.86% 1.85% 26.06% 9.33% 64.61% small medium large file count space usage Top 2% files occupy ~65% space Cluster A Profile 86.59% 11.38% 2.03% 23.89% 36.03% 40.08% file count space usage Top 2% files occupy ~40% space small medium large Cluster B Profile 99.64% 0.36% 0.00% 76.05% 20.75% 3.20% file count space usage Dominated by small files small medium large Cluster C Profile
  • 16. Generalizing Block NameNode Mapping Logical and Storage Blocks Too Many Storage Blocks? Hierarchical Naming Protocol:
  • 19. Reconstruction on DataNode  Important to avoid delay on the critical path - Especially if original data is lost  Integrated with Replication Monitor - Under-protected EC blocks scheduled together with under-replicated blocks - New priority algorithms  New ErasureCodingWorker component on DataNode
  • 20. Data Checksum Support  Supports getFileChecksum for EC striped mode files - Comparable checksums for same content striped files - Can’t compare the checksums for contiguous file and striped file - Can reconstruct on the fly if found block misses while computing  Planning to introduce new version of getFileChecksum - To achieve comparable checksums between contiguous and striped file
  • 21. Roadmap  - -   Hardware-accelerated Codec Framework  Performance Evaluation
  • 22. Acceleration with Intel ISA-L  1 legacy coder - From Facebook’s HDFS-RAID project  2 new coders - Pure Java — code improvement over HDFS-RAID - Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
  • 23. Why is ISA-L Fast? pre-computed and reused parallel operation Direct ByteBuffer
  • 32. Conclusion  Erasure coding expands effective storage space by ~50%!  HDFS-EC phase I implements erasure coding in striped block layout  Upstream effort (HDFS-7285): - Design finalized Nov. 2014 - Development started Jan. 2015 - 218 commits, ~25k LoC change - Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn  Phase II will support contiguous block layout for better locality
  • 33. Acknowledgements  Cloudera - Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus  Intel - Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li  Hortonworks - Jing Zhao, Tsz Wo Nicholas Sze  Huawei - Vinayakumar B, Walter Su, Xinwei Qin  Yahoo (Japan) - Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
  • 34. Questions? Zhe Zhang, LinkedIn zhz@apache.org | @oldcap http://zhe-thoughts.github.io/ Uma Gangumalla, Intel umamahesh@apache.org @UmaMaheswaraG http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/
  • 35. Come See us at Intel - Booth 305 “Amazing Analytics from Silicon to Software” • Intel powers analytics solutions that are optimized for performance and security from silicon to software • Intel unleashes the potential of Big Data to enable advancement in healthcare/ life sciences, retail, manufacturing, telecom and financial services • Intel accelerates advanced analytics and machine learning solutions Twitter #HS16SJ
  • 36. LinkedIn Hadoop Dali: LinkedIn’s Logical Data Access Layer for Hadoop Meetup Thu 6/30 6~9PM @LinkedIn 2nd floor, Unite room 2025 Stierlin Ct Mountain View Dr. Elephant: performance monitoring and tuning. SFHUG in Aug

Editor's Notes

  1. Simply put, it doubles the storage capacity of your cluster. This talk explains how it happens. Blog post link.
  2. When the GFS paper was published more than a decade ago, the objective was to store massive amount of data on a large number of cheap commodity machines. A breakthrough design was to rely on machine-level replication to protect against machine failures, instead of xxx.
  3. A more efficient approach to reliably store data is erasure coding. Here’s a simplified example
  4. In this talk I will introduce how we implemented erasure coding in HDFS.
  5. RS uses more sophisticated linear algebra operations to generate multiple parity cells, and thus can tolerate multiple failures per group. It works by multiplying a vector of k data cells with a Generator Matrix (GT) to generate an extended codeword vector with k data cells and m parity cells. In this particular example, it combines the strong durability of replication and high efficiency of simple XOR. More importantly, flexible.
  6. To manage potentially very large files, distributed storage systems usually divide files into fixed-size logical byte ranges called logical blocks. These logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster. The simplest mapping between logical and storage blocks is a contiguous block layout, which maps each logical block one-to-one to a storage block. Reading a file with a contiguous block layout is as easy as reading each storage block linearly in sequence.
  7. To manage potentially very large files, distributed storage systems usually divide files into fixed-size logical byte ranges called logical blocks. These logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster. The simplest mapping between logical and storage blocks is a contiguous block layout, which maps each logical block one-to-one to a storage block. Reading a file with a contiguous block layout is as easy as reading each storage block linearly in sequence.
  8. Non-trivial trade-offs between x and x, y and y.
  9. In all cases, the saving from EC will be significantly lower if only applied on large files. In some cases, no savings at all.
  10. The former represents a logical byte range in a file, while the latter is the basic unit of data chunks stored on a DataNode. In the example, the file /tmp/foo is logically divided into 13 striping cells (cell_0 through cell_12). Logical block 0 represents the logical byte range of cells 0~8, and logical block 1 represents cells 9~12. Cells 0, 3, 6 form a storage block, which will be stored as a single chunk of data on a DataNode. To reduce this overhead we have introduced a new hierarchical block naming protocol. Currently HDFS allocates block IDs sequentially based on block creation time. This protocol instead divides each block ID into 2~3 sections, as illustrated in Figure 7. Each block ID starts with a flag indicating its layout (contiguous=0, striped=1). For striped blocks, the rest of the ID consists of two parts: the middle section with ID of the logical block and the tail section representing the index of a storage block in the logical block. This allows the NameNode to manage a logical block as a summary of its storage blocks. Storage block IDs can be mapped to their logical block by masking the index; this is required when the NameNode processes DataNode block reports.
  11. Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.
  12. Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.
  13. We also compared end-to-end HDFS I/O performance with these different coders against HDFS’s default scheme of three-way replication. The tests were performed on a cluster with 11 nodes (1 NameNode, 9 DataNodes, 1 client node) interconnected with 10 GigE network. Figure 9 shows the throughput results of 1) client writing a 12GB file to HDFS; and 2) client reading a 12GB file from HDFS. In the reading tests we manually killed two DataNodes so the results include decoding overhead. As shown in Figure 9, in both sequential write and read and read benchmarks, throughput is greatly constrained by the pure Java coders (HDFS-RAID and our own implementation). The ISA-L implementation is much faster than the pure Java coders because of its excellent CPU efficiency. It also outperforms replication by 2-3x because the striped layout allows the client to perform I/O with multiple DataNodes in parallel, leveraging the aggregate bandwidth of their disk drives. We have also tested read performance without any DataNode failure: HDFS-EC is roughly 5x faster than three-way replication.
  14. Phase 2 backup slide
  15. Accelerating Advanced Analytics and Machine Learning Solutions “Accelerated Machine Learning and Big Data Analytics Applications (through or with) the Trusted Analytics Platform” Need tagline for Machine Learning: need ML tagline from Nidhi Accelerating Machine Learning Applications and Big Data deployments through Trusted Analytics Platform Remove TAP tagline. Or say “Accelerate analytics on Big Data with Trusted Analytics Platform” Last line lessen words: just say “visit us at booth 409”