Consistent Reads from Standby Node
Konstantin V Shvachko
Sr. Staff Software Engineer
@LinkedIn
Chen Liang
Senior Software Engineer
@LinkedIn
Chao Sun
Software Engineer
@Uber
Agenda
HDFS CONSISTENT READ FROM STANDBY
1
• Motivation
• Consistent Reads from Standby
• Challenges
• Design and Implementation
• Next steps
The Team
2
• Konstantin Shvachko (LinkedIn)
• Chen Liang (LinkedIn)
• Erik Krogen (LinkedIn)
• Chao Sun (Uber)
• Plamen Jeliazkov (Paypal)
Consistent Reads From
Standby Nodes
Motivation
4
• 2x Growth/Year In Workloads and Size
• Approaching active NameNode performance limits rapidly
• We need a scalability solution
• Key Insights:
• Reads comprise 95% of all metadata operations in our practice
• Another source of truth for read: Standby Nodes
• Standby Nodes Serving Read Requests
• Can substantially decrease active NameNode workload
• Allowing cluster to scale further!
Architecture
ROLE OF STANDBY NODES
5
[Diagram: Active NameNode and Standby NameNodes above the DataNodes; the Active writes edits to the JournalNodes and the Standbys read them; writes go to the Active, reads can go to either]
• Standby nodes have the same copy of all metadata (with some delay)
• Standby Node syncs edits from the Active NameNode
• Standby nodes can potentially serve read requests
• All reads can go to Standby nodes
• OR, time-critical applications can still choose to read from the Active only
Challenges
6
[Diagram: same topology; the Active NameNode writes edits to the JournalNodes while the Standby NameNodes read them]
• Standby Node delay
• ANN writes edits to the JNs, then the SbNN applies the edits from the JNs
• Delay is on the order of minutes
• Consistency
• If a client performs a read after a write, the client expects to see the state change
Fast Journaling
DELAY REDUCTION
7
• Fast Edit Tailing HDFS-13150
• Current JN is slow: serving whole segments of edits from disk
• Optimizations on JN and SbNN (see the sketch below)
o JN caches recent edits in memory; only applied edits are served
o SbNN requests only recent edits through RPC calls
o Falls back to the existing mechanism on error
• Significantly reduces SbNN delay
o From about 1 minute down to 2 to 50 milliseconds
• Standby node delay is no more than a few ms in most cases
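A hypothetical sketch of the serve-from-memory idea behind HDFS-13150: a bounded in-memory cache of recent edits keyed by transaction id, where a cache miss tells the Standby to fall back to reading full edit segments from disk. The class and method names are made up and only illustrate the mechanism, not the actual JournalNode code.

// Illustrative only: a simplified in-memory cache of recent edits, keyed by
// transaction id, with a bounded size.
import java.util.Optional;
import java.util.TreeMap;

class RecentEditsCache {
  private final TreeMap<Long, byte[]> editsByTxId = new TreeMap<>();
  private final int maxEntries;

  RecentEditsCache(int maxEntries) {
    this.maxEntries = maxEntries;
  }

  synchronized void put(long txId, byte[] serializedEdit) {
    editsByTxId.put(txId, serializedEdit);
    // Evict the oldest entries so memory stays bounded.
    while (editsByTxId.size() > maxEntries) {
      editsByTxId.pollFirstEntry();
    }
  }

  // Returns edits starting at sinceTxId, or empty if they have already been
  // evicted; an empty result tells the caller (SbNN) to fall back to the
  // existing mechanism of streaming full edit segments from disk.
  synchronized Optional<TreeMap<Long, byte[]>> getEditsSince(long sinceTxId) {
    if (!editsByTxId.isEmpty() && editsByTxId.firstKey() > sinceTxId) {
      return Optional.empty();  // requested edits too old: cache miss
    }
    return Optional.of(new TreeMap<>(editsByTxId.tailMap(sinceTxId, true)));
  }
}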
Consistency Model
8
• Consistency Principle:
• If client c1 modifies an object's state at id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
• Read-Your-Own-Write
• Client writes to the Active NameNode
• Then reads from the Standby Node
• The read should reflect the write
[Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100; Standby NameNode at txnid = 99, catching up through the JournalNodes]
Consistency Model
9
• Consistency Principle:
• If client c1 modifies an object's state at id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
• LastSeenStateID
• Monotonically increasing id of the ANN namespace state (txnid)
• Kept on the client side: the client's most recently known ANN state
• Sent to the SbNN; the SbNN only replies after it has caught up to this state (see the sketch below)
[Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100; Standby NameNode replies only after catching up to 100 through the JournalNodes]
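A hypothetical sketch of the ordering rule: the client tracks the highest state id it has seen and sends it with each read, and the observer answers only once it has applied edits up to that id. In HDFS this is carried transparently in the RPC layer; the classes below are made up for illustration.

// Illustrative only: client-side tracking of lastSeenStateId and the
// server-side "wait until caught up" check.
import java.util.concurrent.atomic.AtomicLong;

class ClientState {
  // Highest namespace state id this client has observed from the Active NN.
  private final AtomicLong lastSeenStateId = new AtomicLong(0);

  void onWriteResponse(long activeStateId) {
    lastSeenStateId.accumulateAndGet(activeStateId, Math::max);
  }

  long stateIdToSendWithRead() {
    return lastSeenStateId.get();
  }
}

class ObserverState {
  private volatile long appliedStateId = 0;  // last edit applied from the JNs

  // An observer only answers once it has caught up to the client's state id;
  // otherwise it waits (or, as a later slide shows, rejects so the client retries).
  void awaitCaughtUp(long clientStateId) throws InterruptedException {
    while (appliedStateId < clientStateId) {
      Thread.sleep(1);  // real code would block on edit-tailing progress instead
    }
  }

  void onEditsApplied(long newStateId) {
    appliedStateId = newStateId;
  }
}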
Corner Case: Stale Reads
10
• Stale Read Cases
• Case 1: Multiple client instances
• DFSClient#1 writes to the ANN, DFSClient#2 reads from the SbNN
• DFSClient#2's state is older than DFSClient#1's, so its read is out of sync
• Case 2: Out-of-band communication
• Client#1 writes to the ANN, then informs Client#2
• Client#2 reads from the SbNN and does not see the write
[Diagrams: "Read your own writes": DFSClient#1 writes to the Active NameNode while DFSClient#2 reads from the Standby NameNode; "Third-party communication": the same topology, with Client#1 informing Client#2 out of band]
msync API
11
• Dealing with Stale Reads: FileSystem.msync()
• Syncs state between existing client instances
• Forces the DFSClient to sync up to the most recent state of the ANN
• Multiple client instances: call msync on DFSClient#2
• Out-of-band communication: Client#2 calls msync before reading (usage sketch below)
• “Always msync” mode HDFS-14211
[Diagrams: the same "Read your own writes" and "Third-party communication" scenarios as on the previous slide, now with the stale read resolved by msync]
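A minimal usage sketch of FileSystem.msync(), available since Hadoop 3.3. It assumes the client is already configured for observer reads (see the configuration slide below); the file path is made up.

// msync() forces this client to catch up to the latest Active NameNode state
// before its next read is served by an Observer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MsyncExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Suppose another client just created this file and told us about it out
    // of band. Without msync, an Observer read might not see it yet.
    fs.msync();  // sync this client's state id with the Active NameNode

    FileStatus status = fs.getFileStatus(new Path("/data/events/part-00000"));
    System.out.println("Length: " + status.getLen());
  }
}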
Robustness Optimization: Standby Node Back-off
REDIRECT WHEN TOO FAR BEHIND
• If a Standby node's state is too far behind, the client may retry another node
• e.g. the Standby node's machine is running slow
• Standby Node Back-off (sketch below)
• 1: Upon receiving a request, if the Standby node finds itself too far behind the requested state, it rejects the request by throwing a retry exception
• 2: If a request has been queued for too long and the Standby still has not caught up, the Standby rejects the request by throwing a retry exception
• Client Retry
• Upon a retry exception, the client tries a different standby node, or simply falls back to the ANN
12
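An illustrative sketch of the two back-off rules; the exception and class names are hypothetical, not the actual HDFS types.

// Illustrative only: reject when too far behind (on arrival) or when a queued
// request has waited too long without the node catching up.
class RetryException extends Exception {}

class ObserverBackoff {
  private final long maxLagTxns;        // how far behind we tolerate
  private final long maxQueueTimeMs;    // how long a request may wait

  ObserverBackoff(long maxLagTxns, long maxQueueTimeMs) {
    this.maxLagTxns = maxLagTxns;
    this.maxQueueTimeMs = maxQueueTimeMs;
  }

  // Rule 1: checked when the request arrives. appliedStateId is what this node
  // has applied; clientStateId is the lastSeenStateId the client sent.
  void checkOnArrival(long appliedStateId, long clientStateId) throws RetryException {
    if (clientStateId - appliedStateId > maxLagTxns) {
      throw new RetryException();  // too far behind: let the client go elsewhere
    }
  }

  // Rule 2: checked while the request sits in the queue waiting to be served.
  void checkWhileQueued(long appliedStateId, long clientStateId,
                        long enqueueTimeMs) throws RetryException {
    boolean stillBehind = appliedStateId < clientStateId;
    boolean waitedTooLong = System.currentTimeMillis() - enqueueTimeMs > maxQueueTimeMs;
    if (stillBehind && waitedTooLong) {
      throw new RetryException();  // client retries another standby or falls back to the ANN
    }
  }
}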
Configuration and Startup Process
13
• Configuring NameNodes
• Configure NameNodes via haadmin
• Observer mode is similar to Standby, but serves reads and does not perform checkpointing
• All NameNodes start as checkpointing Standbys; a Standby can be transitioned to Active or Observer
• Configuring the Client
• Configure the client to use ObserverReadProxyProvider (example below)
• If not, the client still works but only talks to the ANN
• ObserverReadProxyProvider will discover the state of all NNs
[State transition diagram: a checkpointing Standby can be transitioned to Active or to a read-serving Observer, and back]
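A minimal client-side configuration sketch, assuming a nameservice named "mycluster"; the nameservice id, and the rest of the HA configuration that normally lives in hdfs-site.xml, are assumptions.

// With ObserverReadProxyProvider the client sends reads to Observer nodes and
// writes to the Active; without it, the client still works but only talks to
// the Active NameNode.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ObserverReadClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider");

    FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), conf);
    System.out.println(fs.exists(new Path("/")));
  }
}

On the server side, a Standby is switched to Observer through haadmin (in 3.3.0, hdfs haadmin -transitionToObserver <nnId>).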
Current Status
14
• Test and benchmark
• With YARN application, e.g. TeraSort
• With HDFS benchmarks, e.g. DFSIO
• Run on a cluster with >100 nodes and with Kerberos and delegation token enabled
• Merged to trunk (3.3.0)
• Being backported to branch-2
• Active work on further improvement/optimization
• Has been running at Uber in production
Background
● Back in 2017, Uber’s HDFS clusters were in a bad shape
○ Rapid growth in # of jobs accessing HDFS
○ Ingestion & ad-hoc jobs co-located on the same cluster
○ Lots of listing calls on very large directories (esp. Hudi)
● HDFS traffic composition: 96% reads, 4% writes
● Presto is very sensitive to HDFS latency
○ Occupies ~20% of HDFS traffic
○ Only reads from HDFS, no writes
Implementation & Timeline
● Implementation (compared to the open source version)
○ No msync or fast edit log tailing
■ Only eventual consistency, with a max staleness of 10s
○ Observer was NOT eligible for NN failover
○ Batched edits loading to reduce NN lock time when tailing edits
● Timeline
○ 08/2017 - finished the POC and basic testing in dev clusters
○ 12/2017 - started collaborating with HDFS open source community (e.g.,
Linkedin, Paypal)
○ 12/2018 - fully rolled out to Presto in production
○ Took multiple attempts along the way
■ Disabled access time (dfs.namenode.accesstime.precision)
■ HDFS-13898, HDFS-13924
Impact
Compared to traffic going to the Active NameNode, the Observer NameNode
improves the overall throughput by ~20% (with roughly the same throughput
from Presto), while RPC queue time has dropped ~30X.
Impact (cont.)
Presto listing status call latency has dropped 8-10X after migrating to the
Observer
Next Steps
Three-Stage Scalability Plan
2X GROWTH / YEAR IN WORKLOADS AND SIZE
• Stage I. Consistent reads from standby
• Optimize for reads: 95% of all operations
• Consistent reading is a coordination problem
• Stage II. In-memory Partitioned Namespace
• Optimize write operations
• Eliminate NameNode’s global lock – fine-grained locking
• Stage III. Dynamically Distributed Namespace Service
• Linear scaling to accommodate increases in RPC load and metadata growth
HDFS-12943
20
NameNode Current State
NAMENODE’S GLOBAL LOCK – PERFORMANCE BOTTLENECK
• Three main data structures
• INodeMap: id -> INode
• BlocksMap: key -> BlockInfo
• DatanodeMap: don’t split
• GSet – an efficient HashMap
implementation
• Hash(key) -> Value
• Global lock to update INodes and
blocks
21
NameNode – FSNamesystem
INodeMap – Directory Tree
GSet: Id -> INode
BlocksMap – Block Manager
GSet: Block -> BlockInfo
DataNode Manager
Stage II. In-memory Partitioned Namespace
ELIMINATE NAMENODE’S GLOBAL LOCK
• PartitionedGSet:
• two level mapping
1. RangeMap: keyRange -> GSet
2. RangeGSet: key -> INode
• Fine-grained locking
• Individual locks per range
• Different ranges are accessed in parallel (sketch below)
22
[Diagram: NameNode with the INodeMap and BlocksMap each implemented as a Partitioned GSet (GSet-1, GSet-2, ... GSet-n), plus the DataNode Manager]
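To make the two-level idea concrete, here is a hypothetical, simplified sketch of a partitioned map with a per-range lock, so different ranges can be updated in parallel. The real PartitionedGSet and LatchLock in HDFS differ in structure and naming.

// Illustrative only: level 1 maps a key range to a partition; level 2 maps
// keys to values inside the partition, guarded by that partition's own lock.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class PartitionedMap<V> {
  private static final class Partition<V> {
    final Map<Long, V> entries = new HashMap<>();
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  }

  // Level 1: start-of-range -> partition (the RangeMap).
  private final TreeMap<Long, Partition<V>> ranges = new TreeMap<>();

  // rangeStarts must cover the whole key space, e.g. include 0.
  PartitionedMap(long[] rangeStarts) {
    for (long start : rangeStarts) {
      ranges.put(start, new Partition<>());
    }
  }

  private Partition<V> partitionFor(long key) {
    return ranges.floorEntry(key).getValue();
  }

  void put(long key, V value) {
    Partition<V> p = partitionFor(key);
    p.lock.writeLock().lock();
    try {
      p.entries.put(key, value);
    } finally {
      p.lock.writeLock().unlock();
    }
  }

  V get(long key) {
    Partition<V> p = partitionFor(key);
    p.lock.readLock().lock();
    try {
      return p.entries.get(key);
    } finally {
      p.lock.readLock().unlock();
    }
  }
}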
Stage II. In-memory Partitioned Namespace
EARLY POC RESULTS
23
• PartitionedGSet: two level mapping
• LatchLock: swap RangeMap lock for GSet locks corresponding to inode keys
• Run NNThroughputBenchmark creating 10 million directories
• 30% throughput gain
• Large batches of edits
• Why not 100%?
• Key is inodeId – incremental number generator
• Contention on the last partition
• Expect MORE
Stage III. Dynamically Distributed Namespace
SCALABLE DATA AND METADATA
• Split NameNode state into
multiple servers based on ranges
• Each NameNode
• Serves a designated range of INode keys
• Metadata in PartitionedGSet
• Can reassign certain
subranges to adjacent nodes
• Coordination Service (Ratis)
• Change ranges served by NNs
• Renames / moves, Quotas
24
[Diagram: NameNode 1, NameNode 2, ... NameNode n, each holding its own INodeMap (Part-GSet), BlocksMap (Part-GSet), and DataNode Manager]
Thank You!
Konstantin V Shvachko, Sr. Staff Software Engineer @LinkedIn
Chen Liang, Senior Software Engineer @LinkedIn
Chao Sun, Software Engineer @Uber
25
Editor's Notes
  1. State transition diagram
  2. Winter is coming!
  3. See Appendix. The SlideShare version will have more details about the Satellite Cluster configuration and operational solutions