Consistent Reads from Standby Node
Konstantin V Shvachko
Sr. Staff Software Engineer
@LinkedIn
Chen Liang
Senior Software Engineer
@LinkedIn
Chao Sun
Software Engineer
@Uber
Agenda
HDFS CONSISTENT READ FROM STANDBY
1
• Motivation
• Consistent Reads from Standby
• Challenges
• Design and Implementation
• Next steps
The Team
2
• Konstantin Shvachko (LinkedIn)
• Chen Liang (LinkedIn)
• Erik Krogen (LinkedIn)
• Chao Sun (Uber)
• Plamen Jeliazkov (Paypal)
Consistent Reads From
Standby Nodes
Motivation
4
• 2x Growth/Year In Workloads and Size
• Approaching active NameNode performance limits rapidly
• We need a scalability solution
• Key Insights:
• Reads comprise 95% of all metadata operations in our practice
• Another source of truth for read: Standby Nodes
• Standby Nodes Serving Read Requests
• Can substantially decrease active NameNode workload
• Allowing cluster to scale further!
Architecture
ROLE OF STANDBY NODES
5
[Diagram: Active NameNode and Standby NameNodes above the DataNodes; the Active writes edits to the JournalNodes and the Standbys read them; writes go to the Active, reads can go to either]
• Standby nodes have the same copy of all metadata (with some delay)
• Standby Node syncs edits from the Active NameNode
• Standby nodes can potentially serve read requests
• All reads can go to Standby nodes
• OR, time-critical applications can still choose to read from the Active only
Challenges
6
[Diagram: same topology; the Active NameNode writes edits to the JournalNodes while the Standby NameNodes read them]
• Standby Node delay
• ANN writes edits to the JNs, then the SbNN applies the edits from the JNs
• Delay is on the order of minutes
• Consistency
• If a client performs a read after a write, the client expects to see the state change
Fast Journaling
DELAY REDUCTION
7
• Fast Edit Tailing HDFS-13150
• Current JN is slow: serving whole segments of edits from disk
• Optimizations on JN and SbNN (see the sketch below)
o JN caches recent edits in memory; only applied edits are served
o SbNN requests only recent edits through RPC calls
o Falls back to the existing mechanism on error
• Significantly reduces SbNN delay
o From about 1 minute down to 2 to 50 milliseconds
• Standby node delay is no more than a few ms in most cases
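A hypothetical sketch of the serve-from-memory idea behind HDFS-13150: a bounded in-memory cache of recent edits keyed by transaction id, where a cache miss tells the Standby to fall back to reading full edit segments from disk. The class and method names are made up and only illustrate the mechanism, not the actual JournalNode code.

// Illustrative only: a simplified in-memory cache of recent edits, keyed by
// transaction id, with a bounded size.
import java.util.Optional;
import java.util.TreeMap;

class RecentEditsCache {
  private final TreeMap<Long, byte[]> editsByTxId = new TreeMap<>();
  private final int maxEntries;

  RecentEditsCache(int maxEntries) {
    this.maxEntries = maxEntries;
  }

  synchronized void put(long txId, byte[] serializedEdit) {
    editsByTxId.put(txId, serializedEdit);
    // Evict the oldest entries so memory stays bounded.
    while (editsByTxId.size() > maxEntries) {
      editsByTxId.pollFirstEntry();
    }
  }

  // Returns edits starting at sinceTxId, or empty if they have already been
  // evicted; an empty result tells the caller (SbNN) to fall back to the
  // existing mechanism of streaming full edit segments from disk.
  synchronized Optional<TreeMap<Long, byte[]>> getEditsSince(long sinceTxId) {
    if (!editsByTxId.isEmpty() && editsByTxId.firstKey() > sinceTxId) {
      return Optional.empty();  // requested edits too old: cache miss
    }
    return Optional.of(new TreeMap<>(editsByTxId.tailMap(sinceTxId, true)));
  }
}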
Consistency Model
8
• Consistency Principle:
• If client c1 modifies an object's state at id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
• Read-Your-Own-Write
• Client writes to the Active NameNode
• Then reads from the Standby Node
• The read should reflect the write
[Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100; Standby NameNode at txnid = 99, catching up through the JournalNodes]
Consistency Model
9
• Consistency Principle:
• If client c1 modifies an object's state at id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
• LastSeenStateID
• Monotonically increasing id of the ANN namespace state (txnid)
• Kept on the client side: the client's most recently known ANN state
• Sent to the SbNN; the SbNN only replies after it has caught up to this state (see the sketch below)
[Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100; Standby NameNode replies only after catching up to 100 through the JournalNodes]
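A hypothetical sketch of the ordering rule: the client tracks the highest state id it has seen and sends it with each read, and the observer answers only once it has applied edits up to that id. In HDFS this is carried transparently in the RPC layer; the classes below are made up for illustration.

// Illustrative only: client-side tracking of lastSeenStateId and the
// server-side "wait until caught up" check.
import java.util.concurrent.atomic.AtomicLong;

class ClientState {
  // Highest namespace state id this client has observed from the Active NN.
  private final AtomicLong lastSeenStateId = new AtomicLong(0);

  void onWriteResponse(long activeStateId) {
    lastSeenStateId.accumulateAndGet(activeStateId, Math::max);
  }

  long stateIdToSendWithRead() {
    return lastSeenStateId.get();
  }
}

class ObserverState {
  private volatile long appliedStateId = 0;  // last edit applied from the JNs

  // An observer only answers once it has caught up to the client's state id;
  // otherwise it waits (or, as a later slide shows, rejects so the client retries).
  void awaitCaughtUp(long clientStateId) throws InterruptedException {
    while (appliedStateId < clientStateId) {
      Thread.sleep(1);  // real code would block on edit-tailing progress instead
    }
  }

  void onEditsApplied(long newStateId) {
    appliedStateId = newStateId;
  }
}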
Corner Case: Stale Reads
10
• Stale Read Cases
• Case 1: Multiple client instances
• DFSClient#1 writes to the ANN, DFSClient#2 reads from the SbNN
• DFSClient#2's state is older than DFSClient#1's, so its read is out of sync
• Case 2: Out-of-band communication
• Client#1 writes to the ANN, then informs Client#2
• Client#2 reads from the SbNN and does not see the write
[Diagrams: "Read your own writes": DFSClient#1 writes to the Active NameNode while DFSClient#2 reads from the Standby NameNode; "Third-party communication": the same topology, with Client#1 informing Client#2 out of band]
msync API
11
• Dealing with Stale Reads: FileSystem.msync()
• Syncs state between existing client instances
• Forces the DFSClient to sync up to the most recent state of the ANN
• Multiple client instances: call msync on DFSClient#2
• Out-of-band communication: Client#2 calls msync before reading (usage sketch below)
• “Always msync” mode HDFS-14211
[Diagrams: the same "Read your own writes" and "Third-party communication" scenarios as on the previous slide, now with the stale read resolved by msync]
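A minimal usage sketch of FileSystem.msync(), available since Hadoop 3.3. It assumes the client is already configured for observer reads (see the configuration slide below); the file path is made up.

// msync() forces this client to catch up to the latest Active NameNode state
// before its next read is served by an Observer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MsyncExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Suppose another client just created this file and told us about it out
    // of band. Without msync, an Observer read might not see it yet.
    fs.msync();  // sync this client's state id with the Active NameNode

    FileStatus status = fs.getFileStatus(new Path("/data/events/part-00000"));
    System.out.println("Length: " + status.getLen());
  }
}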
Robustness Optimization: Standby Node Back-off
REDIRECT WHEN TOO FAR BEHIND
• If a Standby node's state is too far behind, the client may retry another node
• e.g. the Standby node's machine is running slow
• Standby Node Back-off (sketch below)
• 1: Upon receiving a request, if the Standby node finds itself too far behind the requested state, it rejects the request by throwing a retry exception
• 2: If a request has been queued for too long and the Standby still has not caught up, the Standby rejects the request by throwing a retry exception
• Client Retry
• Upon a retry exception, the client tries a different standby node, or simply falls back to the ANN
12
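An illustrative sketch of the two back-off rules; the exception and class names are hypothetical, not the actual HDFS types.

// Illustrative only: reject when too far behind (on arrival) or when a queued
// request has waited too long without the node catching up.
class RetryException extends Exception {}

class ObserverBackoff {
  private final long maxLagTxns;        // how far behind we tolerate
  private final long maxQueueTimeMs;    // how long a request may wait

  ObserverBackoff(long maxLagTxns, long maxQueueTimeMs) {
    this.maxLagTxns = maxLagTxns;
    this.maxQueueTimeMs = maxQueueTimeMs;
  }

  // Rule 1: checked when the request arrives. appliedStateId is what this node
  // has applied; clientStateId is the lastSeenStateId the client sent.
  void checkOnArrival(long appliedStateId, long clientStateId) throws RetryException {
    if (clientStateId - appliedStateId > maxLagTxns) {
      throw new RetryException();  // too far behind: let the client go elsewhere
    }
  }

  // Rule 2: checked while the request sits in the queue waiting to be served.
  void checkWhileQueued(long appliedStateId, long clientStateId,
                        long enqueueTimeMs) throws RetryException {
    boolean stillBehind = appliedStateId < clientStateId;
    boolean waitedTooLong = System.currentTimeMillis() - enqueueTimeMs > maxQueueTimeMs;
    if (stillBehind && waitedTooLong) {
      throw new RetryException();  // client retries another standby or falls back to the ANN
    }
  }
}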
Configuration and Startup Process
13
• Configuring NameNodes
• Configure NameNodes via haadmin
• Observer mode is similar to Standby, but serves reads and does not perform checkpointing
• All NameNodes start as checkpointing Standbys; a Standby can be transitioned to Active or Observer
• Configuring the Client
• Configure the client to use ObserverReadProxyProvider (example below)
• If not, the client still works but only talks to the ANN
• ObserverReadProxyProvider will discover the state of all NNs
[State transition diagram: a checkpointing Standby can be transitioned to Active or to a read-serving Observer, and back]
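A minimal client-side configuration sketch, assuming a nameservice named "mycluster"; the nameservice id, and the rest of the HA configuration that normally lives in hdfs-site.xml, are assumptions.

// With ObserverReadProxyProvider the client sends reads to Observer nodes and
// writes to the Active; without it, the client still works but only talks to
// the Active NameNode.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ObserverReadClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider");

    FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), conf);
    System.out.println(fs.exists(new Path("/")));
  }
}

On the server side, a Standby is switched to Observer through haadmin (in 3.3.0, hdfs haadmin -transitionToObserver <nnId>).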
Current Status
14
• Test and benchmark
• With YARN application, e.g. TeraSort
• With HDFS benchmarks, e.g. DFSIO
• Run on a cluster with >100 nodes and with Kerberos and delegation token enabled
• Merged to trunk (3.3.0)
• Being backported to branch-2
• Active work on further improvement/optimization
• Has been running at Uber in production
Background
● Back in 2017, Uber’s HDFS clusters were in a bad shape
○ Rapid growth in # of jobs accessing HDFS
○ Ingestion & ad-hoc jobs co-located on the same cluster
○ Lots of listing calls on very large directories (esp. Hudi)
● HDFS traffic composition: 96% reads, 4% writes
● Presto is very sensitive to HDFS latency
○ Occupies ~20% of HDFS traffic
○ Only reads from HDFS, no writes
Implementation & Timeline
● Implementation (compared to the open source version)
○ No msync or fast edit log tailing
■ Only eventual consistency, with a max staleness of 10s
○ Observer was NOT eligible for NN failover
○ Batched edits loading to reduce NN lock time when tailing edits
● Timeline
○ 08/2017 - finished the POC and basic testing in dev clusters
○ 12/2017 - started collaborating with HDFS open source community (e.g.,
Linkedin, Paypal)
○ 12/2018 - fully rolled out to Presto in production
○ Took multiple attempts along the way
■ Disabled access time (dfs.namenode.accesstime.precision)
■ HDFS-13898, HDFS-13924
Impact
Compared to traffic going to the Active NameNode, the Observer NameNode
improves the overall throughput by ~20% (with roughly the same throughput
from Presto), while RPC queue time has dropped ~30X.
Impact (cont.)
Presto listing status call latency has dropped 8-10X after migrating to the
Observer
Next Steps
Three-Stage Scalability Plan
2X GROWTH / YEAR IN WORKLOADS AND SIZE
• Stage I. Consistent reads from standby
• Optimize for reads: 95% of all operations
• Consistent reading is a coordination problem
• Stage II. In-memory Partitioned Namespace
• Optimize write operations
• Eliminate NameNode’s global lock – fine-grained locking
• Stage III. Dynamically Distributed Namespace Service
• Linear scaling to accommodate increases in RPC load and metadata growth
HDFS-12943
20
NameNode Current State
NAMENODE’S GLOBAL LOCK – PERFORMANCE BOTTLENECK
• Three main data structures
• INodeMap: id -> INode
• BlocksMap: key -> BlockInfo
• DatanodeMap: don’t split
• GSet – an efficient HashMap
implementation
• Hash(key) -> Value
• Global lock to update INodes and
blocks
21
NameNode – FSNamesystem
INodeMap – Directory Tree
GSet: Id -> INode
BlocksMap – Block Manager
GSet: Block -> BlockInfo
DataNode Manager
Stage II. In-memory Partitioned Namespace
ELIMINATE NAMENODE’S GLOBAL LOCK
• PartitionedGSet:
• two level mapping
1. RangeMap: keyRange -> GSet
2. RangeGSet: key -> INode
• Fine-grained locking
• Individual locks per range
• Different ranges are accessed in parallel (sketch below)
22
[Diagram: NameNode with the INodeMap and BlocksMap each implemented as a Partitioned GSet (GSet-1, GSet-2, ... GSet-n), plus the DataNode Manager]
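To make the two-level idea concrete, here is a hypothetical, simplified sketch of a partitioned map with a per-range lock, so different ranges can be updated in parallel. The real PartitionedGSet and LatchLock in HDFS differ in structure and naming.

// Illustrative only: level 1 maps a key range to a partition; level 2 maps
// keys to values inside the partition, guarded by that partition's own lock.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class PartitionedMap<V> {
  private static final class Partition<V> {
    final Map<Long, V> entries = new HashMap<>();
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  }

  // Level 1: start-of-range -> partition (the RangeMap).
  private final TreeMap<Long, Partition<V>> ranges = new TreeMap<>();

  // rangeStarts must cover the whole key space, e.g. include 0.
  PartitionedMap(long[] rangeStarts) {
    for (long start : rangeStarts) {
      ranges.put(start, new Partition<>());
    }
  }

  private Partition<V> partitionFor(long key) {
    return ranges.floorEntry(key).getValue();
  }

  void put(long key, V value) {
    Partition<V> p = partitionFor(key);
    p.lock.writeLock().lock();
    try {
      p.entries.put(key, value);
    } finally {
      p.lock.writeLock().unlock();
    }
  }

  V get(long key) {
    Partition<V> p = partitionFor(key);
    p.lock.readLock().lock();
    try {
      return p.entries.get(key);
    } finally {
      p.lock.readLock().unlock();
    }
  }
}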
Stage II. In-memory Partitioned Namespace
EARLY POC RESULTS
23
• PartitionedGSet: two level mapping
• LatchLock: swap RangeMap lock for GSet locks corresponding to inode keys
• Run NNThroughputBenchmark creating 10 million directories
• 30% throughput gain
• Large batches of edits
• Why not 100%?
• Key is inodeId – incremental number generator
• Contention on the last partition
• Expect MORE
Stage III. Dynamically Distributed Namespace
SCALABLE DATA AND METADATA
• Split NameNode state into
multiple servers based on ranges
• Each NameNode
• Serves a designated range of INode keys
• Metadata in PartitionedGSet
• Can reassign certain
subranges to adjacent nodes
• Coordination Service (Ratis)
• Change ranges served by NNs
• Renames / moves, Quotas
24
[Diagram: NameNode 1, NameNode 2, ... NameNode n, each holding its own INodeMap (Part-GSet), BlocksMap (Part-GSet), and DataNode Manager]
Thank You!
Konstantin V Shvachko, Sr. Staff Software Engineer @LinkedIn
Chen Liang, Senior Software Engineer @LinkedIn
Chao Sun, Software Engineer @Uber
25
Editor's Notes
  1. State transition diagram
  2. Winter is coming!
  3. See Appendix. The SlideShare version will have more details about the Satellite Cluster configuration and operational solutions