Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available NameNode of HDFS

Hooman Peiro Sajjad
Kamal Hakimzadeh
Supervisor: Dr. Jim Dowling

Outline
● Introduction & Background
● Problem Definition
● Solution
● Implementation
● Evaluation
● Future Work

Avro
Chukwa
HBase
Mahout
ZooKeeper

What is HDFS
Commodity hardware
Big file system
Relaxed POSIX
High Throughput
Large data sets

Limitations
10 PB capacity, 10,000 DataNodes, 100,000 clients, 100 million files
21 PB capacity, 2000 DataNodes, 30,000 clients
14 PB capacity, 4000 DataNodes, 15,000 clients

HDFS Operations
Filesystem Operations (25)
cat, cp, ls ...
Primitive Operations (70)
startFile
getAdditionalBlock
blockReceivedAndDeleted

KTHFS
More Storage
Goals
Scaling out NameNode
Highly Available NameNode
Preserving the HDFS API's semantics

KTHFS - Architecture
Transactions

Why MySQL Cluster?
No single point of failure
99.999% availability
Horizontally scalable
High throughput
Real-time transactions

First KTHFS Implementation
1. Multiple NameNodes are inconsistent
2. System-level lock is a throughput bottleneck
3. Unreasonably excessive round trips

HDFS is strongly consistent
Blocks are replicated completely
Metadata protected by system-level lock
HDFS Consistency Model

Eliminating System Level Lock
No more lock in the NameNode
MySQL Cluster: read-committed isolation level
MySQL Cluster: supports locks

Fuzzy and Phantom Read
Fuzzy read
Phantom read
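
The two anomalies can be made concrete with a toy example. The sketch below is only an illustration: the dictionary stands in for a committed table, and the helper names (`read_row`, `read_matching`) and the data are hypothetical, not MySQL Cluster code.

```python
# Toy illustration of the two read anomalies. The dictionary stands in for a
# committed table; names and data are hypothetical.
committed = {"foo.flv": 3}          # file name -> block count


def read_row(name):
    """Read one committed row, as a read-committed transaction would."""
    return committed.get(name)


def read_matching(prefix):
    """Read the keys of all committed rows that match a predicate."""
    return sorted(k for k in committed if k.startswith(prefix))


# Fuzzy (non-repeatable) read: t1 reads the same row twice while another
# transaction commits an update in between.
first = read_row("foo.flv")
committed["foo.flv"] = 4            # t2 commits an update
second = read_row("foo.flv")
fuzzy_read = first != second

# Phantom read: t1 repeats the same predicate query while another
# transaction commits an insert that matches the predicate.
before = read_matching("foo")
committed["foo2.flv"] = 1           # t2 commits an insert
after = read_matching("foo")
phantom_read = before != after

print(fuzzy_read, phantom_read)
```

In both cases every individual read returns committed data, which is why read-committed isolation alone does not rule them out.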

MySQL Cluster isolation level is READ_COMMITTED
Isolation level                        Dirty read    Fuzzy read    Phantom read
Read uncommitted                       Possible      Possible      Possible
Read committed                         Not possible  Possible      Possible
Repeatable read (Snapshot isolation)   Not possible  Not possible  Possible
Serializable                           Not possible  Not possible  Not possible

Snapshot Isolation
-----------------------------------------------------------------------------------------------------------------------
Algorithm: Snapshot-isolation schema
-----------------------------------------------------------------------------------------------------------------------
initially: snapshot.clear;
operation doOperation
    tx.begin
    snapshotting()
    performTask()
    tx.commit
operation snapshotting
    foreach x in op do
        snapshot <- tx.find(x.query)
operation performTask
    // Operation body, referring to cache for data
-----------------------------------------------------------------------------------------------------------------------
Consistent snapshot of data
Commit if no conflicting updates
No fuzzy read
Prevent modification conflict:
Optimistic
Pessimistic
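
A minimal sketch of the schema above, assuming a plain dictionary as the store. The `Transaction` class and `do_operation` function are hypothetical stand-ins, and conflict detection at commit time is left out on purpose.

```python
class Transaction:
    """Hypothetical stand-in for a database transaction over a dict store."""

    def __init__(self, store):
        self.store = store
        self.writes = {}

    def find(self, key):
        return self.store.get(key)

    def commit(self):
        self.store.update(self.writes)


def do_operation(store, keys, perform_task):
    tx = Transaction(store)                     # tx.begin
    snapshot = {k: tx.find(k) for k in keys}    # snapshotting()
    result = perform_task(snapshot, tx)         # performTask(): reads the snapshot only
    tx.commit()                                 # tx.commit
    return result


store = {"/d1/foo": 3}


def count_twice(snapshot, tx):
    first = snapshot["/d1/foo"]
    store["/d1/foo"] = 99          # simulated commit by another transaction
    second = snapshot["/d1/foo"]   # still served from the snapshot
    return first, second


result = do_operation(store, ["/d1/foo"], count_twice)
print(result)
```

Because the operation body reads only from the snapshot taken at the start, repeated reads inside one operation agree with each other even though the underlying store has changed.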

Row level locking
-----------------------------------------------------------------------------------------------------------------------
Algorithm: Snapshot-isolation with row-level lock schema
-----------------------------------------------------------------------------------------------------------------------
operation doOperation
    tx.begin
    snapshotting()
    performTask()
    tx.commit
operation snapshotting
    foreach x in op do
        tx.lockLevel(x.lockType)
        snapshot <- tx.find(x.query)
-----------------------------------------------------------------------------------------------------------------------
Conflict prevention instead of resolution
Supported by MySQL Cluster
Lock level affects parallelization factor
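
Why the lock level affects parallelism can be sketched with a small lock table. This is an illustration under assumptions, not MySQL Cluster's lock manager: `RowLockTable` is hypothetical, and it uses a non-blocking `try_lock` where a real database would make the caller wait.

```python
READ, WRITE = "read", "write"


class RowLockTable:
    """Hypothetical in-memory lock table: row -> list of (tx, lock type)."""

    def __init__(self):
        self.held = {}

    def try_lock(self, tx, row, lock_type):
        holders = [(t, l) for t, l in self.held.get(row, []) if t != tx]
        # Read locks are shared with other readers; a write lock is only
        # granted when no other transaction holds any lock on the row.
        compatible = all(l == READ for _, l in holders) and (
            lock_type == READ or not holders
        )
        if compatible:
            self.held.setdefault(row, []).append((tx, lock_type))
        return compatible

    def release_all(self, tx):
        for row in list(self.held):
            self.held[row] = [(t, l) for t, l in self.held[row] if t != tx]


locks = RowLockTable()
granted = [
    locks.try_lock("t1", "/d1", READ),    # first reader
    locks.try_lock("t2", "/d1", READ),    # second reader shares the row
    locks.try_lock("t3", "/d1", WRITE),   # writer must wait for the readers
    locks.try_lock("t3", "/d2", WRITE),   # a different row is independent
    locks.try_lock("t1", "/d2", READ),    # reader is blocked by the writer
]
locks.release_all("t3")
after_release = locks.try_lock("t1", "/d2", READ)
print(granted, after_release)
```

Transactions that take read locks on the same row proceed in parallel, while a write lock serializes access to that row only; rows that are not shared are unaffected.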

Maintaining HDFS Semantics
Does Snapshot + Lock ensure correctness of all HDFS operations? No.
Independent mutations can still be semantically incorrect!

Parent Lock
● Semantically related objects
e.g. "/d1/d2" has quota limit = 1
t1                               t2
checkQuota("/d1/d2")             checkQuota("/d1/d2")
addInode("/d1/d2", "foo.mp3")    addInode("/d1/d2", "bar.mp4")
Exceeded quota limit!
● Phantom read
t1                               t2
countBlocks("/d1/d2/foo.flv")
                                 addBlock("/d1/d2/foo.flv")
....
countBlocks("/d1/d2/foo.flv")
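
The quota example can be sketched with threads, assuming a per-directory mutex as a stand-in for a write lock on the parent's row. The `Directory` class and `add_inode_with_parent_lock` are hypothetical names used only for illustration.

```python
import threading


class Directory:
    def __init__(self, quota):
        self.quota = quota
        self.children = []
        self.lock = threading.Lock()   # stands in for a write lock on the parent row


def add_inode_with_parent_lock(directory, name):
    """Check the quota and add the inode while holding the parent's lock."""
    with directory.lock:
        if len(directory.children) >= directory.quota:   # checkQuota
            return False
        directory.children.append(name)                  # addInode
        return True


d2 = Directory(quota=1)
results = {}


def worker(name):
    results[name] = add_inode_with_parent_lock(d2, name)


threads = [threading.Thread(target=worker, args=(n,)) for n in ("foo.mp3", "bar.mp4")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results, d2.children)
```

Whichever transaction gets the parent lock first adds its inode; the other sees the updated child count and is rejected, so the quota of 1 holds. Without the lock around both steps, both checks could pass before either insert.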

Deadlock
1. Conflicting lock order
2. Lock upgrade

Total Order Locking
Total order rule:
Notations:
X = {x | x is a metadata object}
R = {r | r is a read operation}
W = {w | w is a write operation}
Serialization rule:

Complete Locking Solution
-----------------------------------------------------------------------------------------------------------------------
Algorithm: Snapshot-isolation with total ordered row-level lock schema
-----------------------------------------------------------------------------------------------------------------------
operation doOperation
    tx.begin
    snapshotting()
    performTask()
    tx.commit
operation snapshotting
    S = total_order_sort(op.X)
    foreach x in S do
        if x is a parent then level = x.parent_level_lock
        else level = x.strongest_lock_type
        tx.lockLevel(level)
        snapshot <- tx.find(x.query)
-----------------------------------------------------------------------------------------------------------------------
Conflicting orders -> Total Order Locking
Lock upgrade -> Acquire strongest required lock-level
Semantically related -> Parent Lock
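
A minimal sketch of the first two remedies: sorting lock requests into one global order and taking the strongest level per object up front. The ordering used here (shallower paths first, ties broken lexicographically) and the `lock_plan` function are assumptions for illustration; the parent-lock step is omitted.

```python
READ, WRITE = 1, 2   # larger value = stronger lock


def lock_plan(requests):
    """Build the lock acquisition plan for one operation.

    requests: iterable of (path, lock level) pairs the operation needs.
    Returns a list of (path, level) in a single global order, with one
    entry per path carrying the strongest level requested for it.
    """
    strongest = {}
    for path, level in requests:
        strongest[path] = max(level, strongest.get(path, level))
    # Assumed total order: shallower paths first, ties broken lexicographically.
    order = sorted(strongest, key=lambda p: (p.rstrip("/").count("/"), p))
    return [(p, strongest[p]) for p in order]


plan_a = lock_plan([("/d1/d2", READ), ("/", READ), ("/d1/d2", WRITE), ("/d1", READ)])
plan_b = lock_plan([("/d1", READ), ("/d1/d2", WRITE), ("/", READ)])
print(plan_a)
```

Two operations that touch the same objects always request them in the same sequence, so neither can hold a lock the other needs while waiting for one the other holds. Taking the write lock on "/d1/d2" at first acquisition avoids a later read-to-write upgrade.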

Total order of NameNode metadata
Step    Metadata Objects
1       Directory#1 (root)
2       Directory#2
...     ...
n       Directory#n
n + 1   File
n +     Block-Infos, Leases

Operations Implemented as Multiple Transactions
Phase             Step  Metadata Objects
t1: No Lock       1     given-blocks
t2: Basic-Order   1     file
t2: Basic-Order   2     block
t2: Basic-Order   3     replicas, corrupted-replicas, excess-replicas, under-replicated-block, pending-block, replicas-under-construction, invalidated-blocks

Safety and Liveness
1. Single-transaction operations
Safety: fine-grained serialization
Liveness: total-order locking + no lock upgrade
2. Multi-transaction operations
Safety: no dependencies within the group of metadata + no mutation in the 1st transaction + validation in the 2nd transaction
Liveness: limited number of retries
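
The multi-transaction pattern can be sketched as a read-only first phase, a validating second phase, and a bounded retry loop. Everything here is hypothetical: the slides do not say how validation is done, so the version counter, `run_two_phase`, and `ValidationFailed` are assumptions made for the example.

```python
class ValidationFailed(Exception):
    pass


def run_two_phase(read_phase, write_phase, max_retries=3):
    """Run a read-only first transaction, then a validating second one.

    read_phase() returns the observed data and must not mutate anything.
    write_phase(observed) re-checks the data and raises ValidationFailed
    if it changed between the two transactions.
    """
    for attempt in range(1, max_retries + 1):
        observed = read_phase()
        try:
            return write_phase(observed), attempt
        except ValidationFailed:
            continue
    raise RuntimeError("gave up after %d attempts" % max_retries)


store = {"blocks": ["b1"], "version": 1}
interference = [True]            # one simulated concurrent update


def read_phase():
    return store["version"], list(store["blocks"])


def write_phase(observed):
    version, blocks = observed
    if interference:
        interference.pop()
        store["version"] += 1    # another transaction commits in between
    if store["version"] != version:
        raise ValidationFailed()
    store["blocks"] = blocks + ["b2"]
    store["version"] += 1
    return store["blocks"]


result, attempts = run_two_phase(read_phase, write_phase)
print(result, attempts)
```

The first transaction never mutates state, so a failed validation leaves nothing to undo, and the retry bound keeps the operation from looping forever under contention.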

Transaction Context as Snapshot

System-level vs. Row-level locking

Future Work...
Relaxing Locks
Inter transaction cache
Optimistic Transaction Conflict Resolution
