Scaling HDFS to Manage Billions of Files with Key-Value Stores

Scaling HDFS to Manage Billions of Files
Haohui Mai, Jing Zhao

Hortonworks, Inc.

About the speakers
• Haohui Mai
• Active committer and PMC member in Hadoop
• Ph.D. in Computer Science from UIUC in 2013
• Joined the HDFS team in Hortonworks
• 250+ commits in Hadoop

About the speakers
• Jing Zhao
• Active committer and PMC member in Hadoop
• Ph.D. in Computer Science from USC in 2012
• HDFS team member in Hortonworks
• 250+ commits in Hadoop

Past: the scale of data
• In 2007 (PC)
• ~500 GB hard drives
• thousands of files
• In 2007 (Hadoop)
• several hundred nodes
• several hundred TBs
• millions of files
• In 2015
• 4,000+ nodes (10x)
• 150+ PBs (1000x)
• 400M+ files (100x)

Present: a generic storage system

• SQL-On-Hadoop
• Machine learning
• Real-time analytics
• Data streaming
• File archives, NFS…
• From MR-centric filesystem to a generic distributed storage system

Future: Billions of files in HDFS
• HDFS clusters continue to grow
• New use cases emerge
• IoT, time series data…
• Files are natural abstractions of data
• Few big files → many small files in HDFS
• Billions of files in a few years

NameNode limits the scale
• Master / slave architecture
• All metadata in NN, data across multiple DNs
• Simple and robust
• Does not scale beyond the size of the NN heap
• 400M files ~ 128G heap
• GC pauses
[Figure: one NameNode (NN) on top of three DataNodes (DN)]
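
The heap figure on this slide implies a per-file cost that can be worked out directly. The sketch below does that arithmetic under two assumptions that are not in the slides: that "128G" means 128 GiB, and that heap use grows linearly with the number of files. The function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Numbers from the slide: "400M files ~ 128G heap".
// Assumption: 128G means 128 GiB. The figure also covers block metadata,
// so this is heap per file, not the size of a single inode object.
constexpr double kHeapBytes = 128.0 * 1024 * 1024 * 1024;
constexpr double kFiles = 400e6;

// Bytes of NameNode heap consumed per file.
constexpr double heap_bytes_per_file() { return kHeapBytes / kFiles; }

// Linear extrapolation (an assumption): heap in GiB needed for `files` files.
constexpr double projected_heap_gib(double files) {
  return files * heap_bytes_per_file() / (1024.0 * 1024 * 1024);
}
```

Under these assumptions each file costs roughly 344 bytes of heap, so one billion files would need about 320 GiB in a single JVM heap.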

Next-gen arch: HDFS on top of KV stores
• Namespace (NS) on top of Key-Value (KV) stores
• Storing the NS into LevelDB
• Working set fits in memory, cold metadata on disks
• Match the usage patterns of HDFS
• Low adoption cost: fully compatible

• Introduction
• Namespace on top of KV stores
• Evaluation
• Future work & conclusions

Encoding the NS as KV pairs
• inode_id → flat binary representation of inode
• avoid serialization costs
• <pid,foo> → the inode id of the child foo whose parent’s inode id is pid
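#include <stdint.h>

/* Flat, fixed-layout inode record stored as the value for key inode_id. */
struct inode {
  uint64_t inode_id;   /* id of this inode */
  uint64_t parent_id;  /* inode id of the parent directory */
  uint32_t flags;
  …
};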
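
The slides do not say how the two kinds of keys are laid out as bytes. The sketch below shows one possible encoding; the one-byte tags `'I'` and `'C'`, the big-endian ids, and the helper names are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append a 64-bit id in big-endian order so that byte-wise key comparison
// (LevelDB's default comparator) agrees with the numeric order of the ids.
inline void append_be64(std::string& out, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((v >> shift) & 0xff));
  }
}

// Key for "inode_id -> inode record". The 'I' tag is a hypothetical prefix.
inline std::string inode_key(uint64_t inode_id) {
  std::string key(1, 'I');
  append_be64(key, inode_id);
  return key;
}

// Key for "<pid,name> -> child inode id". The 'C' tag is hypothetical too.
inline std::string child_key(uint64_t pid, const std::string& name) {
  std::string key(1, 'C');
  append_be64(key, pid);
  key += name;
  return key;
}
```

With this layout all children of one directory share the same 9-byte prefix, so they sit next to each other in key order and can be read with a single range scan.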

Encoding directories
[Figure: directory tree in which foo is a child of root and bar is a child of foo]
Key      Value
1        root
2        foo
3        bar
<1,foo>  2
<2,bar>  3

Resolving paths
Resolving / foo / bar against the table in "Encoding directories":
• Start at the root: ID: 1
• Look up <1,foo> → ID: 2
• Look up <2,bar> → ID: 3
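
The walk-through above can be sketched as a loop of point lookups. This is a minimal illustration, not the HDFS implementation: a `std::map` stands in for the KV store, and `ChildIndex`, `resolve`, `example_index`, and the `0` sentinel for "not found" are hypothetical names and choices.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Stand-in for the KV store: <parent id, child name> -> child inode id.
using ChildIndex = std::map<std::pair<uint64_t, std::string>, uint64_t>;

constexpr uint64_t kRootId = 1;     // the root has ID 1 in the example table
constexpr uint64_t kInvalidId = 0;  // hypothetical "not found" sentinel

// Resolve an absolute path one component at a time; each component costs
// one point lookup. Returns kInvalidId if any component is missing.
inline uint64_t resolve(const ChildIndex& kv, const std::string& path) {
  uint64_t id = kRootId;
  std::istringstream components(path);
  std::string name;
  while (std::getline(components, name, '/')) {
    if (name.empty()) continue;  // skip the leading slash and repeated slashes
    auto it = kv.find({id, name});
    if (it == kv.end()) return kInvalidId;
    id = it->second;
  }
  return id;
}

// The <pid,name> rows of the example table from the slides.
inline ChildIndex example_index() {
  return {{{1, "foo"}, 2}, {{2, "bar"}, 3}};
}
```

Calling `resolve(example_index(), "/foo/bar")` performs the lookups `<1,foo>` and `<2,bar>` in turn and yields 3, matching the slide.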

Listing directories
Running $ ls / against the table in "Encoding directories":
• Resolve / → ID: 1
• Scan all keys matching <1,*>
• Result: foo
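
The `<1,*>` scan maps naturally onto an ordered range scan. The sketch below again uses a `std::map` as a stand-in for the KV store; `ChildIndex` and `list_children` are hypothetical names, not HDFS or LevelDB APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the KV store: <parent id, child name> -> child inode id.
using ChildIndex = std::map<std::pair<uint64_t, std::string>, uint64_t>;

// List the children of directory `dir_id` with one ordered range scan:
// seek to the smallest possible key <dir_id, ""> and walk forward until
// the parent id changes.
inline std::vector<std::string> list_children(const ChildIndex& kv,
                                              uint64_t dir_id) {
  std::vector<std::string> names;
  for (auto it = kv.lower_bound({dir_id, std::string()});
       it != kv.end() && it->first.first == dir_id; ++it) {
    names.push_back(it->first.second);
  }
  return names;
}
```

On an ordered store such as LevelDB the same pattern would presumably be an iterator `Seek()` to the directory's key prefix followed by `Next()` calls while the prefix still matches.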

Integrate with existing HDFS features

• HDFS snapshots
• Metadata only operations
• Append version ids for each key
• Map between snapshot ids and version ids
• NameNode High Availability (HA)
• Use edit logs instead of the WAL of the KV stores to persist operations
• Minimal changes in the current HA mechanisms
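
The snapshot bullets above describe versioned keys plus a map from snapshot ids to version ids. The slides give no further detail, so the following is only one way such a scheme could work: every write gets a fresh version id, a snapshot records the current version id, and a read at a snapshot returns the newest version written no later than it. `VersionedStore` and its methods are hypothetical, and deletes are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical versioned store: <key, version id> -> value.
class VersionedStore {
 public:
  // Write a new version of `key`; older versions stay readable.
  void put(const std::string& key, const std::string& value) {
    data_[{key, ++current_version_}] = value;
  }

  // Taking a snapshot only records the current version id (metadata only).
  uint64_t create_snapshot() {
    snapshots_[++last_snapshot_id_] = current_version_;
    return last_snapshot_id_;
  }

  // Read `key` as of `snapshot_id`: the newest version written no later
  // than the snapshot. Returns "" if the key or the snapshot is unknown.
  std::string get(const std::string& key, uint64_t snapshot_id) const {
    auto snap = snapshots_.find(snapshot_id);
    if (snap == snapshots_.end()) return std::string();
    // First entry strictly after <key, snapshot version>; step back one.
    auto it = data_.upper_bound({key, snap->second});
    if (it == data_.begin()) return std::string();
    --it;
    if (it->first.first != key) return std::string();
    return it->second;
  }

 private:
  std::map<std::pair<std::string, uint64_t>, std::string> data_;
  std::map<uint64_t, uint64_t> snapshots_;  // snapshot id -> version id
  uint64_t current_version_ = 0;
  uint64_t last_snapshot_id_ = 0;
};
```

Because a snapshot is just one entry in the snapshot map, creating it copies no namespace data, which fits the "metadata only operations" bullet.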

Current status
• Phase I: NS on top of KV interfaces (HDFS-8286)
• NS on top of an in-memory KV store
• Under active development
• Phase II: partial NS in memory
• Working set of the NS in memory, cold metadata on disks
• Scaling the NS beyond the size of the heap

Evaluation
• 4-node cluster, connected over 10GbE
• 2 × 6-core Intel Xeon E5-2630 @ 2.3 GHz, 64 GB DDR3 @ 1333 MHz
• WDC 1TB disks @ 7200 RPM, 64MB cache
• OpenJDK 1.8.0_25

Evaluation (cont.)
• 2.7.0: Apache Hadoop 2.7.0
• InMem: NS on top of an in-memory KV map
• LevelDB: NS on top of LevelDB

NNThroughput (write)
• 300,000 operations
• 8 threads
• Comparable performance
• Simpler implementation on delete()
• Syncing edit logs is the bottleneck
[Chart: throughput (ops/s, axis 0 to 500) of create, mkdirs, delete, and rename for 2.7.0, InMem, and LevelDB]

NNThroughput (read)
• Read-only operations
• Vanilla LevelDB reaches 1/3 of the throughput of 2.7.0
• Contention on the global lock in LevelDB during get()
• A lock-free fast path of get() recovers the performance (LevelDB-opt)
[Chart: throughput (ops/s, axis 0 to 200,000) of open and fileStatus for 2.7.0, InMem, LevelDB, and LevelDB-opt]

YCSB: Throughput
• YCSB against HBase 1.0.1.1
• Enabled short-circuit reads
• 100 threads, 10M records
[Chart: throughput (ops/s, axis 0 to 120,000) of YCSB workloads A, B, C, F, D, and E for 2.7.0, InMem, and LevelDB]

YCSB: Latency
[Chart: latency (us, axis 0 to 10,000) of A-Read, B-Read, C-Read, F-RMW, F-Read, and E-Scan for 2.7.0, InMem, and LevelDB]

TeraSort
• 10G data per node
• Comparable performance
[Chart: runtime (s, axis 0 to 1,400) of TeraGen, TeraSort, and TeraValidate for 2.7.0, InMem, and LevelDB]

Impact of working set size
• Throughput of GetFileStatus() under different sizes of working sets
[Chart: throughput (ops/s, axis 0 to 130,000) for working sets of 256M, 128M, 64M, and 32M]

Future work
• Implementation and stabilization
• Operational concerns
• Compaction and fsck
• Failure recovery
• Cold startup

Conclusions
• HDFS needs to continue to scale
• Evolve HDFS towards KV-based architecture
• Scaling beyond the size of the NN heap
• Preliminary evaluation looks promising

Acknowledgement
• Xiao Lin, interned with Hortonworks in 2013
• PoC implementation of LevelDB backed namespace
• Zhilei Xu, interned with Hortonworks in 2014
• Integration between various HDFS features and LevelDB
• Performance tuning

Integrating with LevelDB
Write operations in HDFS
• Updating LevelDB inside the global lock
• New LevelDB::Write() w/o blocking calls
• Write edit log
• logSync()
Pruning edit logs
• Dump memtable into the disks
• Update MANIFEST
• Prune edit logs