REMOVING THE NAMENODE'S MEMORY LIMITATION
Lin Xiao
Intern@Hortonworks
PhD student @ Carnegie Mellon University
8/22/2013
About Me: Lin Xiao
• PhD student at CMU
• Advisor: Garth Gibson
• Thesis area – scalable distributed file systems
• Intern at Hortonworks
• Intern project: removing the Namenode memory limitation
• Email: lxiao+@cs.cmu.edu
Big Data
• We create 2.5 × 10^18 bytes of data per day [IBM]
• Sloan Digital Sky Survey: 200GB/night
• Facebook: 240 billion photos as of Jan 2013
• 250 million photos uploaded daily
• Cloud storage
• Amazon: 2 trillion objects, peak 1.1 million ops/sec
• Need scalable storage systems
• Scalable metadata <- focus of this presentation
• Scalable storage
• Scalable IO
Scalable Storage Systems
• Separate data and metadata servers
• More data nodes for higher throughput & capacity
• Bulk of the work – the IO path – is done by data servers
• Not much work added to metadata servers?
Federated HDFS
• Namenodes (MDS) each see their own namespace (NS)
• Each datanode can serve all namenodes
[Diagram: federated HDFS – multiple Namenodes, each owning its own namespace and block pool, with a shared set of Datanodes serving all of them.]
Single Namenode
• Stores all metadata in memory
• Design is simple
• Provides low-latency, high-throughput metadata operations
• Supports up to 3K data servers
• Hadoop clusters make it affordable to store old data
• Cold data is stored in the cluster for a long time
• Takes up memory space but is rarely used
• Growth of data size can exceed throughput
• Goal: remove the space limit while maintaining similar performance
Metadata in Namenode
• Namespace
• Stored as a linked tree structure of inodes (simplified sketch below)
• Every operation traverses the tree from the top
• Blocks Map: block_id to location mapping
• Handled separately because of the huge number of blocks
• Datanode status
• IP address, capacity, load, heartbeat status, block report status
• Leases
• The Namespace and Blocks Map use the majority of memory
• This talk will focus on the Namespace
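A minimal sketch of the linked inode tree described above, assuming illustrative names (SimpleINode, resolve) rather than HDFS's actual INode classes; it only shows how every lookup walks the tree from the top.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative only: not the real HDFS INode/INodeDirectory classes.
    class SimpleINode {
        final String name;
        final boolean isDirectory;
        final Map<String, SimpleINode> children = new TreeMap<>(); // empty for files

        SimpleINode(String name, boolean isDirectory) {
            this.name = name;
            this.isDirectory = isDirectory;
        }

        // Every namespace operation resolves the path from the root, one component at a time.
        static SimpleINode resolve(SimpleINode root, String path) {
            SimpleINode current = root;
            for (String component : path.split("/")) {
                if (component.isEmpty()) continue;            // skip leading "/"
                if (current == null || !current.isDirectory) return null;
                current = current.children.get(component);
            }
            return current;                                   // null if any component is missing
        }

        public static void main(String[] args) {
            SimpleINode root = new SimpleINode("", true);
            SimpleINode user = new SimpleINode("user", true);
            root.children.put("user", user);
            user.children.put("data.txt", new SimpleINode("data.txt", false));
            System.out.println(resolve(root, "/user/data.txt") != null);  // prints true
        }
    }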
Problem and Proposed Solution
• Problem:
• Remove the namespace limit while maintaining similar performance when the working set fits in memory
• Solution
• Retain the same namespace tree structure
• Store the namespace in a persistent store using an LSM tree (LevelDB)
• No separate edit logs or checkpoints
• All inodes and their updates are persisted via LevelDB
• Fast startup, at the cost of slow initial operations
• Could prefetch inodes into memory
• Do not expect customers to drastically reduce the actual heap size
• A larger heap eases transitions between different working sets as applications and workloads change
• A customer may occasionally run queries against cold data
New Namenode Architecture
• Namespace
• Same as before, but only part of the tree is in memory
• On a cache miss, read from LevelDB
• Edit logs and checkpoints are replaced by LevelDB
• LevelDB is updated on every inode change
• Key: <parent_inode_number + name> (see the key-encoding sketch below)
[Diagram: old architecture – Namenode keeps all inodes in memory and persists changes to edit logs; new architecture – Namenode caches a subset of inodes, with LevelDB (write buffer + WAL + sorted tables) as the persistent namespace store.]
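A hedged sketch of the <parent_inode_number + name> key scheme and the per-change LevelDB update. It assumes the iq80 pure-Java LevelDB binding and a simple 8-byte-prefix encoding; the class and method names (NamespaceStoreSketch, writeInode) are illustrative, not the prototype's code.

    import java.io.File;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import org.iq80.leveldb.DB;
    import org.iq80.leveldb.Options;
    import org.iq80.leveldb.impl.Iq80DBFactory;

    // Illustrative namespace store: every inode create/update becomes a LevelDB put.
    public class NamespaceStoreSketch {
        private final DB db;

        public NamespaceStoreSketch(File dir) throws IOException {
            db = Iq80DBFactory.factory.open(dir, new Options().createIfMissing(true));
        }

        // 8-byte big-endian parent inode number followed by the child's name.
        static byte[] key(long parentInode, String childName) {
            byte[] name = childName.getBytes(StandardCharsets.UTF_8);
            return ByteBuffer.allocate(8 + name.length).putLong(parentInode).put(name).array();
        }

        // Called on every inode change; the value is the serialized inode (format not shown).
        public void writeInode(long parentInode, String childName, byte[] serializedInode) {
            db.put(key(parentInode, childName), serializedInode);
        }

        public byte[] readInode(long parentInode, String childName) {
            return db.get(key(parentInode, childName));   // null on a miss
        }

        public void close() throws IOException {
            db.close();
        }
    }

Because the key sorts on the parent inode number first, all children of one directory are contiguous in the key space, which lets a cache miss on a directory be served by a single range scan.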
Comparison w/Traditional FileSystem
• Traditional File Systems
• VFS layer keeps inode and directory entry cache
• Goal is to support the workload of a single machine
• Relatively large number of files
• Supports applications from a single machine or, in the case of NFS, from a larger number of client machines
• Much smaller workload and size compared to Hadoop use cases
• LevelDB based Namenode
• Supports the very heavy traffic of a Hadoop cluster
• Keeps a much larger number of inodes in memory
• Cache replacement policies suited to the Hadoop workload
• Data is in Datanodes
LevelDB
• A fast key-value storage library written at Google
• Basic operations: get, put, delete
• Concurrency: single process w/multiple threads
• By default, writes are asynchronous
• As long as the machine doesn’t crash, data is safe
• Supports synchronous writes
• No separate sync() operation
• Can be implemented with a synchronous write/delete (see the sketch below)
• Supports batch updates
• Data is automatically compressed using Snappy
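A small sketch of the LevelDB behaviors listed above (asynchronous default, synchronous writes via WriteOptions, atomic batches), again assuming the iq80 Java binding; the keys and values are placeholders.

    import java.io.File;
    import java.io.IOException;

    import org.iq80.leveldb.DB;
    import org.iq80.leveldb.Options;
    import org.iq80.leveldb.WriteBatch;
    import org.iq80.leveldb.WriteOptions;
    import org.iq80.leveldb.impl.Iq80DBFactory;

    public class LevelDbBasics {
        public static void main(String[] args) throws IOException {
            DB db = Iq80DBFactory.factory.open(new File("demo-db"),
                                               new Options().createIfMissing(true));
            try {
                byte[] k = Iq80DBFactory.bytes("inode:1:foo");
                byte[] v = Iq80DBFactory.bytes("serialized-inode");

                db.put(k, v);                                   // asynchronous by default
                // No separate sync(): a synchronous write (or delete) plays that role.
                db.put(k, v, new WriteOptions().sync(true));

                // Batch several updates so they are applied atomically.
                WriteBatch batch = db.createWriteBatch();
                try {
                    batch.put(Iq80DBFactory.bytes("inode:1:bar"), v);
                    batch.delete(k);                            // delete applied in the same batch
                    db.write(batch, new WriteOptions().sync(true));
                } finally {
                    batch.close();
                }

                System.out.println(db.get(k) == null);          // true: deleted in the batch
            } finally {
                db.close();
            }
        }
    }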
Cache Replacement Policy
• Only whole directories are replaced in or out
• Hot dirs stay entirely in cache; others require a LevelDB scan
• Future – don’t cache very large dirs?
• No need to read from disk to check file existence
• LRU replacement policy
• Uses CLOCK to approximate LRU and reduce cost (sketch below)
• Separate thread for cache replacement
• Starts replacement when a threshold is exceeded
• Moves eviction out of sessions that hold the lock
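A sketch of CLOCK-approximated LRU at whole-directory granularity, run by a separate monitor thread once the threshold is crossed; all names (DirectoryClockCache, sweep) are illustrative assumptions, not the prototype's classes.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative CLOCK cache over whole directories; not the prototype's code.
    class DirectoryClockCache {
        static final class CachedDir {
            final long inodeNumber;
            volatile boolean referenced = true;      // set on access, cleared by the sweep
            CachedDir(long inodeNumber) { this.inodeNumber = inodeNumber; }
        }

        private final Deque<CachedDir> clock = new ArrayDeque<>();
        private final int capacity;

        DirectoryClockCache(int capacity) { this.capacity = capacity; }

        synchronized void add(CachedDir dir) { clock.addLast(dir); }

        void recordAccess(CachedDir dir) { dir.referenced = true; }

        // Run by a separate monitor thread (e.g. when the cache is 90% full), so
        // eviction happens outside request handling that holds the namespace lock.
        synchronized void sweep() {
            while (clock.size() > capacity) {
                CachedDir candidate = clock.pollFirst();
                if (candidate.referenced) {
                    candidate.referenced = false;    // second chance: approximates LRU
                    clock.addLast(candidate);
                } else {
                    evict(candidate);                // whole directory leaves memory
                }
            }
        }

        private void evict(CachedDir dir) {
            // The directory's inodes are already persistent in LevelDB, so eviction
            // only drops the in-memory copies (placeholder here).
        }
    }

Because replacement is directory-at-a-time, a hit on a cached directory also answers "does this file exist?" without touching disk, which is the point of the third bullet above.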
Benchmark description
• NNThroughputBenchmark
• No RPC cost: calls FileSystem methods directly
• All operations are generated in BFS order
• Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
• Normal HDFS client calls (see the load sketch after this list)
• Each thread either works in its own namespace or chooses targets randomly
• Load generator based on real cluster traces (in progress)
• Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
• E.g. GridMix; expect little degradation when most of the work is data transfer
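For the "normal HDFS client calls" load, here is a hedged sketch of the shape of such a metadata-only generator using standard Hadoop FileSystem calls; this is not NNThroughputBenchmark or the YCSB++ generator, and the paths, thread count, and file count are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative metadata-only load: each thread works in its own subtree and
    // issues create/getFileInfo-style calls against the Namenode.
    public class MetadataLoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up fs.defaultFS
            FileSystem fs = FileSystem.get(conf);
            int threads = 8, filesPerThread = 1000;

            Thread[] workers = new Thread[threads];
            for (int t = 0; t < threads; t++) {
                final int id = t;
                workers[t] = new Thread(() -> {
                    try {
                        Path dir = new Path("/bench/thread-" + id);
                        fs.mkdirs(dir);
                        for (int i = 0; i < filesPerThread; i++) {
                            Path file = new Path(dir, "file-" + i);
                            fs.create(file).close();           // create & close: pure metadata work
                            fs.getFileStatus(file);            // getFileInfo-style read
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            fs.close();
        }
    }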
Categories of tests
• Everything fits in memory
• Goal: should be almost the same as the current NN
• Working set does not fit in memory or changes over time
• Study various cache replacement policies
• Need good traces from real clusters to see patterns of hot, warm, and cold data
Experiment Setup
• Hardware description (Susitna)
• CPU: AMD Opteron 6272, 64 bit, 16 MB L2, 16-core 2.1 GHz
• SSD: Crucial M4-CT064M4SSD2 SSD, 64 GB, SATA 6.0Gb/s
• (In progress) Use disks in future experiments
• Heap size is set to 1GB
• NNThroughputBenchmark
• No RPC cost: calls FileSystem methods directly
• All operations are generated in BFS order
• Multiple threads, but each thread gets one portion of the work
• Each directory contains 100 subdirs and 100 files
• Named sequentially: ThroughputBenchDir1, ThroughputBench1
• LevelDB NN
• Cache monitor thread starts replacement when 90% full
Create & close 2.4M files – all fit in cache
[Chart: throughput (ops/sec) vs. number of threads (2, 4, 8, 16) for the original NN and the LevelDB NN.]
• Note: files are not accessed, but their parent dirs clearly are
• Note: the old NN and the LevelDB NN peak at different thread counts
• Degradation in peak throughput is 13.5%
Create 9.6M files: 1% fits in cache
• Old NN with 8 threads and LevelDB NN with 16 threads.
• Performance remains about the same using LevelDB
• The original Namenode’s throughput drops to zero when memory is exhausted
[Chart: throughput (ops/sec) over time (seconds) while creating 9.6M files, for the original NN and the LevelDB NN.]
GetFileInfo
• ListStatus of the first 600K of 2.4M files
• Each thread works on a different part of the tree
• Original NN: everything fits in memory (of course)
• LevelDB NN: two cases: (1) all fits in cache, (2) half fits
• Half fits: 10%-20% degradation - the cache is constantly replaced
[Chart: throughput (ops/sec) vs. number of threads (2-32) for the original NN, the LevelDB NN with everything in cache, and the LevelDB NN with half in cache.]
Benchmarks that remain
• NNThroughputBenchmark
• No RPC cost: calls FileSystem methods directly
• All operations are generated in BFS order
• Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
• Normal HDFS client calls
• Each thread either works in its own namespace or chooses targets randomly
• Load generator based on real cluster traces (in progress)
• Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
• E.g. GridMix; expect little degradation when most of the work is data transfer
Summary
• Now that the NN is HA, removing the namespace memory limitation is one of the most important problems to solve
• LSM (LevelDB) has worked out quite well
• Initial experiments have shown good results
• Need further benchmarks especially on how effective caching is for
different workloads and patterns
• Other LSM implementations? (e.g. HBase’s Java LSM)
• Work is done on branch 0.23
• Graduate-student-quality prototype (by a very good graduate student)
• But worked closely with the HDFS experts at Hortonworks
• Goal of internship was to see how well the idea worked
• Hortonworks plans to take this to the next stage once more experiments
are completed.
Q&A
• Contact: lxiao+@cs.cmu.edu
• We’d love to get trace stats from your cluster
• Simple Java program to run against your audit logs (sketch below)
• Can also run as MapReduce jobs
• Extracts metadata operation stats without exposing sensitive info
• Please contact me if you could help!
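A minimal sketch of the kind of audit-log summarizer mentioned above: it counts operations by the cmd= field of HDFS audit log lines and reports only aggregate counts (no paths, users, or IPs). The regex and single-file reading are assumptions; the actual tool and its MapReduce variant would differ.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative audit-log summarizer: counts metadata operations per cmd= value.
    public class AuditLogStats {
        private static final Pattern CMD = Pattern.compile("cmd=([A-Za-z]+)");

        public static void main(String[] args) throws Exception {
            Map<String, Long> counts = new TreeMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = CMD.matcher(line);
                    if (m.find()) {
                        counts.merge(m.group(1), 1L, Long::sum);
                    }
                }
            }
            // Only aggregate counts are printed; no src/dst paths, users, or IPs.
            counts.forEach((cmd, n) -> System.out.println(cmd + "\t" + n));
        }
    }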