REMOVING THE NAMENODE'S MEMORY LIMITATION
Lin Xiao
Intern@Hortonworks
PhD student @ Carnegie Mellon University
8/22/2013
About Me: Lin Xiao
• PhD student at CMU
• Advisor: Garth Gibson
• Thesis area – scalable distributed file systems
• Intern at Hortonworks
• Intern project: removing the Namenode memory limitation
• Email: lxiao+@cs.cmu.edu
Big Data
• We create 2.5 × 10^18 bytes of data per day [IBM]
• Sloan Digital Sky Survey: 200GB/night
• Facebook: 240 billion photos as of Jan 2013
• 250 million photos uploaded daily
• Cloud storage
• Amazon: 2 trillion objects, peak 1.1 million ops/sec
• Need scalable storage systems
• Scalable metadata <- focus of this presentation
• Scalable storage
• Scalable IO
Scalable Storage Systems
• Separate data and metadata servers
• More data nodes for higher throughput & capacity
• Bulk of the work – the IO path – is done by data servers
• Not much work added to metadata servers?
Federated HDFS
• Namenodes (MDS) each see their own namespace (NS)
• Each datanode can serve all namenodes
[Diagram: federated HDFS – multiple Namenodes, each owning its own namespace and block pool, with a shared set of Datanodes serving all of them.]
Single Namenode
• Stores all metadata in memory
• Design is simple
• Provides low-latency, high-throughput metadata operations
• Supports up to 3K data servers
• Hadoop clusters make it affordable to store old data
• Cold data is stored in the cluster for a long time
• Takes up memory space but is rarely used
• Growth of data size can exceed throughput
• Goal: remove the space limit while maintaining similar performance
Metadata in Namenode
• Namespace
• Stored as a linked tree structure of inodes (simplified sketch below)
• Every operation traverses the tree from the top
• Blocks Map: block_id to location mapping
• Handled separately because of the huge number of blocks
• Datanode status
• IP address, capacity, load, heartbeat status, block report status
• Leases
• The Namespace and Blocks Map use the majority of memory
• This talk will focus on the Namespace
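A minimal sketch of the linked inode tree described above, assuming illustrative names (SimpleINode, resolve) rather than HDFS's actual INode classes; it only shows how every lookup walks the tree from the top.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative only: not the real HDFS INode/INodeDirectory classes.
    class SimpleINode {
        final String name;
        final boolean isDirectory;
        final Map<String, SimpleINode> children = new TreeMap<>(); // empty for files

        SimpleINode(String name, boolean isDirectory) {
            this.name = name;
            this.isDirectory = isDirectory;
        }

        // Every namespace operation resolves the path from the root, one component at a time.
        static SimpleINode resolve(SimpleINode root, String path) {
            SimpleINode current = root;
            for (String component : path.split("/")) {
                if (component.isEmpty()) continue;            // skip leading "/"
                if (current == null || !current.isDirectory) return null;
                current = current.children.get(component);
            }
            return current;                                   // null if any component is missing
        }

        public static void main(String[] args) {
            SimpleINode root = new SimpleINode("", true);
            SimpleINode user = new SimpleINode("user", true);
            root.children.put("user", user);
            user.children.put("data.txt", new SimpleINode("data.txt", false));
            System.out.println(resolve(root, "/user/data.txt") != null);  // prints true
        }
    }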
Problem and Proposed Solution
• Problem:
• Remove the namespace limit while maintaining similar performance when the working set fits in memory
• Solution
• Retain the same namespace tree structure
• Store the namespace in a persistent store using an LSM tree (LevelDB)
• No separate edit logs or checkpoints
• All inodes and their updates are persisted via LevelDB
• Fast startup, at the cost of slow initial operations
• Could prefetch inodes into memory
• Do not expect customers to drastically reduce the actual heap size
• A larger heap eases transitions between different working sets as applications and workloads change
• A customer may occasionally run queries against cold data
New Namenode Architecture
• Namespace
• Same as before, but only part of the tree is in memory
• On a cache miss, read from LevelDB
• Edit logs and checkpoints are replaced by LevelDB
• LevelDB is updated on every inode change
• Key: <parent_inode_number + name> (see the key-encoding sketch below)
[Diagram: old architecture – Namenode keeps all inodes in memory and persists changes to edit logs; new architecture – Namenode caches a subset of inodes, with LevelDB (write buffer + WAL + sorted tables) as the persistent namespace store.]
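A hedged sketch of the <parent_inode_number + name> key scheme and the per-change LevelDB update. It assumes the iq80 pure-Java LevelDB binding and a simple 8-byte-prefix encoding; the class and method names (NamespaceStoreSketch, writeInode) are illustrative, not the prototype's code.

    import java.io.File;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import org.iq80.leveldb.DB;
    import org.iq80.leveldb.Options;
    import org.iq80.leveldb.impl.Iq80DBFactory;

    // Illustrative namespace store: every inode create/update becomes a LevelDB put.
    public class NamespaceStoreSketch {
        private final DB db;

        public NamespaceStoreSketch(File dir) throws IOException {
            db = Iq80DBFactory.factory.open(dir, new Options().createIfMissing(true));
        }

        // 8-byte big-endian parent inode number followed by the child's name.
        static byte[] key(long parentInode, String childName) {
            byte[] name = childName.getBytes(StandardCharsets.UTF_8);
            return ByteBuffer.allocate(8 + name.length).putLong(parentInode).put(name).array();
        }

        // Called on every inode change; the value is the serialized inode (format not shown).
        public void writeInode(long parentInode, String childName, byte[] serializedInode) {
            db.put(key(parentInode, childName), serializedInode);
        }

        public byte[] readInode(long parentInode, String childName) {
            return db.get(key(parentInode, childName));   // null on a miss
        }

        public void close() throws IOException {
            db.close();
        }
    }

Because the key sorts on the parent inode number first, all children of one directory are contiguous in the key space, which lets a cache miss on a directory be served by a single range scan.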
Comparison w/Traditional FileSystem
• Traditional File Systems
• VFS layer keeps inode and directory entry cache
• Goal is to support the workload of a single machine
• Relatively large number of files
• Supports applications from a single machine or, in the case of NFS, from a larger number of client machines
• Much smaller workload and size compared to Hadoop use cases
• LevelDB based Namenode
• Supports the very heavy traffic of a Hadoop cluster
• Keeps a much larger number of inodes in memory
• Cache replacement policies suited to the Hadoop workload
• Data is in Datanodes
LevelDB
• A fast key-value storage library written at Google
• Basic operations: get, put, delete
• Concurrency: single process w/multiple threads
• By default, writes are asynchronous
• As long as the machine doesn’t crash, data is safe
• Supports synchronous writes
• No separate sync() operation
• Can be implemented with a synchronous write/delete (see the sketch below)
• Supports batch updates
• Data is automatically compressed using Snappy
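A small sketch of the LevelDB behaviors listed above (asynchronous default, synchronous writes via WriteOptions, atomic batches), again assuming the iq80 Java binding; the keys and values are placeholders.

    import java.io.File;
    import java.io.IOException;

    import org.iq80.leveldb.DB;
    import org.iq80.leveldb.Options;
    import org.iq80.leveldb.WriteBatch;
    import org.iq80.leveldb.WriteOptions;
    import org.iq80.leveldb.impl.Iq80DBFactory;

    public class LevelDbBasics {
        public static void main(String[] args) throws IOException {
            DB db = Iq80DBFactory.factory.open(new File("demo-db"),
                                               new Options().createIfMissing(true));
            try {
                byte[] k = Iq80DBFactory.bytes("inode:1:foo");
                byte[] v = Iq80DBFactory.bytes("serialized-inode");

                db.put(k, v);                                   // asynchronous by default
                // No separate sync(): a synchronous write (or delete) plays that role.
                db.put(k, v, new WriteOptions().sync(true));

                // Batch several updates so they are applied atomically.
                WriteBatch batch = db.createWriteBatch();
                try {
                    batch.put(Iq80DBFactory.bytes("inode:1:bar"), v);
                    batch.delete(k);                            // delete applied in the same batch
                    db.write(batch, new WriteOptions().sync(true));
                } finally {
                    batch.close();
                }

                System.out.println(db.get(k) == null);          // true: deleted in the batch
            } finally {
                db.close();
            }
        }
    }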
Cache Replacement Policy
• Only whole directories are replaced in or out
• Hot dirs stay entirely in cache; others require a LevelDB scan
• Future – don’t cache very large dirs?
• No need to read from disk to check file existence
• LRU replacement policy
• Uses CLOCK to approximate LRU and reduce cost (sketch below)
• Separate thread for cache replacement
• Starts replacement when a threshold is exceeded
• Moves eviction out of sessions that hold the lock
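A sketch of CLOCK-approximated LRU at whole-directory granularity, run by a separate monitor thread once the threshold is crossed; all names (DirectoryClockCache, sweep) are illustrative assumptions, not the prototype's classes.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative CLOCK cache over whole directories; not the prototype's code.
    class DirectoryClockCache {
        static final class CachedDir {
            final long inodeNumber;
            volatile boolean referenced = true;      // set on access, cleared by the sweep
            CachedDir(long inodeNumber) { this.inodeNumber = inodeNumber; }
        }

        private final Deque<CachedDir> clock = new ArrayDeque<>();
        private final int capacity;

        DirectoryClockCache(int capacity) { this.capacity = capacity; }

        synchronized void add(CachedDir dir) { clock.addLast(dir); }

        void recordAccess(CachedDir dir) { dir.referenced = true; }

        // Run by a separate monitor thread (e.g. when the cache is 90% full), so
        // eviction happens outside request handling that holds the namespace lock.
        synchronized void sweep() {
            while (clock.size() > capacity) {
                CachedDir candidate = clock.pollFirst();
                if (candidate.referenced) {
                    candidate.referenced = false;    // second chance: approximates LRU
                    clock.addLast(candidate);
                } else {
                    evict(candidate);                // whole directory leaves memory
                }
            }
        }

        private void evict(CachedDir dir) {
            // The directory's inodes are already persistent in LevelDB, so eviction
            // only drops the in-memory copies (placeholder here).
        }
    }

Because replacement is directory-at-a-time, a hit on a cached directory also answers "does this file exist?" without touching disk, which is the point of the third bullet above.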
Benchmark description
• NNThroughputBenchmark
• No RPC cost: calls FileSystem methods directly
• All operations are generated in BFS order
• Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
• Normal HDFS client calls (see the load sketch after this list)
• Each thread either works in its own namespace or chooses targets randomly
• Load generator based on real cluster traces (in progress)
• Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
• E.g. GridMix; expect little degradation when most of the work is data transfer
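For the "normal HDFS client calls" load, here is a hedged sketch of the shape of such a metadata-only generator using standard Hadoop FileSystem calls; this is not NNThroughputBenchmark or the YCSB++ generator, and the paths, thread count, and file count are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative metadata-only load: each thread works in its own subtree and
    // issues create/getFileInfo-style calls against the Namenode.
    public class MetadataLoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up fs.defaultFS
            FileSystem fs = FileSystem.get(conf);
            int threads = 8, filesPerThread = 1000;

            Thread[] workers = new Thread[threads];
            for (int t = 0; t < threads; t++) {
                final int id = t;
                workers[t] = new Thread(() -> {
                    try {
                        Path dir = new Path("/bench/thread-" + id);
                        fs.mkdirs(dir);
                        for (int i = 0; i < filesPerThread; i++) {
                            Path file = new Path(dir, "file-" + i);
                            fs.create(file).close();           // create & close: pure metadata work
                            fs.getFileStatus(file);            // getFileInfo-style read
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            fs.close();
        }
    }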
Categories of tests
• Everything fits in memory
• Goal: should be almost the same as the current NN
• Working set does not fit in memory or changes over time
• Study various cache replacement policies
• Need good traces from real clusters to see patterns of hot, warm, and cold data
Experiment Setup
• Hardware description (Susitna)
• CPU: AMD Opteron 6272, 64 bit, 16 MB L2, 16-core 2.1 GHz
• SSD: Crucial M4-CT064M4SSD2 SSD, 64 GB, SATA 6.0Gb/s
• (In progress) Use disks in future experiments
• Heap size is set to 1GB
• NNThroughputBenchmark
• No RPC cost: calls FileSystem methods directly
• All operations are generated in BFS order
• Multiple threads, but each thread gets one portion of the work
• Each directory contains 100 subdirs and 100 files
• Named sequentially: ThroughputBenchDir1, ThroughputBench1
• LevelDB NN
• Cache monitor thread starts replacement when 90% full
Create & close 2.4M files – all fit in cache
[Chart: throughput (ops/sec) vs. number of threads (2, 4, 8, 16) for the original NN and the LevelDB NN.]
• Note: files are not accessed, but their parent dirs clearly are
• Note: the old NN and the LevelDB NN peak at different thread counts
• Degradation in peak throughput is 13.5%
Create 9.6M files: 1% fits in cache
• Old NN with 8 threads and LevelDB NN with 16 threads.
• Performance remains about the same using LevelDB
• The original Namenode’s throughput drops to zero when memory is exhausted
[Chart: throughput (ops/sec) over time (seconds) while creating 9.6M files, for the original NN and the LevelDB NN.]
GetFileInfo
• ListStatus of the first 600K of 2.4M files
• Each thread works on a different part of the tree
• Original NN: everything fits in memory (of course)
• LevelDB NN: two cases: (1) all fits in cache, (2) half fits
• Half fits: 10%-20% degradation - the cache is constantly replaced
[Chart: throughput (ops/sec) vs. number of threads (2-32) for the original NN, the LevelDB NN with everything in cache, and the LevelDB NN with half in cache.]
Benchmarks that remain
• NNThroughputBenchmark
• No RPC cost: calls FileSystem methods directly
• All operations are generated in BFS order
• Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
• Normal HDFS client calls
• Each thread either works in its own namespace or chooses targets randomly
• Load generator based on real cluster traces (in progress)
• Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
• E.g. GridMix; expect little degradation when most of the work is data transfer
Summary
• Now that the NN is HA, removing the namespace memory limitation is one of the most important problems to solve
• LSM (LevelDB) has worked out quite well
• Initial experiments have shown good results
• Need further benchmarks especially on how effective caching is for
different workloads and patterns
• Other LSM implementations? (e.g. HBase’s Java LSM)
• Work is done on branch 0.23
• Graduate-student-quality prototype (by a very good graduate student)
• But worked closely with the HDFS experts at Hortonworks
• Goal of internship was to see how well the idea worked
• Hortonworks plans to take this to the next stage once more experiments
are completed.
Q&A
• Contact: lxiao+@cs.cmu.edu
• We’d love to get trace stats from your cluster
• Simple Java program to run against your audit logs (sketch below)
• Can also run as MapReduce jobs
• Extracts metadata operation stats without exposing sensitive info
• Please contact me if you could help!
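A minimal sketch of the kind of audit-log summarizer mentioned above: it counts operations by the cmd= field of HDFS audit log lines and reports only aggregate counts (no paths, users, or IPs). The regex and single-file reading are assumptions; the actual tool and its MapReduce variant would differ.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative audit-log summarizer: counts metadata operations per cmd= value.
    public class AuditLogStats {
        private static final Pattern CMD = Pattern.compile("cmd=([A-Za-z]+)");

        public static void main(String[] args) throws Exception {
            Map<String, Long> counts = new TreeMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = CMD.matcher(line);
                    if (m.find()) {
                        counts.merge(m.group(1), 1L, Long::sum);
                    }
                }
            }
            // Only aggregate counts are printed; no src/dst paths, users, or IPs.
            counts.forEach((cmd, n) -> System.out.println(cmd + "\t" + n));
        }
    }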