August 2013 HUG: Removing the NameNode's memory limitation

The current HDFS NameNode stores all of its metadata in RAM. This has allowed Hadoop clusters to scale to 100K concurrent tasks. However, memory limits the total number of files a single NameNode can store. While Federation allows one to create multiple volumes with additional NameNodes, there is a need to scale a single namespace and also to store multiple namespaces in a single NameNode.
This talk describes a project that removes the space limits while maintaining similar performance by caching only the working set, or hot metadata, in NameNode memory. We believe this approach will be very effective because the subset of files that is frequently accessed is much smaller than the full set of files stored in HDFS.
In this talk we describe our overall approach and give details of our implementation, along with some early performance numbers.

Speaker: Lin Xiao, PhD student at Carnegie Mellon University, intern at Hortonworks


Presentation Transcript

  • REMOVING THE NAMENODE'S MEMORY LIMITATION
    Lin Xiao, Intern @ Hortonworks, PhD student @ Carnegie Mellon University
    8/22/2013
  • About Me: Lin Xiao
    • PhD student at CMU; advisor: Garth Gibson
    • Thesis area: scalable distributed file systems
    • Intern at Hortonworks; intern project: removing the Namenode memory limitation
    • Email: lxiao+@cs.cmu.edu
  • Big Data
    • We create 2.5×10^18 bytes of data per day [IBM]
    • Sloan Digital Sky Survey: 200GB/night
    • Facebook: 240 billion photos as of Jan 2013; 250 million photos uploaded daily
    • Cloud storage: Amazon holds 2 trillion objects, peak 1.1 million op/sec
    • Need scalable storage systems
      • Scalable metadata <- focus of this presentation
      • Scalable storage
      • Scalable IO
  • Scalable Storage Systems
    • Separate data and metadata servers
    • More data nodes for higher throughput & capacity
    • Bulk of the work (the IO path) is done by data servers
    • Not much work added to metadata servers?
  • Federated HDFS
    • Namenodes (MDS) see their own namespace (NS)
    • Each datanode can serve all namenodes
    [Diagram: multiple federated Namenodes, each with its own namespace, sharing a common pool of Datanodes]
  • Single Namenode
    • Stores all metadata in memory
      • Design is simple
      • Provides low-latency and high-throughput metadata operations
      • Supports up to 3K data servers
    • Hadoop clusters make it affordable to store old data
      • Cold data is stored in the cluster for a long time
      • Takes up memory space but is rarely used
    • Growth of data size can exceed throughput
    • Goal: remove the space limits while maintaining similar performance
  • Metadata in Namenode
    • Namespace
      • Stored as a linked tree structure of inodes
      • Always visited from the top for any operation
    • Blocks Map: block_id-to-location mapping
      • Handled separately because of the huge number of blocks
    • Datanode status: IP address, capacity, load, heartbeat status, block report status
    • Leases
    • The Namespace and Blocks Map use the majority of memory
    • This talk will focus on the Namespace
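The "always visit from the top" point above can be sketched as a path walk over an inode tree. This is a minimal illustration, not HDFS code; the `INode` class and `resolve` function are hypothetical stand-ins for the Namenode's in-memory structures.

```python
# Minimal sketch (not HDFS code): resolving a path by walking the
# inode tree from the root, as the Namenode does for every operation.
class INode:
    def __init__(self, inode_id, name):
        self.inode_id = inode_id
        self.name = name
        self.children = {}          # name -> INode (directories only)

def resolve(root, path):
    """Walk from the root one component at a time; None if any is missing."""
    node = root
    for component in path.strip("/").split("/"):
        if component == "":
            continue                # handles the root path "/"
        node = node.children.get(component)
        if node is None:
            return None
    return node

# Build a tiny tree: / -> user -> data.txt
root = INode(1, "/")
user = INode(2, "user")
root.children["user"] = user
user.children["data.txt"] = INode(3, "data.txt")

print(resolve(root, "/user/data.txt").inode_id)   # 3
print(resolve(root, "/user/missing"))             # None
```

Every operation pays for this walk from the top, which is why the whole tree traditionally had to sit in RAM.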
  • Problem and Proposed Solution
    • Problem: remove the namespace limit while maintaining similar performance when the working set fits in memory
    • Solution
      • Retain the same namespace tree structure
      • Store the namespace in a persistent store using an LSM tree (LevelDB)
      • No separate edit logs or checkpoints: all inodes and their updates are persisted via LevelDB
      • Fast startup, at the cost of slow initial operations; could prefetch inodes
      • Do not expect customers to drastically reduce the actual heap size
        • A larger heap eases transitions between working sets as applications and workloads change
        • A customer may occasionally run queries against cold data
  • New Namenode Architecture
    • Namespace: same as before, but only part of the tree is in memory
    • On a cache miss, read from LevelDB
    • Edit logs and checkpoints are replaced by LevelDB
      • Update LevelDB for every inode change
      • Key: <parent_inode_number + name>
    [Diagram: old Namenode (in-memory inodes + edit logs) vs. new Namenode (in-memory inode cache backed by LevelDB with its write buffer and WAL)]
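The <parent_inode_number + name> key scheme above can be sketched with a plain map standing in for LevelDB. Because LevelDB keeps keys in sorted order, all children of a directory share a key prefix and form one contiguous range, so listing a directory is a single range scan. The encoding below (8-byte big-endian parent id) is an assumption for illustration, not the prototype's actual byte layout.

```python
# Sketch of the slide's key scheme: each inode is stored under
# <parent_inode_number + name>. A dict stands in for LevelDB, whose
# keys are likewise sorted, so a directory's children are contiguous.
import struct

store = {}  # stand-in for LevelDB

def key(parent_id, name):
    # 8-byte big-endian parent id: numeric order matches byte order
    return struct.pack(">Q", parent_id) + name.encode("utf-8")

def put_inode(parent_id, name, inode_bytes):
    store[key(parent_id, name)] = inode_bytes   # real code: db.put(...)

def list_dir(parent_id):
    # Range scan over [parent_id, parent_id + 1): exactly the children
    lo = struct.pack(">Q", parent_id)
    hi = struct.pack(">Q", parent_id + 1)
    return sorted(k[8:].decode() for k in store if lo <= k < hi)

put_inode(2, "b.txt", b"...")
put_inode(2, "a.txt", b"...")
put_inode(3, "c.txt", b"...")
print(list_dir(2))   # ['a.txt', 'b.txt']
```

The fixed-width, big-endian parent id matters: it keeps byte-wise key comparison consistent with numeric parent order, so the range scan never picks up another directory's entries.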
  • Comparison with Traditional File Systems
    • Traditional file systems
      • VFS layer keeps inode and directory-entry caches
      • Goal is to support the workload of a single machine (or, in the case of NFS, of a larger number of client machines)
      • Relatively large number of files, but a much smaller workload and size compared to Hadoop use cases
    • LevelDB-based Namenode
      • Supports the very large traffic of a Hadoop cluster
      • Keeps a much larger number of inodes in memory
      • Cache replacement policies to suit the Hadoop workload
      • Data is in Datanodes
  • LevelDB
    • A fast key-value storage library written at Google
    • Basic operations: get, put, delete
    • Concurrency: single process with multiple threads
    • By default, writes are asynchronous
      • As long as the machine doesn't crash, it's safe
      • Synchronous writes are supported
    • No separate sync() operation; can be implemented by a synchronous write/delete
    • Supports batch updates
    • Data is automatically compressed using Snappy
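The basic operations and batch updates named on the slide can be sketched with an in-memory stand-in. The `KVStore` class below is a hypothetical mock, not the LevelDB API; with a real binding the calls would go to the database, and the batch would be made durable atomically via the WAL.

```python
# Stand-in for the LevelDB operations named on the slide: get, put,
# delete, and all-or-nothing batch writes. A plain dict replaces the
# real library for illustration only.
class KVStore:
    def __init__(self):
        self.data = {}

    def put(self, k, v):
        self.data[k] = v

    def get(self, k):
        return self.data.get(k)

    def delete(self, k):
        self.data.pop(k, None)

    def write_batch(self, ops):
        """Apply a list of ('put', k, v) / ('delete', k) ops atomically:
        validate everything first, then apply, so a bad op changes nothing."""
        for op in ops:
            if op[0] not in ("put", "delete"):
                raise ValueError("unknown op: %r" % (op[0],))
        for op in ops:
            if op[0] == "put":
                self.data[op[1]] = op[2]
            else:
                self.data.pop(op[1], None)

db = KVStore()
db.write_batch([("put", b"/a", b"1"), ("put", b"/b", b"2"), ("delete", b"/a")])
print(db.get(b"/b"))   # b'2'
print(db.get(b"/a"))   # None
```

Atomic batches are what let the prototype drop separate edit logs: a rename, for example, can delete the old <parent + name> key and put the new one in a single durable batch.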
  • Cache Replacement Policy
    • Only whole directories are moved in or out of the cache
      • Hot dirs are entirely in cache; others require a LevelDB scan
      • Future: don't cache very large dirs?
      • No need to read from disk to check file existence
    • LRU replacement policy, approximated with CLOCK to reduce cost
    • Separate thread for cache replacement
      • Starts replacement when a threshold is exceeded
      • Keeps eviction out of sessions that hold the lock
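The CLOCK approximation mentioned above can be sketched as follows. This is a generic second-chance CLOCK over cached directory names, an assumption about the shape of the policy rather than the prototype's actual code; the threshold-driven background thread is omitted.

```python
# Hedged sketch of CLOCK as an approximation of LRU: each cached
# directory carries a reference bit; the clock hand sweeps, clearing
# bits, and evicts the first entry whose bit is already 0.
class ClockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []        # cached directory names
        self.ref = {}            # name -> reference bit
        self.hand = 0

    def access(self, name):
        if name in self.ref:
            self.ref[name] = 1                  # hit: mark recently used
        else:
            if len(self.entries) >= self.capacity:
                self._evict()
            self.entries.append(name)
            self.ref[name] = 1

    def _evict(self):
        while True:
            victim = self.entries[self.hand % len(self.entries)]
            if self.ref[victim] == 0:
                self.entries.remove(victim)     # bit clear: evict
                del self.ref[victim]
                return victim
            self.ref[victim] = 0                # second chance
            self.hand += 1

cache = ClockCache(2)
cache.access("/hot")
cache.access("/warm")
cache.access("/hot")       # hit
cache.access("/cold")      # full: triggers a second-chance sweep
print(sorted(cache.entries))
```

CLOCK needs only one bit set on each hit instead of moving entries to the head of a list, which is why it is cheaper than exact LRU under heavy concurrent access.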
  • Benchmark Description
    • NNThroughputBenchmark
      • No RPC cost; calls FileSystem methods directly
      • All operations are generated in BFS order
      • Each thread gets one portion of the work
    • NN load generator using the YCSB++ framework (in progress)
      • Normal HDFS client calls
      • Threads either work in their own namespace or choose randomly
    • Load generator based on real cluster traces (in progress)
      • Can you help me get traces from your cluster?
    • Traditional Hadoop benchmarks (in progress), e.g. Gridmix
      • Expect little degradation when most work is data transfer
  • Categories of Tests
    • Everything fits in memory
      • Goal: should be almost the same as the current NN
    • Working set does not fit in memory or changes over time
      • Study various cache replacement policies
      • Need good traces from real clusters to see patterns of hot, warm and cold data
  • Experiment Setup
    • Hardware (Susitna)
      • CPU: AMD Opteron 6272, 64-bit, 16 MB L2, 16-core, 2.1 GHz
      • SSD: Crucial M4-CT064M4SSD2, 64 GB, SATA 6.0 Gb/s
      • (In progress) Use disks in future experiments
    • Heap size is set to 1GB
    • NNThroughputBenchmark
      • No RPC cost; calls FileSystem methods directly
      • All operations are generated in BFS order
      • Multiple threads, but each thread gets one portion of the work
      • Each directory contains 100 subdirs and 100 files, named sequentially: ThroughputBenchDir1, ThroughputBench1
    • LevelDB NN: cache monitor thread starts replacement when 90% full
  • Create & Close 2.4M Files (all fit in cache)
    [Chart: throughput (ops/sec) vs. number of threads (2-16), original vs. w/LevelDB]
    • Note: files are not accessed, but their parent dirs clearly are
    • Note: the old NN and the LevelDB NN peak at different thread counts
    • Degradation of peak throughput is 13.5%
  • Create 9.6M Files (1% fits in cache)
    [Chart: throughput (ops/sec) over time in seconds, Original vs. LevelDB NN]
    • Old NN run with 8 threads, LevelDB NN with 16 threads
    • Performance remains about the same using LevelDB
    • The Namenode's throughput drops to zero when memory is exhausted
  • GetFileInfo
    [Chart: throughput (ops/sec) vs. number of threads (2-32): Original, FitCache, HalfInCache]
    • ListStatus of the first 600K of 2.4M files; each thread works on a different part of the tree
    • Original NN: all fit in memory (of course)
    • LevelDB NN, 2 cases: (1) all fit, (2) half fit
    • Half fit: 10%-20% degradation, as the cache is constantly replaced
  • Benchmarks That Remain
    • NNThroughputBenchmark
      • No RPC cost; calls FileSystem methods directly
      • All operations are generated in BFS order
      • Each thread gets one portion of the work
    • NN load generator using the YCSB++ framework (in progress)
      • Normal HDFS client calls
      • Threads either work in their own namespace or choose randomly
    • Load generator based on real cluster traces (in progress)
      • Can you help me get traces from your cluster?
    • Traditional Hadoop benchmarks (in progress), e.g. Gridmix
      • Expect little degradation when most work is data transfer
  • Summary
    • Now that the NN is HA, removing the namespace memory limitation is one of the most important problems to solve
    • LSM (LevelDB) has worked out quite well
      • Initial experiments have shown good results
      • Further benchmarks are needed, especially on how effective caching is for different workloads and patterns
      • Other LSM implementations? (e.g. HBase's Java LSM)
    • Work is done on branch 0.23
      • Graduate-student-quality prototype (very good graduate student)
      • But worked closely with the HDFS experts at Hortonworks
    • Goal of the internship was to see how well the idea worked
    • Hortonworks plans to take this to the next stage once more experiments are completed
  • Q&A
    • Contact: lxiao+@cs.cmu.edu
    • We'd love to get trace stats from your cluster
    • Simple Java program to run against your audit logs
      • Can also run as MapReduce jobs
      • Extracts metadata operation stats without exposing sensitive info
    • Please contact me if you can help!
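The stat extraction the speaker describes (counts of metadata operations, no sensitive paths or users) could look roughly like the Python sketch below. The speaker's actual tool is a Java program; this stand-in, the `cmd=` field pattern, and the sample lines are all illustrative assumptions, not the real tool or real log data.

```python
# Hypothetical sketch of an audit-log stat extractor: count metadata
# operations per command, keeping no paths or user names.
import re
from collections import Counter

# HDFS audit-log lines carry a "cmd=<operation>" field (assumed format)
CMD_RE = re.compile(r"\bcmd=(\w+)")

def op_counts(lines):
    counts = Counter()
    for line in lines:
        m = CMD_RE.search(line)
        if m:
            counts[m.group(1)] += 1   # only the operation name is kept
    return counts

# Illustrative sample lines, not from a real cluster
sample = [
    "2013-08-22 10:00:01 INFO FSNamesystem.audit: allowed=true cmd=open src=/user/a",
    "2013-08-22 10:00:02 INFO FSNamesystem.audit: allowed=true cmd=listStatus src=/user",
    "2013-08-22 10:00:03 INFO FSNamesystem.audit: allowed=true cmd=open src=/user/b",
]
print(op_counts(sample))
```

Because only operation names leave the cluster, such a tool can share workload shape (hot vs. cold access patterns) without exposing anything sensitive, which is the property the speaker highlights.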