Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia

  • The data nodes do not use RAID, just JBOD
  • Replication is rack aware
  • 48GB heap = 180M objects = 90M files + 90M blocks = 14PB of storage (50MB block size; includes the overhead of 3 replicas)
  • Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia

    1. HDFS Federation<br />Sanjay Radia, Hadoop Architect, Yahoo! Inc<br />1<br />
    2. Outline<br />HDFS - Quick overview<br />Scaling HDFS - Federation<br />Hadoop Components<br />
    3. 3<br />
    4. 4<br />HDFS<br />[Figure: blocks b1–b6 replicated across the datanodes]<br />Namespace Metadata & Journal<br />Backup Namenode<br />Namenode<br />Namespace<br />State<br />Block Map<br />Block ID → Block Locations<br />Hierarchical Namespace<br />File Name → Block IDs<br />Heartbeats & Block Reports<br />Datanodes<br />Block ID → Data<br />Horizontally Scale IO and Storage<br />
    5. 5<br />HDFS Client reads and writes<br />[Figure: blocks b1–b6 replicated across the datanodes]<br />Namenode<br />Namespace<br />State<br />Block Map<br />1 create<br />1 open<br />Client<br />Client<br />End-to-end checksum<br />2 read<br />2 write<br />write<br />write<br />Datanodes<br />
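The two Namenode maps on the preceding slides (file name to block IDs in the hierarchical namespace, block ID to datanode locations in the block map) and the client's open-then-read flow can be sketched as a toy model; every class and variable name here is invented for illustration, and this is not HDFS's actual code:

```python
# Toy model of the Namenode's two metadata maps (not HDFS's actual code).
class NamenodeModel:
    def __init__(self):
        self.namespace = {}   # file name -> list of block IDs
        self.block_map = {}   # block ID -> set of datanodes holding a replica

    def create(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def block_received(self, datanode, block_id):
        # Datanodes report their blocks via heartbeats and block reports.
        self.block_map.setdefault(block_id, set()).add(datanode)

    def get_block_locations(self, path):
        # Client "1 open": resolve a file to (block ID, locations) pairs;
        # the client then does "2 read" directly against the datanodes.
        return [(b, sorted(self.block_map.get(b, set())))
                for b in self.namespace[path]]

nn = NamenodeModel()
nn.create("/logs/2011-02-05", ["b1", "b2"])
for dn in ("dn1", "dn2", "dn3"):
    nn.block_received(dn, "b1")
nn.block_received("dn2", "b2")
print(nn.get_block_locations("/logs/2011-02-05"))
```

A real client additionally verifies end-to-end checksums on read, as the slide notes.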
    6. HDFS Architecture: Computation close to the data<br />Hadoop Cluster<br />[Figure: a data set is split into Blocks 1–3; a MAP task runs on each block on the nodes that store it, and a Reduce step combines the map outputs into Results]<br />6<br />
    7. Quiz: What Is the Common Attribute?<br />7<br />
    8. HDFS actively maintains data reliability<br />[Figure: blocks b1–b6 replicated across the datanodes]<br />Namenode<br />Namespace<br />State<br />Block Map<br />Bad/lost block replica<br />Periodically check block checksums<br />1. replicate<br />3. blockReceived<br />2. copy<br />Datanodes<br />
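The re-replication flow above (detect a lost replica, pick a live source, copy to a new datanode, which then confirms with blockReceived) can be sketched as follows; the function and datanode names are made up, and real HDFS placement is also rack-aware, which this sketch ignores:

```python
# Illustrative sketch of the namenode's re-replication planning (not HDFS's
# real code): find blocks with fewer live replicas than the target, pick a
# live source, and schedule copies to datanodes that lack a replica.
def plan_replication(block_map, live_datanodes, target=3):
    tasks = []
    for block, replicas in sorted(block_map.items()):
        live = sorted(dn for dn in replicas if dn in live_datanodes)
        if 0 < len(live) < target:
            source = live[0]
            candidates = [dn for dn in sorted(live_datanodes) if dn not in live]
            for dest in candidates[: target - len(live)]:
                tasks.append((block, source, dest))   # the "2. copy" step
    return tasks

block_map = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn1", "dn4"}}
live = {"dn2", "dn3", "dn4", "dn5"}       # dn1 has failed
print(plan_replication(block_map, live))
```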
    9. Hadoop at Yahoo!<br />1M+ Monthly Hadoop Jobs<br />9<br />
    10. Scaling Hadoop<br />Early Gains<br /><ul><li>Simple design allowed rapid improvements</li></ul>Namespace is all in RAM, simpler locking<br />Improved memory usage in 0.16, JVM Heap configuration (Suresh Srinivas)<br />Growth in the number of files and storage is limited to what adding RAM to the namenode allows<br />50G heap = 200M “fs objects” = 100M names + 100M blocks<br /><ul><li>14PB of storage (50MB block size)
    11. 4K nodes</li></ul>- Job Tracker carries out both job lifecycle management and scheduling<br />Yahoo’s Response:<br /><ul><li>HDFS Federation: horizontal scaling of namespace (0.22)
    12. Next Generation of Map-Reduce - Complete overhaul of job tracker/task tracker</li></ul>Goal: <br /><ul><li>Clusters of 6000 nodes, 100,000 cores & 10k concurrent jobs, 100 PB raw storage per cluster</li></ul>6 May 2010<br />10<br />
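The sizing rule of thumb above can be checked with simple arithmetic; this sketch uses the slide's round numbers (50GB heap, 200M fs objects), and the per-object byte cost is derived from those numbers rather than measured:

```python
# Worked version of the namenode sizing arithmetic (illustrative; the
# bytes-per-object figure is derived from the slide's own round numbers).
GB, MB, PB = 10**9, 10**6, 10**15

heap_bytes = 50 * GB
fs_objects = 200 * 10**6            # 50G heap = 200M "fs objects"
bytes_per_object = heap_bytes // fs_objects
print(bytes_per_object)             # 250 bytes per name or block

names = blocks = fs_objects // 2    # 100M names + 100M blocks
block_size = 50 * MB
replication = 3
raw_storage = blocks * block_size * replication
print(raw_storage / PB)             # 15.0 PB raw with round numbers; the deck
                                    # quotes ~14PB (90M blocks in the notes)
```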
    13. Scaling the Name Service: Options<br />[Chart, not to scale: # names (100M to 20B) vs. # clients (1x to 100x), comparing options: All NS in memory; Archives; Partial NS (cache) in memory; Partial NS in memory with Namespace volumes; Multiple Namespace volumes; Distributed NNs; Separate Bmaps from NN. Annotations: good isolation properties; block reports for billions of blocks require rethinking the block layer]<br />11<br />
    14. Opportunity: Vertical & Horizontal scaling<br />12<br />Vertical scaling<br />More RAM, Efficiency in memory usage<br />First class archives (tar/zip like)<br />Partial namespace in main memory<br />Horizontal: Federation<br />Namenode<br />Horizontal scaling/federation benefits:<br />Scale<br />Isolation, Stability, Availability<br />Flexibility<br />Other Namenode implementations or non-HDFS namespaces<br />
    15. Datanode 1<br />Datanode 2<br />Datanode m<br />Pools n<br />Pools 1<br />Pools k<br />...<br />...<br />...<br /> Block Pools<br />Balancer<br />Block (Object) Storage Subsystem<br />Block (Object) Storage Subsystem<br /><ul><li>Shared storage provided as pools of blocks
    16. Namespaces (HDFS, others) use one or more block-pools
    17. Note: HDFS has 2 layers today – we are generalizing/extending it.</li></ul>Namespace<br />Foreign NS n<br /> NS1<br />...<br />...<br /> NS k<br />Block storage<br />13<br />
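The layering above — shared block storage exposed as pools, with each namespace owning one or more pools — can be sketched as a toy model; the pool, namespace, and datanode names are invented for the example:

```python
# Illustrative sketch of federated block pools (not HDFS's real code):
# every datanode stores blocks for several pools, and each namespace
# owns its own pools, so namespaces share storage without sharing blocks.
class Datanode:
    def __init__(self, name):
        self.name = name
        self.pools = {}  # block-pool ID -> set of block IDs stored here

    def store(self, pool_id, block_id):
        self.pools.setdefault(pool_id, set()).add(block_id)

namespace_pools = {"NS1": ["BP-1"], "NS2": ["BP-2"]}  # phase 1: one pool per NS
datanodes = [Datanode("dn1"), Datanode("dn2")]

# Both namespaces place blocks on the same shared datanodes.
datanodes[0].store("BP-1", "b1")
datanodes[0].store("BP-2", "b7")
datanodes[1].store("BP-1", "b1")

def blocks_of(namespace):
    # A namespace sees only the blocks in its own pools.
    return sorted({b for dn in datanodes
                     for pool in namespace_pools[namespace]
                     for b in dn.pools.get(pool, set())})

print(blocks_of("NS1"), blocks_of("NS2"))
```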
    18. 1st Phase: B-Pool management inside Namenode<br />Datanode 2<br />Datanode m<br />Datanode 1<br />...<br />...<br />...<br />Pools k<br />Pools n<br />Pools 1<br /> Block Pools<br />Balancer<br />NN-n<br />NN-k<br />NN-1<br />Foreign NS n<br /> NS1<br />...<br />...<br /> NS k<br />Future: Move Block mgt into separate nodes<br />14<br />
    19. Future: Move block management out<br />15<br />Datanode 1<br />Datanode 2<br />Datanode m<br />Pools n<br />Pools k<br />Pools 1<br />...<br />...<br />...<br /> Block Pools<br />Balancer<br />Foreign NS n<br /> NS1<br />...<br />...<br /> NS k<br />Easier to scale horizontally than the name server<br />1. Open<br />client<br />Block Manager<br />2. getBlockLocations<br />3. ReadBlock<br />
    20. What is an HDFS Cluster?<br />Current<br />HDFS Cluster<br />1 Namespace<br />A set of blocks<br />Implemented as<br />1 Namenode<br />Set of DNs<br />New<br />HDFS Cluster<br />N Namespaces<br /> Set of block-pools<br />Each block-pool is a set of blocks<br />Phase 1: 1 BP per NS<br />Implies N block-pools<br />Implemented as<br />N Namenodes<br />Set of DNs<br />Each DN stores the blocks for each block-pool<br />16<br />
    21. Managing Namespaces<br />HDFS Namespaces as a first class entity<br />Many, many namespaces: one per-user or per-project<br />Why? Because it can’t fit in a server? No<br />Pieces of data are often autonomous<br />Log data from different dates<br />Photos/videos loaded by a user<br />A user’s mail, or his home directory<br />The key is sharing the data<br />A global namespace is one way to do that – but even there we talk of several large “global” namespaces<br />Client-side mount table is another way to share<br />Shared mount-table => “global” shared view<br />Personalized mount-table => per-application view<br />Share the data that matters by mounting it<br />17<br />Plan 9, Spring OS:<br />had personalized namespaces<br />
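A client-side mount table can be modeled as longest-prefix resolution from a path to the namespace that serves it; the table entries and namenode addresses below are made-up examples (in Hadoop this idea surfaced as the viewfs client-side mount table):

```python
# Illustrative longest-prefix mount-table resolution (made-up entries;
# not the actual Hadoop viewfs implementation).
MOUNT_TABLE = {
    "/home":    "hdfs://nn1.cluster1",
    "/project": "hdfs://nn2.cluster1",
    "/data":    "hdfs://nn1.cluster2",   # mounted from another cluster
}

def resolve(path):
    """Return (namenode, path) for the longest matching mount point."""
    best = max((m for m in MOUNT_TABLE if path == m or path.startswith(m + "/")),
               key=len, default=None)
    if best is None:
        raise KeyError("no mount point for " + path)
    return MOUNT_TABLE[best], path

print(resolve("/home/sanjay/file"))
print(resolve("/data/logs/2011"))
```

A shared table of this kind gives every client the same "global" view, while a per-application table gives a personalized view, as the slide distinguishes.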
    22. 18<br />HDFS Federation Across Clusters <br />/<br />Application mount-table in Cluster 1<br />/<br />Application mount-table in Cluster 2<br />home<br />tmp<br />home<br />tmp<br />data<br />project<br />project<br />data<br />Cluster 2<br />Cluster 1<br />
    23. Nameserver as container for namespaces<br /><ul><li>Nameserver as a container for namespaces
    24. Each namespace with its own separate state</li></ul>Persistent state in shared storage (e.g. BookKeeper)<br /><ul><li>Each nameserver serves a set of namespaces
    25. Selected based on isolation and capacity
    26. A namespace can be moved between nameservers</li></ul>19<br />…<br />Nameserver<br />Nameserver<br />…<br />Shared persistent storage for namespace metadata<br /> (e.g. BookKeeper)<br />
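The container model above — nameservers serving namespaces whose persistent state lives in shared storage — can be sketched as follows; the class names and the in-memory "shared store" stand in for a real service such as BookKeeper:

```python
# Illustrative sketch of nameservers as containers for namespaces (not real
# code): metadata lives in shared persistent storage, so moving a namespace
# between nameservers means re-pointing, not copying, its state.
shared_store = {"NS1": {"/a": ["b1"]}, "NS2": {"/x": ["b9"]}}  # per-NS metadata

class Nameserver:
    def __init__(self):
        self.serving = set()   # namespaces currently served here

    def load(self, ns):
        self.serving.add(ns)   # state is read from the shared store, not copied

    def unload(self, ns):
        self.serving.discard(ns)

    def lookup(self, ns, path):
        assert ns in self.serving
        return shared_store[ns][path]

ns_a, ns_b = Nameserver(), Nameserver()
ns_a.load("NS1")
print(ns_a.lookup("NS1", "/a"))

# Move NS1 to another nameserver (e.g. for isolation or capacity):
ns_a.unload("NS1"); ns_b.load("NS1")
print(ns_b.lookup("NS1", "/a"))
```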
    27. Summary<br />Federated HDFS (Jira HDFS-1052)<br /><ul><li>Scale by adding independent Namenodes</li></ul>Preserves the robustness of the Namenodes<br />Not much code change to the Namenode<br /><ul><li>Generalizes the Block storage layer</li></ul>Analogous to SANs & LUNs<br />Can add other implementations of the Namenodes<br />Even other name services (HBase?)<br />Could move the Block management out of the Namenode in the future<br />But to truly scale to tens or hundreds of billions of blocks we need to rethink the block map and block reports<br /><ul><li>Benefits</li></ul>Scale number of file names and blocks<br />Improved isolation and hence availability<br />6 May 2010<br />20<br />
    28. Q & A<br />21<br />
