Federated HDFS

    • HDFS Federation
        Sanjay Radia, Hadoop Architect, Yahoo! Inc.
        Apache Hadoop India Summit 2011
    • Outline
        – HDFS: quick overview
        – Scaling HDFS: Federation
      Hadoop components
        – HDFS: distributed file system
        – MapReduce: distributed computation
        – HBase: column store
        – Pig: dataflow language
        – Hive: data warehouse
        – Zookeeper: distributed coordination
        – Avro: data serialization
        – Oozie: workflow
    • HDFS
        – Namenode: holds the namespace state and the block map
          – Hierarchical namespace: file name → block IDs
          – Block map: block ID → block locations
          – Namespace metadata & journal; Backup Namenode
        – Datanodes: store block data (block ID → data) and send heartbeats and block reports to the Namenode
          – Each block (b1 … b6) is replicated on several Datanodes
        – Horizontally scale I/O and storage
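The two Namenode tables on this slide can be modelled as two dictionaries: the namespace maps a file name to its block IDs, and the block map maps a block ID to the Datanodes that reported a replica. The sketch below is a toy in-memory model under that assumption; the class and method names (`ToyNamenode`, `block_report`, `locate`) are hypothetical and are not Hadoop's API.

```python
from collections import defaultdict


class ToyNamenode:
    """Hypothetical toy model of the Namenode's two tables (not Hadoop's API)."""

    def __init__(self) -> None:
        # Namespace: file name -> ordered list of block IDs.
        self.namespace: dict[str, list[str]] = {}
        # Block map: block ID -> set of Datanodes holding a replica.
        # Filled from block reports rather than from the namespace.
        self.block_map: dict[str, set[str]] = defaultdict(set)

    def create(self, path: str, block_ids: list[str]) -> None:
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode: str, block_ids: list[str]) -> None:
        # A Datanode reports every block it currently stores.
        for block_id in block_ids:
            self.block_map[block_id].add(datanode)

    def locate(self, path: str) -> list[tuple[str, list[str]]]:
        # Resolve a file to (block ID, sorted replica locations) pairs.
        return [(b, sorted(self.block_map.get(b, set())))
                for b in self.namespace[path]]


nn = ToyNamenode()
nn.create("/user/alice/log.txt", ["b1", "b2"])
nn.block_report("dn1", ["b1", "b3"])
nn.block_report("dn2", ["b1", "b2"])
nn.block_report("dn3", ["b2"])
print(nn.locate("/user/alice/log.txt"))
```

Keeping the block map separate from the namespace is what later slides build on: it is the part that Federation generalizes into block pools.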
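    • HDFS client reads and writes
        – Read: 1. the client opens the file at the Namenode; 2. it reads the blocks directly from the Datanodes
        – Write: 1. the client creates the file at the Namenode; 2. it writes the blocks to Datanodes, which pass each write on to further Datanodes
        – End-to-end checksums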
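End-to-end checksumming means the reader verifies data against checksums computed when the data was written, so a corrupt replica is detected at read time and another replica can be used. A minimal sketch, assuming one CRC-32 per fixed 512-byte chunk; the chunk size and the function names `checksums` and `read_block` are assumptions for illustration, not HDFS's exact on-disk format or client code.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum (assumed value for this sketch)


def checksums(data: bytes) -> list[int]:
    # Writer side: one CRC-32 per fixed-size chunk, stored alongside the block.
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def read_block(replicas: dict[str, bytes], expected: list[int]) -> tuple[str, bytes]:
    # Reader side: try replicas in turn and accept the first one whose chunk
    # checksums match those recorded by the writer.
    for datanode, data in replicas.items():
        if checksums(data) == expected:
            return datanode, data
    raise OSError("no replica passed checksum verification")


original = b"x" * 1300
expected = checksums(original)
corrupted = original[:700] + b"y" + original[701:]  # one flipped byte in chunk 2
replicas = {"dn1": corrupted, "dn2": original}
print(read_block(replicas, expected)[0])  # dn2
```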
    • HDFS architecture: computation close to the data
        [Figure: a Hadoop cluster in which Blocks 1, 2 and 3 each have three replicas on different nodes; a MAP task runs on a node holding each block, and a Reduce step combines the map outputs into Results.]
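The point of this slide is that map tasks are sent to nodes that already store a replica of their input block. A minimal greedy sketch of that placement rule follows; the function name `schedule_maps`, the per-node slot counts and the fall-back to any free node are assumptions for illustration, not the Job Tracker's actual scheduler.

```python
def schedule_maps(block_locations: dict[str, list[str]],
                  free_slots: dict[str, int]) -> dict[str, tuple[str, bool]]:
    """Assign one map task per block, preferring a node that stores a replica.

    Returns block -> (node, is_data_local). Hypothetical greedy policy.
    """
    slots = dict(free_slots)
    plan: dict[str, tuple[str, bool]] = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder is free: run elsewhere and read over the network.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError(f"no free slot for {block}")
            is_local = False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan


blocks = {"block1": ["n1", "n2", "n3"],
          "block2": ["n2", "n3", "n4"],
          "block3": ["n1", "n3", "n4"]}
slots = {"n1": 1, "n2": 1, "n3": 0, "n4": 0, "n5": 1}
print(schedule_maps(blocks, slots))
```

Here block1 and block2 run data-local, while block3's holders are all busy, so it lands on n5 and its data has to cross the network.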
    • Quiz: What Is the Common Attribute?
    • HDFS actively maintains data reliability
        – Datanodes periodically check block checksums
        – When a block replica is bad or lost:
          1. the Namenode tells a Datanode holding a good replica to replicate the block
          2. that Datanode copies the block to another Datanode
          3. the receiving Datanode reports blockReceived to the Namenode
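The repair loop above can be sketched as Namenode-side bookkeeping: drop the bad replica from the block map, issue "replicate" commands for under-replicated blocks, and record the new replica on "blockReceived". This is a toy model; the class name `ToyReplicationMonitor`, the lowest-name source choice and the target-selection order are assumptions, and the target of 3 replicas is HDFS's default used here as an assumption.

```python
REPLICATION = 3  # target replica count (HDFS's default; assumed for this sketch)


class ToyReplicationMonitor:
    """Hypothetical Namenode-side bookkeeping for the repair loop on this slide."""

    def __init__(self, block_map: dict[str, list[str]], datanodes: list[str]) -> None:
        self.block_map = {b: set(h) for b, h in block_map.items()}
        self.bad: dict[str, set[str]] = {b: set() for b in block_map}
        self.datanodes = list(datanodes)

    def report_bad_replica(self, block: str, datanode: str) -> None:
        # A Datanode's periodic checksum scan found a corrupt or missing replica.
        self.block_map[block].discard(datanode)
        self.bad[block].add(datanode)

    def replication_commands(self) -> list[tuple[str, str, str]]:
        # Step 1 ("replicate"): for each under-replicated block, ask a healthy
        # holder (source) to copy it to nodes with neither a good nor a known-bad
        # replica. A block with no healthy replica left cannot be repaired.
        commands = []
        for block, holders in sorted(self.block_map.items()):
            missing = REPLICATION - len(holders)
            if missing <= 0 or not holders:
                continue
            source = sorted(holders)[0]
            excluded = holders | self.bad[block]
            targets = [d for d in self.datanodes if d not in excluded][:missing]
            commands.extend((block, source, t) for t in targets)
        return commands

    def block_received(self, block: str, datanode: str) -> None:
        # Step 3 ("blockReceived"): the target confirms the copy made in step 2.
        self.block_map[block].add(datanode)


monitor = ToyReplicationMonitor(
    {"b1": ["dn1", "dn2", "dn3"], "b2": ["dn2", "dn3", "dn4"]},
    ["dn1", "dn2", "dn3", "dn4", "dn5"],
)
monitor.report_bad_replica("b1", "dn2")
print(monitor.replication_commands())  # [('b1', 'dn1', 'dn4')]
monitor.block_received("b1", "dn4")
print(monitor.replication_commands())  # []
```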
    • Hadoop at Yahoo!
        – Total nodes = 43,936 (over 43,000 nodes running Hadoop)
        – Total storage = 206 PB
        – 1M+ monthly Hadoop jobs
        – Availability SLA (%): Sandbox 99.69, Research 99.47, Production 99.85
        – Nodes by cluster type: Research 22,334; Production 13,687; Sandbox 7,803
        [Figure: nodes running Hadoop at Yahoo!, by quarter, 2006 Q1 to 2010 Q3]
    • Scaling Hadoop Early Gains • Simple design allowed rapid improvements • Namespace is all in RAM, simpler locking • Improved memory usage in 0.16, JVM Heap configuration (Suresh Srinivas) Growth of number of files and storage is limited by adding RAM to namenode • 50G heap = 200M “fs objects” = 100M names + 100MBlocks • 14PB of storage (50MB blocksize) • 4K nodes - Job Tracker carries out both job lifecycle management and scheduling Yahoo’s Response: • HDFS Federation: horizontal scaling of namespace (0.22) • Next Generation of Map-Reduce - Complete overhaul of job tracker/task tracker Goal: • Clusters of 6000 nodes, 100,000 cores & 10k concurrent jobs, 100 PB raw storage per cluster 10 6 May 2010
    • Scaling the Name Service: Options (not to scale)
        [Figure: options plotted by number of names (100M, 200M, 1B, 2B, 10B, 20B) against number of clients (1x, 4x, 20x, 50x, 100x): all NS in memory; archives; partial NS (cache) in memory; separate block maps from the NN; multiple namespace volumes; partial NS in memory with namespace volumes; distributed NNs. Annotations: “Good isolation properties”; “Block-reports for billions of blocks requires rethinking the block layer”.]
    • Opportunity: vertical and horizontal scaling
        Vertical scaling of the Namenode
        – More RAM, efficiency in memory usage
        – First-class archives (tar/zip-like)
        – Partial namespace in main memory
        Horizontal scaling: Federation
        Horizontal scaling/federation benefits:
        – Scale
        – Isolation, stability, availability
        – Flexibility
        – Other Namenode implementations or non-HDFS namespaces
    • Block (object) storage subsystem
        – Shared storage provided as pools of blocks
        – Namespaces (HDFS, others) use one or more block pools
        – Note: HDFS has 2 layers today; we are generalizing/extending it.
        [Figure: namespaces NS1 … NS k … Foreign NS n, each using its own block pools (Pools 1 … Pools k … Pools n); the block storage layer below spans Datanode 1, Datanode 2 … Datanode m, with a Balancer.]
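Because several namespaces share the same Datanodes, a block is identified by its pool as well as its block ID; two pools can then reuse the same block ID without colliding. A toy sketch of Datanode storage keyed that way; `PoolBlock`, `ToyDatanode` and the pool names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PoolBlock:
    pool_id: str   # the block pool (and hence the namespace) that owns the block
    block_id: int  # unique only within its pool


@dataclass
class ToyDatanode:
    name: str
    # Shared physical storage: blocks from every pool are stored side by side.
    blocks: dict[PoolBlock, bytes] = field(default_factory=dict)

    def store(self, block: PoolBlock, data: bytes) -> None:
        self.blocks[block] = data

    def pools(self) -> set[str]:
        return {b.pool_id for b in self.blocks}

    def blocks_in_pool(self, pool_id: str) -> list[int]:
        return sorted(b.block_id for b in self.blocks if b.pool_id == pool_id)


dn = ToyDatanode("datanode-1")
dn.store(PoolBlock("pool-ns1", 1001), b"from NS1")
dn.store(PoolBlock("pool-nsk", 1001), b"from NS k")  # same block ID, other pool
dn.store(PoolBlock("pool-ns1", 1002), b"more NS1 data")
print(sorted(dn.pools()), dn.blocks_in_pool("pool-ns1"))
```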
    • 1st phase: block-pool management inside the Namenode
        [Figure: NN-1 … NN-k … NN-n serve NS1 … NS k … Foreign NS n, and each also manages its own block pool (Pools 1 … Pools k … Pools n) stored across Datanode 1, Datanode 2 … Datanode m, with a Balancer.]
        – Future: move block-pool management into separate nodes
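In this first phase each Namenode manages exactly one block pool, so a Datanode that stores blocks for several pools has to report each pool's blocks to the Namenode that owns it. A minimal sketch of splitting one Datanode's contents into per-Namenode block reports; the function name and the `pool_owner` mapping are hypothetical.

```python
def split_block_report(stored: list[tuple[str, int]],
                       pool_owner: dict[str, str]) -> dict[str, list[int]]:
    """Group a Datanode's (pool ID, block ID) pairs into one report per Namenode.

    Phase 1 model: each Namenode owns one block pool, so each Namenode only
    receives blocks from its own pool. Every Namenode gets a report, even if
    it is empty.
    """
    reports: dict[str, list[int]] = {nn: [] for nn in pool_owner.values()}
    for pool_id, block_id in stored:
        reports[pool_owner[pool_id]].append(block_id)
    return {nn: sorted(ids) for nn, ids in reports.items()}


pool_owner = {"pool-1": "NN-1", "pool-k": "NN-k", "pool-n": "NN-n"}
stored = [("pool-1", 7), ("pool-k", 3), ("pool-1", 2), ("pool-n", 7)]
print(split_block_report(stored, pool_owner))
```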
    • Future: move block management out
        – Block management is easier to scale horizontally than the name server
        – Client read path: 1. open at the name server (NS1 … NS k … Foreign NS n); 2. getBlockLocations from the Block Manager, which manages the block pools; 3. readBlock from a Datanode (Datanode 1, Datanode 2 … Datanode m, with a Balancer)
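The numbered read path on this slide (open, getBlockLocations, readBlock) can be sketched with the namespace and the block manager as separate services. The classes `NameServer` and `BlockManager`, the method name `get_block_locations` and the in-memory Datanode dictionary are hypothetical stand-ins; only the three step names come from the slide.

```python
class NameServer:
    """Hypothetical name server that keeps only the namespace: path -> block IDs."""

    def __init__(self, files: dict[str, list[str]]) -> None:
        self.files = files

    def open(self, path: str) -> list[str]:
        return self.files[path]


class BlockManager:
    """Hypothetical separate block manager: block ID -> Datanodes holding it."""

    def __init__(self, locations: dict[str, list[str]]) -> None:
        self.locations = locations

    def get_block_locations(self, block_ids: list[str]) -> dict[str, list[str]]:
        return {b: self.locations[b] for b in block_ids}


def read_file(path: str, name_server: NameServer, block_manager: BlockManager,
              datanodes: dict[str, dict[str, bytes]]) -> bytes:
    block_ids = name_server.open(path)                         # 1. open
    locations = block_manager.get_block_locations(block_ids)  # 2. getBlockLocations
    # 3. readBlock: fetch each block from the first Datanode listed for it.
    return b"".join(datanodes[locations[b][0]][b] for b in block_ids)


ns = NameServer({"/data/f": ["b1", "b2"]})
bm = BlockManager({"b1": ["dn2", "dn1"], "b2": ["dn3"]})
dns = {"dn1": {"b1": b"hello "}, "dn2": {"b1": b"hello "}, "dn3": {"b2": b"world"}}
print(read_file("/data/f", ns, bm, dns))  # b'hello world'
```

Since the name server no longer holds block locations, it only has to scale with the number of names, while the block manager can be scaled separately.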
    • What is an HDFS cluster?
        Current
        – 1 namespace
        – A set of blocks
        – Implemented as 1 Namenode and a set of DNs
        New
        – N namespaces
        – A set of block pools; each block pool is a set of blocks
          – Phase 1: 1 BP per NS, which implies N block pools
        – Implemented as N Namenodes and a set of DNs
          – Each DN stores the blocks for each block pool
    • Managing namespaces
        – Federation has multiple namespaces: don’t you need a single global namespace?
          – The key is to share the data and the names used to access the shared data
        – A global namespace is one way to do that, but even there we talk of several large “global” namespaces
        – A client-side mount table is another way to share
          – Shared mount table => “global” shared view
          – Personalized mount table => per-application view
        – Share the data that matters by mounting it
        [Figure: a client-side mount table with mount points project, home, tmp and data.]
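A client-side mount table maps path prefixes to namespaces, and the client picks the longest mount point that covers a path. A minimal sketch of that resolution; the function `resolve`, the namespace names and the nested `/data/shared` mount are hypothetical, chosen only to show longest-prefix matching on whole path components.

```python
from posixpath import normpath


def resolve(mount_table: dict[str, str], path: str) -> tuple[str, str]:
    """Map a client path to (namespace, path inside that namespace).

    The longest matching mount point wins; matching is on whole path
    components, so /data does not cover /datasets.
    """
    path = normpath(path)
    best = None
    for mount_point in mount_table:
        if path == mount_point or path.startswith(mount_point.rstrip("/") + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise FileNotFoundError(f"no mount point covers {path}")
    return mount_table[best], "/" + path[len(best):].lstrip("/")


table = {"/project": "ns-project", "/home": "ns-home", "/tmp": "ns-tmp",
         "/data": "ns-data", "/data/shared": "ns-shared"}
print(resolve(table, "/data/shared/x.csv"))  # ('ns-shared', '/x.csv')
print(resolve(table, "/home/alice/"))        # ('ns-home', '/alice')
```

A shared table passed to every client gives the “global” view; an application can pass its own dictionary for a personalized view.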
    • HDFS Federation across clusters
        [Figure: an application mount table in Cluster 1 and an application mount table in Cluster 2, each with mount points home, tmp, data and project, mounting namespaces from both clusters.]
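Across clusters, a mount-table entry can point at a namespace in another cluster, so an application sees one tree while its data lives in two clusters. A minimal sketch that rewrites application paths into fully qualified URIs; for illustration we assume the application in Cluster 1 mounts `/data` from Cluster 2, and the host names and the function `to_uri` are hypothetical.

```python
def to_uri(mount_table: dict[str, str], path: str) -> str:
    """Rewrite an application path into a fully qualified URI.

    Mount points in this sketch are top-level directories only.
    """
    top, _, rest = path.lstrip("/").partition("/")
    target = mount_table["/" + top]  # KeyError if nothing is mounted there
    return target + ("/" + rest if rest else "")


# Hypothetical per-application mount table for an application in Cluster 1.
app_in_cluster1 = {
    "/home": "hdfs://nn-home.cluster1/home",
    "/tmp": "hdfs://nn-tmp.cluster1/tmp",
    "/project": "hdfs://nn-project.cluster1/project",
    "/data": "hdfs://nn-data.cluster2/data",  # shared data from the other cluster
}
print(to_uri(app_in_cluster1, "/data/logs/2011"))
print(to_uri(app_in_cluster1, "/home"))
```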
    • Nameserver as a container for namespaces
        – Each namespace has its own separate state
        – Persistent state in shared storage (e.g. BookKeeper)
        – Each nameserver serves a set of namespaces
          – Selected based on isolation and capacity
        – A namespace can be moved between nameservers
        [Figure: several Nameservers above shared persistent storage for namespace metadata (e.g. BookKeeper).]
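Because each namespace's persistent state lives in shared storage, assigning a namespace to a nameserver is a placement decision, and moving it is a reassignment rather than a copy. A toy sketch of placement by isolation and capacity; the class `ToyNameserverPool`, the server names and the simplification that capacity is a count of namespaces are all assumptions.

```python
class ToyNameserverPool:
    """Hypothetical placement of namespaces onto nameservers."""

    def __init__(self, capacities: dict[str, int]) -> None:
        # Simplification: capacity = number of namespaces a nameserver may serve.
        self.capacity = dict(capacities)
        self.assignment: dict[str, str] = {}  # namespace -> nameserver

    def load(self, server: str) -> int:
        return sum(1 for s in self.assignment.values() if s == server)

    def place(self, namespace: str, avoid: frozenset[str] = frozenset()) -> str:
        # Isolation: skip nameservers in `avoid`; capacity: skip full ones;
        # then pick the least-loaded candidate (ties broken by name).
        candidates = [s for s in self.capacity
                      if s not in avoid and self.load(s) < self.capacity[s]]
        if not candidates:
            raise RuntimeError(f"no nameserver can host {namespace}")
        server = min(candidates, key=lambda s: (self.load(s), s))
        self.assignment[namespace] = server
        return server

    def move(self, namespace: str, target: str) -> None:
        # The state is in shared storage, so a move only changes which
        # nameserver serves the namespace; loading that state is not modelled.
        if self.assignment.get(namespace) == target:
            return
        if self.load(target) >= self.capacity[target]:
            raise RuntimeError(f"{target} is full")
        self.assignment[namespace] = target


pool = ToyNameserverPool({"server-a": 2, "server-b": 2})
pool.place("/user")
pool.place("/tmp")
pool.place("/projects", avoid=frozenset({"server-b"}))
pool.move("/user", "server-b")
print(pool.assignment)
```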
    • Summary Federated HDFS (Jira HDFS-1052) • Scale by adding independent Namenodes • Preserves the robustness of the Namenodes • Not much code change to the Namenode • Generalizes the Block storage layer • Analogous to Sans & Luns • Can add other implementations of the Namenodes • Even other name services (HBase?) • Could move the Block management out of the Namenode in the future • But to truly scale to 10s or 100s Bilions of blocks we need to rethink the block map and block reports • Benefits • Scale number of file names and blocks • Improved isolation and hence availability 20 6 May 2010
    • Q&A