Dynamic Namespace Partitioning with Giraffa File System

  • 1. Dynamic Namespace Partitioning with the Giraffa File System. Konstantin V. Shvachko (Founder, AltoScale) and Plamen Jeliazkov (UC San Diego). Hadoop Summit 2012, June 14, 2012.
  • 2. Introduction. Plamen: fresh graduate from UCSD, internship with the Hadoop Platform Team at eBay, wrote the Giraffa prototype. Konstantin: founder of AltoScale, Apache Hadoop PMC member, works on HDFS scalability. Primary focus: 1. AltoScale Workbench, a Hadoop & HBase cluster on a public or a private cloud; 2. Giraffa.
  • 3. Contents: Background; Motivation; Architecture; Main problems and solutions (Bootstrapping, Namespace Partitioning, Rename).
  • 4. Giraffa is a distributed, highly available file system. It utilizes features of HDFS and HBase. It is a new open source project in the experimental stage.
  • 5. Origin: the giraffe. Latin: Giraffa camelopardalis (family Giraffidae, genus Giraffa, species Giraffa camelopardalis). Other languages: Arabic Zarafa, Spanish Jirafa, Bulgarian жирафа, Italian Giraffa. A favorite of my daughter, as the Hadoop traditions require.
  • 6. Apache Hadoop: a reliable, scalable, high-performance distributed computing system. The Hadoop Distributed File System (HDFS) is the reliable storage layer. MapReduce is the distributed computation framework, with a simple computational model. Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers.
  • 7. The design principles. Linear scalability: more nodes can do more work within the same time, on both data size and compute resources. Reliability and availability: a drive fails on average once in 3 years, so its probability of failing today is about 1/1000, and several drives fail every day on a cluster with thousands of drives. Move computation to data: minimize expensive data transfers. Sequential data processing: avoid random reads.
  • 8. Collocated Hadoop clusters. HDFS, a distributed file system: the NameNode handles namespace and block management, and DataNodes are the block replica containers. MapReduce, a framework for distributed computations: the JobTracker does job scheduling, resource management, and lifecycle coordination, and the TaskTracker is the task execution module. Each worker node runs a DataNode and a TaskTracker, with a single NameNode and JobTracker above them.
  • 9. Hadoop Distributed File System. The namespace is a hierarchy of files and directories; files are divided into large blocks (128 MB). Namespace (metadata) is decoupled from data: namespace operations are fast, not slowed down by data streaming, and data streams directly from the source storage. A single NameNode keeps the entire namespace in RAM. DataNodes store block replicas as files on local drives; blocks are replicated on 3 DataNodes for redundancy and availability. The HDFS client is the point of entry to HDFS: it contacts the NameNode for metadata and serves data to applications directly from DataNodes.
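    A minimal sketch of that read path against the standard FileSystem API (the cluster address and file path below are hypothetical): the metadata call goes to the NameNode, the bytes come straight from DataNodes.

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Metadata call: the client asks the NameNode where the blocks live.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        // Data call: bytes are streamed directly from the DataNodes holding the replicas.
        InputStream in = fs.open(new Path("/user/demo/input.txt"));
        try {
          byte[] buf = new byte[4096];
          int n;
          while ((n = in.read(buf)) > 0) {
            System.out.write(buf, 0, n);
          }
        } finally {
          in.close();
        }
      }
    }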
  • 10. Scalability limits. The single-master architecture is a constraining resource. NameNode space limit: 100 million files and 200 million blocks fit in 64 GB of RAM, which restricts storage capacity to about 20 PB; the small-file problem makes it worse, as the block-to-file ratio keeps shrinking. The single NameNode also limits linear performance growth: a handful of clients can saturate it. The MapReduce framework scalability limit is about 40,000 concurrent clients, which corresponds to a 4,000-node cluster with 10 MapReduce slots per node. See “HDFS Scalability: The limits to growth”, USENIX ;login:, 2010.
  • 11. Horizontal to vertical scaling. Horizontal scaling is limited by the single-master architecture. Compute power and storage density grow naturally, so clusters are composed of ever more powerful servers. Vertical scaling leads to shrinking cluster sizes, while storage capacity, compute power, and cost remain constant.
  • 12. Shrinking clusters (chart: resources per node, i.e. cores, disks, RAM, versus cluster size in nodes): Yahoo! ran a 4,000-node cluster in 2008, Facebook 2,000 nodes in 2010, eBay 1,000 nodes in 2011, and a cluster of 500 nodes is projected for 2013.
  • 13. Scalability for Hadoop 2.0. HDFS Federation: independent NameNodes share a common pool of DataNodes, the cluster is a family of volumes with a shared block storage layer, the user sees volumes as isolated file systems, and ViewFS provides the client-side mount table (see the sketch below). YARN, the new MapReduce framework: dynamic partitioning of cluster resources with no fixed slots, and separation of JobTracker functions: 1. job scheduling and resource allocation stay centralized; 2. job monitoring and job life-cycle coordination are decentralized, delegating coordination of different jobs to other nodes.
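    A client-side mount table can be expressed directly in the Hadoop configuration. A minimal sketch, assuming two federated NameNodes nn1 and nn2 and the mount-table name "cluster1" (all names here are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    // The client sees one namespace "cluster1" composed of two volumes.
    conf.set("fs.defaultFS", "viewfs://cluster1/");
    conf.set("fs.viewfs.mounttable.cluster1.link./user", "hdfs://nn1:8020/user");
    conf.set("fs.viewfs.mounttable.cluster1.link./data", "hdfs://nn2:8020/data");

    FileSystem viewFs = FileSystem.get(conf);
    // "/user/..." resolves to nn1, "/data/..." resolves to nn2.
    viewFs.listStatus(new Path("/user"));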
  • 14. Namespace partitioning. Static (Federation): directory sub-trees are statically assigned to disjoint volumes, relocating sub-trees without copying is challenging, and it scales about x10, to billions of files. Dynamic: files and directory sub-trees can move automatically between nodes based on their utilization or load balancing requirements, files can be relocated without copying data blocks, and it scales about x100, to hundreds of billions of files. The two are orthogonal, independent approaches; a federation of distributed namespaces is possible.
  • 15. Distributed namespaces today. Ceph: metadata stored on OSDs, MDSs cache metadata, dynamic partitioning. Lustre: plans to release a distributed namespace in 2.4; code is ready. Colossus (Google, per S. Quinlan and J. Dean): 100 million files per metadata server, hundreds of servers. VoldFS, CassandraFS, KTHFS (MySQL): prototypes.
  • 16. HBase overview. A table is big, sparse, and loosely structured: a collection of rows sorted by row key, where rows can have an arbitrary number of columns. A table is split horizontally into regions (dynamic table partitioning!), and RegionServers serve regions to applications. Columns are grouped into column families, the vertical partition of a table. Distributed cache: regions are loaded into the nodes' RAM for real-time access to data.
  • 17. HBase API. HBaseAdmin handles administrative functions: create, delete, and list tables; create, update, and delete columns and column families; split, compact, flush. HTable accesses table data: Result HTable.get(Get g) // get cells of a row; void HTable.put(Put p) // update a row; void HTable.put(Put[] p) // batch update of rows; void HTable.delete(Delete d) // delete cells/row; ResultScanner getScanner(family) // scan a column family. Coprocessors are custom actions triggered by update events, like database triggers and stored procedures. A minimal example follows below.
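    Put together, a minimal sketch of these calls against the old-style client API (the table, row, and column names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "demo");

    // Update a row: one cell in column family "cf".
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    table.put(put);

    // Get the cells of the same row back.
    Result result = table.get(new Get(Bytes.toBytes("row-1")));
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));

    // Scan the column family across rows.
    ResultScanner scanner = table.getScanner(Bytes.toBytes("cf"));
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();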
  • 18. HBase architecture (diagram).
  • 19. Giraffa File System. HDFS + HBase = Giraffa. Goal: build from existing building blocks and minimize changes to existing components. 1. Store file and directory metadata in an HBase table, with dynamic table partitioning into regions, cached in RegionServer RAM for fast access. 2. Store file data in HDFS DataNodes, which handle data streaming. 3. Block management: handle communication with DataNodes and perform block replication.
  • 20. Giraffa requirements: more files and more data; availability (load balancing of metadata traffic, the same data streaming speed to and from DataNodes, no SPOF); cluster operability and management, with the cost of running larger clusters the same as for smaller ones. Targets:

                            HDFS          Federated HDFS   Giraffa
      Space                 25 PB         120 PB           1 EB (1000 PB)
      Files + blocks        200 million   1 billion        100 billion
      Concurrent clients    40,000        100,000          1 million
  • 21. FAQ: Why HDFS and HBase? Building a new file system from scratch is really hard and takes years. HDFS is reliable, scalable block storage with efficient data streaming and automatic data recovery. HBase is a natural metadata service: a distributed cache with dynamic partitioning and automatic metadata recovery. They are the same breed and should be “compatible”: HBase stores its data in HDFS, so data and metadata share the same storage.
  • 22. FAQ: Why not store whole files in HBase tables? It defeats the main concept of distributed file systems: the decoupling of data and metadata. Small files can be stored as rows, but row size is limited by region size, so large files must be split. It is technically possible to split any information into rows (log files into events, video files into frames, random bits into 1 KB blobs with an offset as the row key), but that is a different level of abstraction and requires data conversion.
  • 23. FAQ: My dataset is only 1 PB; do I still need Giraffa? Availability: distributed access to the namespace for many concurrent clients, not bottlenecked by single-NameNode performance. “Small files”: the block-to-file ratio is decreasing (2 -> 1.5 -> 1.2), and there is no need to aggregate small files into large archives.
  • 24. Building blocks. A single table called “Namespace” stores, per file: the file ID (row key) and file attributes (name, replication, block size, permissions, times), the list of blocks, and the block locations. The Giraffa client is a FileSystem implementation that obtains metadata from HBase and exchanges data with DataNodes. The block manager maintains the flat namespace of blocks: block allocation, replication, and removal; DataNode management; and storage for the HBase table.
  • 25. Giraffa architecture. The Namespace table in HBase holds, per file: path, attrs, block[], DN[][], and the BM node. 1. The Giraffa client gets files and blocks from HBase through the NamespaceAgent. 2. The NamespaceAgent may directly query the Block Manager in the block management layer. 3. The client streams data to or from DataNodes.
  • 26. Namespace table. Row keys identify files and directories as rows in the table; different key definitions are possible based on locality requirements, the key definition is chosen when the file system is formatted, and the full-path key is the default. Columns hold: file attributes (local name, owner, group, permissions, access time, modification time, block size, replication, isDir, length); the list of blocks of a file, persisted in the table; and the list of block locations for each block, not persisted but discovered from block reports. A directory table maps each directory entry name to the corresponding row key. An illustrative row layout is sketched below.
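    As an illustration only (the actual Giraffa schema may differ; the table name comes from the slide above, while the column family "ns" and the qualifiers are made up), a file entry keyed by its full path could be written like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable namespaceTable = new HTable(conf, "Namespace");

    // Full-path key: the row key is the file's absolute path.
    Put row = new Put(Bytes.toBytes("/user/plamen/data/part-0001"));
    // Hypothetical column family "ns" holding the file attributes.
    row.add(Bytes.toBytes("ns"), Bytes.toBytes("name"),        Bytes.toBytes("part-0001"));
    row.add(Bytes.toBytes("ns"), Bytes.toBytes("replication"), Bytes.toBytes(3));
    row.add(Bytes.toBytes("ns"), Bytes.toBytes("blockSize"),   Bytes.toBytes(128L * 1024 * 1024));
    row.add(Bytes.toBytes("ns"), Bytes.toBytes("isDir"),       Bytes.toBytes(false));
    // The serialized block list is persisted; block locations are not (they come from block reports).
    byte[] serializedBlockList = new byte[0];  // placeholder for the real serialization
    row.add(Bytes.toBytes("ns"), Bytes.toBytes("blocks"),      serializedBlockList);

    namespaceTable.put(row);
    namespaceTable.close();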
  • 27. Giraffa client. GiraffaFileSystem implements FileSystem: fs.defaultFS = grfa:///, fs.grfa.impl = o.a.giraffa.GiraffaFileSystem. GiraffaClient extends DFSClient, with the NamespaceAgent replacing the NameNode RPC: GiraffaFileSystem uses GiraffaClient (a DFSClient), which talks to the Namespace table through the NamespaceAgent and directly to DataNodes for data.
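    With those two properties set, applications reach Giraffa through the ordinary FileSystem API. A minimal sketch (the scheme and properties are taken from the slide above; the full package name is an expansion of the slide's "o.a.giraffa" abbreviation and the paths are hypothetical):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "grfa:///");
    conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem");

    // The client resolves metadata through the NamespaceAgent (HBase)
    // and streams data to/from DataNodes, much as a DFSClient would.
    FileSystem grfa = FileSystem.get(URI.create("grfa:///"), conf);
    grfa.mkdirs(new Path("/user/demo"));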
  • 28. Block management. The Block Manager handles block allocation, deletion, and replication. The DataNode Manager processes DataNode block reports and heartbeats and identifies lost nodes. The layer also provides storage for the HBase table: a small file system to store HFiles. A BMServer is paired on the same node with a RegionServer, forming a distributed cluster of BMServers with mostly local communication between Region and BM servers. The NameNode is an initial implementation of the BMServer; a Giraffa block is a single block file with the same name as the block id.
  • 29. Three problems. Bootstrapping: HBase stores tables as files in HDFS. Namespace partitioning: retain locality. Atomic renames.
  • 30. Bootstrapping. The Block Manager Server hosts two volumes. The HBase Volume (the .log and the hbase/ and giraffa/ regions) keeps the table layout and sees rare updates. The Block Volume is a flat namespace of blocks, e.g. blk_123_001 on dn-1, dn-2, dn-3; blk_234_002 on dn-11, dn-12, dn-13; blk_345_003 on dn-101, dn-102, dn-103.
  • 31. Locality of reference. Row keys define the sorting of files and directories in the table: the tree-structured namespace is flattened into a linear array, and the ordered list of files is self-partitioned into regions. The goal is to retain locality in the linearized structure: files in the same directory are adjacent in the table and belong to the same region, with some exclusions, and files of the same directory should be on the same node, to avoid jumping across regions for a simple “ls” (see the scan sketch below).
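    Because rows sort by full path, a listing becomes a contiguous range scan over the Namespace table. A hedged sketch (the table name is from the deck, the directory path is hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    HTable namespaceTable = new HTable(HBaseConfiguration.create(), "Namespace");

    // With full-path keys the whole subtree under /user/plamen/ is one contiguous
    // key range, so a listing is a single range scan, typically served by one region.
    byte[] start = Bytes.toBytes("/user/plamen/");
    byte[] stop  = Bytes.toBytes("/user/plamen0");  // '0' is the next byte after '/'

    ResultScanner children = namespaceTable.getScanner(new Scan(start, stop));
    for (Result child : children) {
      System.out.println(Bytes.toString(child.getRow()));
    }
    children.close();
    namespaceTable.close();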
  • 32. Partitioning example 1: straightforward partitioning based on random hashing (diagram of a directory tree whose file ids are hashed across tables T1 through T4).
  • 33. Partitioning example 2: partitioning based on lexicographic full-path ordering, the default (diagram of the same tree split across tables T1 through T4 by sorted full paths).
  • 34. Partitioning example 3: partitioning based on fixed-depth neighborhoods (diagram of the same tree split across tables T1 through T4 by neighborhoods of fixed depth).
  • 35. Atomic rename. Giraffa will implement an atomic in-place rename, with no support for an atomic file move from one directory to another. A move can then be implemented at the application level: non-atomically move the target file from the source directory to a temporary file in the target directory, atomically rename the temporary file to its original name, and on failure use a simple 3-step recovery procedure. Eventually atomic moves may be implemented, using PAXOS or simplified synchronization algorithms.
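    One possible application-level reading of that procedure, sketched against the FileSystem API (the copy in step 1 stands in for whatever non-atomic relocation the file system offers; the method itself and its paths are hypothetical):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Move src into dstDir using only a same-directory (in-place) atomic rename
    // plus one cross-directory step that is allowed to be non-atomic.
    void move(FileSystem fs, Path src, Path dstDir) throws java.io.IOException {
      Path tmp = new Path(dstDir, "." + src.getName() + ".tmp");

      // Step 1: non-atomically place the file as a temporary entry in the target directory.
      FileUtil.copy(fs, src, fs, tmp, false, fs.getConf());

      // Step 2: atomic in-place rename of the temporary file to its final name.
      fs.rename(tmp, new Path(dstDir, src.getName()));

      // Step 3: remove the original; on failure, recovery re-runs from the last completed step.
      fs.delete(src, false);
    }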
  • 36. History. (2008) Idea; study of distributed systems (AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …); partitioning of the namespace, 4 types of partitioning. (2009) Study on scalability limits; NameNode optimization. (2010) Design with Michael Stack; presentation at the HDFS contributors meeting. (2011) Plamen implements the proof of concept. (2012) Rewrite open sourced as an Apache Extras project.
  • 37. Status: design stage; a one-node cluster is running; live demo with Plamen.
  • 38. Thank you!