Dynamic Namespace Partitioning with Giraffa File System

  1. Dynamic Namespace Partitioning with the Giraffa File System
     Konstantin V. Shvachko (Founder, AltoScale) and Plamen Jeliazkov (UC San Diego)
     Hadoop Summit 2012, June 14, 2012
  2. Introduction
     - Plamen: recent graduate of UCSD; internship with the Hadoop Platform Team at eBay; wrote the Giraffa prototype
     - Konstantin: founder of AltoScale; Apache Hadoop PMC member; works on HDFS scalability. Primary focus:
       1. AltoScale Workbench: Hadoop & HBase clusters on a public or a private cloud
       2. Giraffa
  3. Contents
     - Background
     - Motivation
     - Architecture
     - Main problems and solutions: bootstrapping, namespace partitioning, rename
  4. Giraffa
     - Giraffa is a distributed, highly available file system
     - Utilizes features of HDFS and HBase
     - New open source project in an experimental stage
  5. Origin: Giraffe
     - Latin: Giraffa camelopardalis (Family Giraffidae, Genus Giraffa, Species Giraffa camelopardalis)
     - Other languages: Arabic Zarafa, Spanish Jirafa, Bulgarian жирафа, Italian Giraffa
     - A favorite of my daughter
       o As the Hadoop traditions require
  6. Apache Hadoop
     - A reliable, scalable, high-performance distributed computing system
     - The Hadoop Distributed File System (HDFS): reliable storage layer
     - MapReduce: distributed computation framework with a simple computational model
     - Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers
  7. The Design Principles
     - Linear scalability: more nodes can do more work within the same time, in both data size and compute resources
     - Reliability and availability: a drive fails about once in 3 years, so its probability of failing today is roughly 1/1000; on a cluster with thousands of drives, several fail every day
     - Move computation to data: minimize expensive data transfers
     - Sequential data processing: avoid random reads
  8. Collocated Hadoop Clusters
     - HDFS, a distributed file system
       o NameNode: namespace and block management
       o DataNodes: block replica containers
     - MapReduce, a framework for distributed computations
       o JobTracker: job scheduling, resource management, lifecycle coordination
       o TaskTracker: task execution module
     - [Diagram: a NameNode and a JobTracker managing worker nodes, each running a TaskTracker and a DataNode]
  9. Hadoop Distributed File System
     - The namespace is a hierarchy of files and directories
       o Files are divided into large blocks (128 MB)
     - Namespace (metadata) is decoupled from data
       o Fast namespace operations, not slowed down by data streaming
       o Direct data streaming from the source storage
     - A single NameNode keeps the entire namespace in RAM
     - DataNodes store block replicas as files on local drives
       o Blocks are replicated on 3 DataNodes for redundancy & availability
     - HDFS client: the point of entry to HDFS (a usage sketch follows this slide)
       o Contacts the NameNode for metadata
       o Serves data to applications directly from DataNodes
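     To make the client flow concrete, here is a minimal sketch of reading a file through the standard Hadoop FileSystem API; the NameNode lookup and DataNode streaming happen behind these calls. The cluster address and path are hypothetical.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsReadExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);

            // Opening the file asks the NameNode for the block list and locations (metadata only).
            FSDataInputStream in = fs.open(new Path("/user/plamen/file1"));

            // Reads then stream block data directly from the DataNodes that hold the replicas.
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
              System.out.write(buf, 0, n);
            }
            in.close();
            fs.close();
          }
        }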
 10. Scalability Limits
     - Single-master architecture: a constraining resource
     - NameNode space limit
       o 100 million files and 200 million blocks with 64 GB of RAM
       o Restricts storage capacity to about 20 PB (a back-of-the-envelope check follows this slide)
       o Small file problem: the block-to-file ratio is shrinking
     - A single NameNode limits linear performance growth
       o A handful of clients can saturate the NameNode
     - MapReduce framework scalability limit: 40,000 clients
       o Corresponds to a 4,000-node cluster with 10 MapReduce slots per node
     - "HDFS Scalability: The Limits to Growth", USENIX ;login:, 2010
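     The 20 PB figure follows directly from the block count. As a rough check, assuming an average block of about 100 MB (blocks are rarely a full 128 MB; this average is an assumption, not from the slides):

        2 x 10^8 blocks x 100 MB/block = 2 x 10^10 MB = 20 PB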
 11. Horizontal to Vertical Scaling
     - Horizontal scaling is limited by the single-master architecture
     - Natural growth of compute power and storage density: clusters are composed of more powerful servers
     - Vertical scaling leads to shrinking cluster sizes
     - Storage capacity, compute power, and cost remain constant
 12. Shrinking Clusters
     - As resources per node (cores, disks, RAM) grow, cluster size (number of nodes) shrinks:
       o 2008: Yahoo!, 4,000-node cluster
       o 2010: Facebook, 2,000 nodes
       o 2011: eBay, 1,000 nodes
       o 2013: cluster of 500 nodes
 13. Scalability for Hadoop 2.0
     - HDFS Federation
       o Independent NameNodes sharing a common pool of DataNodes
       o The cluster is a family of volumes with a shared block storage layer
       o Users see volumes as isolated file systems
       o ViewFS: the client-side mount table (see the sketch after this slide)
     - YARN: the new MapReduce framework
       o Dynamic partitioning of cluster resources: no fixed slots
       o Separation of JobTracker functions:
         1. Job scheduling and resource allocation: centralized
         2. Job monitoring and job life-cycle coordination: decentralized, delegating coordination of different jobs to other nodes
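     As a rough illustration of the client-side mount table, the sketch below configures ViewFS programmatically to stitch two federated namespaces into one client view; the cluster name, NameNode addresses, and mount points are hypothetical.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ViewFsExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The client sees a single viewfs:// namespace...
            conf.set("fs.defaultFS", "viewfs://ClusterX");
            // ...assembled from mount points that each resolve to a different NameNode.
            conf.set("fs.viewfs.mounttable.ClusterX.link./user",
                     "hdfs://nn1.example.com:8020/user");
            conf.set("fs.viewfs.mounttable.ClusterX.link./data",
                     "hdfs://nn2.example.com:8020/data");

            FileSystem fs = FileSystem.get(conf);
            for (FileStatus s : fs.listStatus(new Path("/user"))) {
              System.out.println(s.getPath());   // served by nn1
            }
          }
        }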
 14. Namespace Partitioning
     - Static: Federation
       o Directory sub-trees are statically assigned to disjoint volumes
       o Relocating sub-trees without copying is challenging
       o Scale x10: billions of files
     - Dynamic
       o Files and directory sub-trees can move automatically between nodes based on utilization or load-balancing requirements
       o Files can be relocated without copying data blocks
       o Scale x100: hundreds of billions of files
     - The approaches are orthogonal and independent: federation of distributed namespaces is possible
 15. Distributed Namespaces Today
     - Ceph
       o Metadata stored on OSDs
       o MDSs cache metadata: dynamic partitioning
     - Lustre
       o Plans to release a distributed namespace in 2.4; code ready
     - Colossus, from Google (S. Quinlan and J. Dean)
       o 100 million files per metadata server
       o Hundreds of servers
     - VoldFS, CassandraFS, KTHFS (MySQL): prototypes
 16. HBase Overview
     - Table: big, sparse, loosely structured
       o A collection of rows, sorted by row keys
       o Rows can have an arbitrary number of columns
     - Tables are split horizontally into regions
       o Dynamic table partitioning!
       o RegionServers serve regions to applications
     - Columns are grouped into column families
       o Vertical partitioning of tables
     - Distributed cache: regions are loaded into nodes' RAM
       o Real-time access to data
 17. HBase API
     - HBaseAdmin: administrative functions
       o Create, delete, list tables
       o Create, update, delete columns and column families
       o Split, compact, flush
     - HTable: access to table data (a usage sketch follows this slide)
       o Result HTable.get(Get g)          // get cells of a row
       o void HTable.put(Put p)            // update a row
       o void HTable.put(Put[] p)          // batch update of rows
       o void HTable.delete(Delete d)      // delete cells / a row
       o ResultScanner getScanner(family)  // scan a column family
     - Coprocessors: custom actions triggered by update events
       o Like database triggers and stored procedures
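     A minimal sketch of the HTable calls listed above, using the HBase client API of that era; the table name, column family, and row key are hypothetical.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HTableExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "testtable");   // hypothetical table name

            // Update a row: one cell in the (hypothetical) "cf" column family.
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("length"), Bytes.toBytes(1024L));
            table.put(put);

            // Read the cells of the same row back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            long length = Bytes.toLong(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("length")));
            System.out.println("length = " + length);

            table.close();
          }
        }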
 18. HBase Architecture
     - [Architecture diagram]
 19. Giraffa File System
     - HDFS + HBase = Giraffa
       o Goal: build from existing building blocks
       o Minimize changes to existing components
     1. Store file & directory metadata in an HBase table
       o Dynamic table partitioning into regions
       o Cached in RegionServer RAM for fast access
     2. Store file data in HDFS DataNodes: data streaming
     3. Block management
       o Handle communication with DataNodes
       o Perform block replication
 20. Giraffa Requirements
     - More files & more data
     - Availability
       o Load balancing of metadata traffic
       o Same data streaming speed to / from DataNodes
       o No SPOF
     - Cluster operability and management
       o Cost of running larger clusters same as for smaller ones

                            HDFS          Federated HDFS   Giraffa
       Space                25 PB         120 PB           1 EB (1000 PB)
       Files + blocks       200 million   1 billion        100 billion
       Concurrent clients   40,000        100,000          1 million
 21. FAQ: Why HDFS and HBase?
     - Building a new file system from scratch is really hard and takes years
     - HDFS is reliable, scalable block storage
       o Efficient data streaming
       o Automatic data recovery
     - HBase is a natural metadata service
       o Distributed cache, dynamic partitioning
       o Automatic metadata recovery
     - Same breed, should be "compatible": HBase stores its data in HDFS, so data and metadata share the same storage
 22. FAQ: Why not store whole files in HBase tables?
     - It defeats the main concept of distributed file systems: decoupling of data and metadata
     - Small files can be stored as rows
       o Row size is limited by region size
       o Large files must be split
     - It is technically possible to split any information into rows
       o Log files: into events
       o Video files: into frames
       o Random bits: into 1 KB blobs with an offset as the row key
     - But this is a different level of abstraction and requires data conversion
 23. FAQ: My Dataset is Only 1 PB. Do I Still Need Giraffa?
     - Availability
       o Distributed access to the namespace for many concurrent clients
       o Not bottlenecked by single-NameNode performance
     - "Small files"
       o The block-to-file ratio is decreasing: 2 -> 1.5 -> 1.2
       o No need to aggregate small files into large archives
 24. Building Blocks
     - A single table called "Namespace" stores
       o The file ID (row key) and file attributes: name, replication, block size, permissions, times
       o The list of blocks
       o Block locations
     - Giraffa client: a FileSystem implementation
       o Obtains metadata from HBase
       o Exchanges data with DataNodes
     - Block manager: maintains the flat namespace of blocks
       o Block allocation, replication, removal
       o DataNode management
       o Storage for the HBase table
 25. Giraffa Architecture
     - [Diagram: an application uses the NamespaceAgent to reach the HBase Namespace table (rows hold path, attrs, block[], DN[][], BM-node); a Block Management Agent connects the table to the Block Management Layer of BM servers running over the DataNodes]
     1. The Giraffa client gets files and blocks from HBase
     2. It may directly query the Block Manager
     3. It streams data to or from DataNodes
 26. Namespace Table
     - Row keys
       o Identify files and directories as rows in the table
       o Different key definitions based on locality requirements
       o The key definition is chosen during formatting of the file system
       o Full-path key is the default
     - Columns (a row-layout sketch follows this slide)
       o File attributes: local name, owner, group, permissions, access time, modification time, block size, replication, isDir, length
       o The list of blocks of a file: persisted in the table
       o The list of block locations for each block: not persisted, but discovered from block reports
     - A directory table maps each directory-entry name to the corresponding row key
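     As an illustration of this row layout, here is a hedged sketch that writes one file entry into the Namespace table using the default full-path key; the column family and qualifier names are hypothetical, not Giraffa's actual schema.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class NamespaceRowSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable namespace = new HTable(conf, "Namespace");

            // Default key definition: the full path of the file identifies the row.
            byte[] rowKey = Bytes.toBytes("/user/plamen/data/part-00000");
            byte[] family = Bytes.toBytes("ns");                // hypothetical column family

            Put row = new Put(rowKey);
            // File attributes, one qualifier per attribute (names are illustrative).
            row.add(family, Bytes.toBytes("owner"),       Bytes.toBytes("plamen"));
            row.add(family, Bytes.toBytes("replication"), Bytes.toBytes((short) 3));
            row.add(family, Bytes.toBytes("blockSize"),   Bytes.toBytes(128L * 1024 * 1024));
            row.add(family, Bytes.toBytes("isDir"),       Bytes.toBytes(false));
            // The block list would be a serialized list of block IDs, persisted in the table.
            row.add(family, Bytes.toBytes("blocks"),      Bytes.toBytes("blk_123_001,blk_234_002"));
            // Block locations are NOT written here: they are discovered from block reports.

            namespace.put(row);
            namespace.close();
          }
        }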
 27. Giraffa Client
     - GiraffaFileSystem implements FileSystem (a usage sketch follows this slide)
       o fs.defaultFS = grfa:///
       o fs.grfa.impl = o.a.giraffa.GiraffaFileSystem
     - GiraffaClient extends DFSClient
       o The NamespaceAgent replaces the NameNode RPC
     - [Diagram: GiraffaFileSystem wraps GiraffaClient, which extends DFSClient; the NamespaceAgent handles what used to go to the NameNode, while data still flows to the DataNodes]
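     A hedged sketch of how an application would use these settings: point the default file system at the grfa:/// scheme and go through the ordinary FileSystem API. The path is hypothetical, and the fully qualified class name is an assumed expansion of the slide's "o.a.giraffa" abbreviation.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class GiraffaClientExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "grfa:///");
            conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem");

            // The application codes against the generic FileSystem API;
            // metadata calls go to HBase via the NamespaceAgent,
            // data is streamed to DataNodes through the inherited DFSClient path.
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/user/plamen/hello.txt"));
            out.writeBytes("hello, giraffa\n");
            out.close();
            fs.close();
          }
        }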
 28. Block Management
     - Block Manager
       o Block allocation, deletion, replication
     - DataNode Manager
       o Processes DataNode block reports and heartbeats; identifies lost nodes
     - Provides storage for the HBase table
       o A small file system to store HFiles
     - A BMServer is paired on the same node with a RegionServer
       o Distributed cluster of BMServers
       o Mostly local communication between Region and BM servers
     - The NameNode is an initial implementation of the BMServer
       o A Giraffa block is a single-block file with the same name as the block ID
 29. Three Problems
     - Bootstrapping: HBase stores tables as files in HDFS
     - Namespace partitioning: retain locality
     - Atomic renames
 30. Bootstrapping
     - [Diagram: a Block Manager Server hosts two volumes. The HBase Volume holds the table layout (/, hbase/, giraffa/, region1, region2, .log) and sees rare updates. The Block Volume is a flat namespace of blocks, e.g. blk_123_001 on dn-1, dn-2, dn-3; blk_234_002 on dn-11, dn-12, dn-13; blk_345_003 on dn-101, dn-102, dn-103]
 31. Locality of Reference
     - Row keys define the sorting of files and directories in the table
       o The tree-structured namespace is flattened into a linear array
     - The ordered list of files is self-partitioned into regions
     - Retain locality in the linearized structure
       o Files in the same directory are adjacent in the table and belong to the same region, with some exclusions
       o Files of the same directory should be on the same node: avoid jumping across regions for a simple "ls"
 32. Partitioning Example 1
     - Straightforward partitioning based on random hashing
     - [Diagram: the namespace tree mapped onto regions T1-T4 by hashed IDs, scattering entries of the same directory across regions]
 33. Partitioning Example 2
     - Partitioning based on lexicographic full-path ordering (the default)
     - [Diagram: the same namespace tree mapped onto regions T1-T4 in full-path order, keeping each directory's entries together]
 34. Partitioning Example 3
     - Partitioning based on fixed-depth neighborhoods (a key-generation sketch follows this slide)
     - [Diagram: the namespace tree mapped onto regions T1-T4, grouping each entry with its neighborhood up to a fixed depth]
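     To contrast the three examples, here is a hedged sketch of how a row key could be derived from a file path under each scheme; the helper names and the depth parameter are illustrative assumptions, not Giraffa's actual key classes.

        public class RowKeySketch {

          // Example 1: random-hash key. Destroys locality: siblings scatter across regions.
          static String hashKey(String path) {
            return Integer.toHexString(path.hashCode());
          }

          // Example 2: full-path key (the default). Lexicographic order keeps a
          // directory's entries adjacent in the table.
          static String fullPathKey(String path) {
            return path;
          }

          // Example 3: fixed-depth neighborhood key. The first `depth` path components
          // form a prefix, so everything under that neighborhood sorts together.
          static String fixedDepthKey(String path, int depth) {
            String[] parts = path.split("/");
            StringBuilder prefix = new StringBuilder();
            for (int i = 1; i <= depth && i < parts.length; i++) {
              prefix.append('/').append(parts[i]);
            }
            return prefix + "\u0000" + path;   // neighborhood prefix, then the full path
          }

          public static void main(String[] args) {
            String path = "/user/plamen/data/part-00000";
            System.out.println(hashKey(path));
            System.out.println(fullPathKey(path));
            System.out.println(fixedDepthKey(path, 2));
          }
        }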
 35. Atomic Rename
     - Giraffa will implement atomic in-place rename
       o No support for an atomic file move from one directory to another
     - A move can then be implemented at the application level (see the sketch after this slide)
       o Non-atomically move the target file from the source directory to a temporary file in the target directory
       o Atomically rename the temporary file to its original name
       o On failure, use a simple 3-step recovery procedure
     - Eventually implement atomic moves
       o PAXOS
       o Simplified synchronization algorithms
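     A hedged sketch of the application-level move described above, written against the generic FileSystem API; the temporary-name convention and the recovery details are illustrative assumptions, not the project's actual procedure.

        import java.io.IOException;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;

        public class ApplicationLevelMove {

          /** Move src into dstDir using only an in-place atomic rename within a directory. */
          static void move(FileSystem fs, Path src, Path dstDir) throws IOException {
            Path tmp = new Path(dstDir, "." + src.getName() + ".moving");  // illustrative temp name
            Path dst = new Path(dstDir, src.getName());

            // Step 1 (not atomic): get the file into the target directory under a temporary name.
            FileUtil.copy(fs, src, fs, tmp, false /* keep source */, fs.getConf());

            // Step 2 (atomic in Giraffa): in-place rename within the target directory.
            if (!fs.rename(tmp, dst)) {
              fs.delete(tmp, false);   // recovery: roll back the partial copy
              throw new IOException("move failed: " + src + " -> " + dst);
            }

            // Step 3: remove the original only after the destination is in place.
            fs.delete(src, false);
          }
        }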
 36. History
     - (2008) Idea. Study of distributed systems
       o AFS, Lustre, Ceph, PVFS, GPFS, Farsite, ...
       o Partitioning of the namespace: 4 types of partitioning
     - (2009) Study of scalability limits; NameNode optimization
     - (2010) Design with Michael Stack; presentation at the HDFS contributors meeting
     - (2011) Plamen implements the proof of concept
     - (2012) Rewrite open sourced as an Apache Extras project: http://code.google.com/a/apache-extras.org/p/giraffa/
 37. Status
     - Design stage
     - One-node cluster running
     - Live demo with Plamen
 38. Thank You!
