Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger

HDFS scalability and availability are limited by its single-namespace-server design. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from HDFS DataNodes. Giraffa is intended to provide higher scalability and availability and to maintain very large namespaces. The presentation explains the Giraffa architecture and motivation, addresses its main challenges, and gives an update on the status of the project.

Presenter: Konstantin Shvachko (PhD), Founder, AltoScale

  1. The Giraffa File System
     Konstantin V. Shvachko, Alto Storage Technologies
     Hadoop User Group, September 19, 2012
  2. Giraffa
     - Giraffa is a distributed, highly available file system
     - Utilizes features of HDFS and HBase
     - New open source project in experimental stage
  3. Apache Hadoop
     - A reliable, scalable, high-performance distributed storage and
       computing system
     - The Hadoop Distributed File System (HDFS): reliable storage layer
     - MapReduce: distributed computation framework with a simple
       computational model
     - Ecosystem of Big Data tools: HBase, ZooKeeper
  4. The Design Principles
     - Linear scalability: more nodes can do more work within the same time,
       on both data size and compute resources
     - Reliability and availability: one drive fails in three years, so its
       probability of failing today is about 1/1000; several drives fail
       every day on a cluster with thousands of drives
     - Move computation to data: minimize expensive data transfers
     - Sequential data processing: avoid random reads (use HBase for random
       data access)
  5. Hadoop Cluster
     - HDFS, a distributed file system
       - NameNode: namespace and block management
       - DataNodes: block replica containers
     - MapReduce, a framework for distributed computations
       - JobTracker: job scheduling, resource management, lifecycle
         coordination
       - TaskTracker: task execution module
     [Diagram: NameNode and JobTracker above a row of DataNode/TaskTracker
      pairs]
  6. Hadoop Distributed File System
     - The namespace is a hierarchy of files and directories
     - Files are divided into large blocks (128 MB)
     - Namespace (metadata) is decoupled from data
       - Fast namespace operations, not slowed down by data streaming
       - Direct data streaming from the source storage
     - A single NameNode keeps the entire namespace in RAM
     - DataNodes store block replicas as files on local drives
     - Blocks are replicated on 3 DataNodes for redundancy and availability
     - HDFS client: the point of entry to HDFS
       - Contacts the NameNode for metadata
       - Serves data to applications directly from DataNodes
  7. Scalability Limits
     - Single-master architecture: a constraining resource
       - The single NameNode limits linear performance growth
       - A handful of "bad" clients can saturate the NameNode
       - Single point of failure: takes the whole cluster out of service
     - NameNode space limit: 100 million files and 200 million blocks with
       64 GB RAM
       - Restricts storage capacity to 20 PB
       - Small file problem: the block-to-file ratio is shrinking
     - "HDFS Scalability: The Limits to Growth", USENIX ;login:, 2010
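     A quick sanity check on the slide's own numbers: 20 PB over 200 million
     blocks is 20*10^15 / 200*10^6 = 100 MB per block on average, already
     below the 128 MB block size; this is the shrinking block-to-file ratio
     at work.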
  8. Node Count Visualization
     [Chart: cluster size (number of nodes) vs. resources per node (cores,
      disks, RAM)]
     - 2008: Yahoo!, 4000-node cluster
     - 2010: Facebook, 2000 nodes
     - 2011: eBay, 1000 nodes
     - 2013: cluster of 500 nodes
  9. Horizontal to Vertical Scaling
     - Horizontal scaling is limited by the single-master architecture
     - Natural growth of compute power and storage density: clusters composed
       of denser, more powerful servers
     - Vertical scaling leads to cluster size shrinking, while storage
       capacity, compute power, and cost remain constant
     - Exponential information growth:
       - 2006: Chevron accumulates 2 TB a day
       - 2012: Facebook ingests 500 TB a day
  10. Scalability for Hadoop 2.0
      - HDFS Federation
        - Independent NameNodes sharing a common pool of DataNodes
        - The cluster is a family of volumes with a shared block storage layer
        - Users see volumes as isolated file systems
        - ViewFS: the client-side mount table
      - YARN: the new MapReduce framework
        - Dynamic partitioning of cluster resources: no fixed slots
        - Separation of JobTracker functions:
          1. Job scheduling and resource allocation: centralized
          2. Job monitoring and job life-cycle coordination: decentralized;
             coordination of different jobs is delegated to other nodes
  11. Namespace Partitioning
      - Static (Federation)
        - Directory sub-trees are statically assigned to disjoint volumes
        - Relocating sub-trees without copying is challenging
        - Scale x10: billions of files
      - Dynamic
        - Files and directory sub-trees can move automatically between nodes
          based on their utilization or load-balancing requirements
        - Files can be relocated without copying data blocks
        - Scale x100: hundreds of billions of files
      - The approaches are orthogonal and independent; federation of
        distributed namespaces is possible
  12. Giraffa File System
      - HDFS + HBase = Giraffa
      - Goal: build from existing building blocks, minimizing changes to
        existing components
      1. Store file and directory metadata in an HBase table
         - Dynamic table partitioning into regions
         - Cached in RegionServer RAM for fast access
      2. Store file data on HDFS DataNodes: data streaming
      3. Block management
         - Handle communication with DataNodes: heartbeat, blockReport,
           addBlock
         - Perform block allocation, replication, and deletion
  13. Giraffa Requirements
      - Availability: the primary goal
        - Load balancing of metadata traffic
        - Same data streaming speed to and from DataNodes
        - Continuous availability: no SPOF
      - Cluster operability and management: the cost of running a larger
        cluster stays the same as for a smaller one
      - More files and more data:

                             HDFS          Federated HDFS   Giraffa
        Space                25 PB         120 PB           1 EB (1000 PB)
        Files + blocks       200 million   1 billion        100 billion
        Concurrent clients   40,000        100,000          1 million
  14. HBase Overview
      - Table: big, sparse, loosely structured
        - A collection of rows, sorted by row keys
        - Rows can have an arbitrary number of columns
      - Dynamic table partitioning: a table is split horizontally into regions
        - RegionServers serve regions to applications
      - Columns are grouped into column families: a vertical partition of
        the table
      - Distributed cache: regions are loaded into nodes' RAM for real-time
        access to data
  15. HBase Architecture
      [Architecture diagram]
  16. HBase API
      - HBaseAdmin: administrative functions
        - Create, delete, list tables
        - Create, update, delete columns and column families
        - Split, compact, flush
      - HTable: access table data (see the snippet after this slide)
        - Result HTable.get(Get g)           // get cells of a row
        - void HTable.put(Put p)             // update a row
        - void HTable.delete(Delete d)       // delete cells / a row
        - ResultScanner getScanner(family)   // scan a column family
      - A variety of filters
      - Coprocessors: custom actions triggered by update events, like
        database triggers or stored procedures
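      For illustration, a minimal client snippet against this 2012-era HBase
      API (the table, column family, qualifier, and row key are made-up names
      for the example, not Giraffa's actual schema):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HTableExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "Namespace"); // illustrative name
            try {
              // Update a row: write one cell into a made-up column family.
              Put put = new Put(Bytes.toBytes("/user/alice/file1"));
              put.add(Bytes.toBytes("file"), Bytes.toBytes("length"),
                      Bytes.toBytes(134217728L));
              table.put(put);

              // Get cells of the same row back.
              Result result =
                  table.get(new Get(Bytes.toBytes("/user/alice/file1")));
              long length = Bytes.toLong(
                  result.getValue(Bytes.toBytes("file"),
                                  Bytes.toBytes("length")));
              System.out.println("length = " + length);
            } finally {
              table.close();
            }
          }
        }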
  17. Building Blocks
      - Giraffa clients
        - Fetch file and block metadata from the Namespace Service
        - Exchange data with DataNodes
      - Namespace Service: an HBase table stores file metadata as rows
      - Block Management: a distributed collection of Giraffa block metadata
      - Data Management: DataNodes, a distributed collection of data blocks
  18. Giraffa Architecture
      [Diagram: an application's NamespaceAgent talks to the HBase Namespace
       Table (rows of path, attrs, block[], DN[][]) and its Block Management
       Processor; below them sit the Block Management Layer (BM servers) and
       the DataNodes (DN)]
      1. The Giraffa client gets files and blocks from HBase
      2. The Block Manager handles block operations
      3. Data streams to or from DataNodes
  19. Giraffa Client
      - GiraffaFileSystem implements FileSystem (see the sketch after this
        slide)
        - fs.defaultFS = grfa:///
        - fs.grfa.impl = o.a.giraffa.GiraffaFileSystem
      - GiraffaClient extends DFSClient
      - NamespaceAgent replaces the NameNode RPC
      [Diagram: GiraffaFileSystem wraps GiraffaClient (a DFSClient), whose
       NamespaceAgent talks to the Namespace and whose data path goes to the
       DataNodes]
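      As a sketch of the client-side wiring implied by these two properties
      (assuming "o.a.giraffa" abbreviates org.apache.giraffa; the file path
      is made up):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class GiraffaClientSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Make Giraffa the default FS and register its implementation.
            conf.set("fs.defaultFS", "grfa:///");
            conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem");

            // From here on, Giraffa is used through the standard
            // FileSystem API, exactly like HDFS would be.
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/user/alice/hello.txt"));
            out.writeBytes("hello giraffa\n");
            out.close();
          }
        }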
  20. Namespace Table
      - A single table called "Namespace" stores:
        - Row key = file ID
        - File attributes: local name, owner, group, permissions, access
          time, modification time, block size, replication, isDir, length
        - The list of blocks of a file: persisted in the table
        - The list of block locations for each block: not persisted, but
          discovered from the BlockManager
      - Directory table: maps a directory entry name to the respective child
        row key
  21. Namespace Service
      [Diagram: the HBase Namespace Service as a row of Region Servers, each
       hosting regions with an NS Processor and a BM Processor, on top of
       the Block Management Layer]
  22. Block Manager
      - Maintains the flat namespace of Giraffa block metadata
        1. Block management: block allocation, deletion, replication
        2. DataNode management: process DataNode block reports and
           heartbeats; identify lost nodes
        3. Storage for the HBase table: a small file system to store HFiles
           and the HLog
      - The BM Server is paired on the same node with a RegionServer
        - A distributed cluster of BM Servers
        - Mostly local communication between Region and BM Servers
        - NameNode as an initial implementation of the BM Server
  23. Data Management
      - DataNodes store and report data blocks; blocks are files on local
        drives
      - Data transfers to and from clients
      - Internal data transfers
      - Same as HDFS
  24. Row Key Design
      - Row keys
        - Identify files and directories as rows in the table
        - Define the sorting of rows in the Namespace table, and therefore
          the namespace partitioning
      - Different row key definitions based on locality requirements
        - The key definition is chosen when the file system is formatted
      - Full-path-key is the default implementation
        - Problem: a rename can move an object to another region
      - Row keys based on inode numbers (sketched below)
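      A minimal sketch of what a pluggable row-key scheme could look like
      (the interface and class names are illustrative, not the project's
      actual API):

        import java.nio.charset.StandardCharsets;

        // Hypothetical abstraction: each scheme turns a path into a byte[]
        // whose lexicographic order defines sorting and partitioning.
        interface RowKey {
          byte[] getKey(String path);
        }

        // Default scheme: the full path itself. A directory's children
        // share its prefix and sort together, but a rename changes the key
        // and may move the row to another region.
        class FullPathRowKey implements RowKey {
          public byte[] getKey(String path) {
            return path.getBytes(StandardCharsets.UTF_8);
          }
        }

        // Alternative scheme: a fixed-width inode number. Keys survive
        // renames, at the price of directory locality in the sort order.
        class INodeRowKey implements RowKey {
          private final long inodeId;
          INodeRowKey(long inodeId) { this.inodeId = inodeId; }
          public byte[] getKey(String path) {
            return String.format("%020d", inodeId)
                .getBytes(StandardCharsets.UTF_8);
          }
        }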
  25. Locality of Reference
      - Files in the same directory are adjacent in the table
        - They belong to the same region (most of the time)
        - Efficient "ls": avoids jumping across regions
      - Row keys define the sorting of files and directories in the table
        - The tree-structured namespace is flattened into a linear array
        - The ordered list of files is self-partitioned into regions
      - How to retain tree locality in the linearized structure?
  26. Partitioning: Random
      - Straightforward partitioning based on random hashing
      [Diagram: namespace tree nodes 1..16 hashed by id across tables T1-T4]
  27. Partitioning: Full Subtrees
      - Partitioning based on lexicographic full-path ordering
      - The default for Giraffa
      [Diagram: full sub-trees of the namespace tree mapped onto tables T1-T4]
  28. Partitioning: Fixed Neighborhood
      - Partitioning based on fixed-depth neighborhoods
      [Diagram: fixed-depth neighborhoods of the namespace tree mapped onto
       tables T1-T4]
  29. Atomic Rename
      - Giraffa will implement atomic in-place rename
        - No support for an atomic file move from one directory to another
        - Requires inode numbers as unique file IDs
      - A move can then be implemented at the application level (see the
        sketch after this slide):
        - Non-atomically move the file from the source directory to a
          temporary file in the target directory
        - Atomically rename the temporary file to its original name
        - On failure, use the simple 3-step recovery procedure
      - Eventually implement atomic moves
        - PAXOS
        - Simplified synchronization algorithms (ZAB)
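      As an illustration, the two-phase move over the standard FileSystem
      API (a simplified sketch with a made-up temporary-name convention,
      not Giraffa code):

        import java.io.IOException;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ApplicationLevelMove {
          // Move src into trgDir, assuming only in-place rename is atomic.
          public static void move(FileSystem fs, Path src, Path trgDir)
              throws IOException {
            Path tmp = new Path(trgDir, src.getName() + ".tmp");
            Path trg = new Path(trgDir, src.getName());
            fs.rename(src, tmp); // phase 1: non-atomic move to a temp name
            fs.rename(tmp, trg); // phase 2: atomic in-place rename
          }
        }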
  30. 3-Step Recovery Procedure
      A move of a file from srcDir to trgDir failed:
      1. If only the source file exists, start the move over
      2. If only the temporary target file exists, complete the move by
         renaming the temporary file to its original name
      3. If both the source and the temporary target file exist, remove the
         source and rename the temporary file
         - This step is non-atomic and may fail as well; in case of failure,
           repeat the recovery procedure
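      Continuing the sketch above, the three recovery cases map directly
      onto existence checks (again illustrative, not project code):

        import java.io.IOException;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class MoveRecovery {
          public static void recover(FileSystem fs, Path src, Path tmp,
              Path trg) throws IOException {
            boolean srcExists = fs.exists(src);
            boolean tmpExists = fs.exists(tmp);
            if (srcExists && !tmpExists) {
              fs.rename(src, tmp);   // step 1: start the move over
              fs.rename(tmp, trg);
            } else if (!srcExists && tmpExists) {
              fs.rename(tmp, trg);   // step 2: finish the atomic rename
            } else if (srcExists && tmpExists) {
              fs.delete(src, false); // step 3: drop the source, then rename;
              fs.rename(tmp, trg);   // repeat recovery if this step fails
            }
          }
        }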
  31. New Giraffa Functionality
      - Custom file attributes: user-defined file metadata
        - Today hidden in complex file names or nested directories,
          e.g. /logs/2012/08/31/server-ip.log
        - Or stored in ZooKeeper or even stand-alone DBs, which involves
          synchronization
        - Enables advanced scanning, grouping, filtering
      - Amazon S3 API: turns Giraffa into reliable storage on the cloud
      - Versioning
        - Based on HBase row versioning
        - Restore objects deleted inadvertently
        - An alternative approach to snapshots
  32. Status
      - We are on Apache Extras
      - One-node cluster running
      - Row key abstraction
        - The HBase implementation is in a separate package
        - Other DBs or key-value stores can be plugged in
      - Infrastructure: Eclipse, FindBugs, JavaDoc, Ivy, Jenkins, Wiki
      - Server-side processing of FS requests: HBase endpoints
      - Testing Giraffa with TestHDFSCLI
      - Next: Web UI, multi-node cluster, release…
  33. Thank You!
  34. Related Work
      - Ceph
        - Metadata stored on OSDs
        - MDSs cache metadata: dynamic partitioning
      - Lustre
        - Plans to release a distributed namespace in 2.4; code ready
      - Colossus (Google; S. Quinlan and J. Dean)
        - 100 million files per metadata server
        - Hundreds of servers
      - VoldFS, CassandraFS, KTHFS (MySQL): prototypes
      - MapR distributed file system
  35. History
      - (2008) Idea. Study of distributed systems: AFS, Lustre, Ceph, PVFS,
        GPFS, Farsite, …
        - Partitioning of the namespace: 4 types of partitioning
      - (2009) Study of scalability limits. NameNode optimization
      - (2010) Design with Michael Stack. Presentation at the HDFS
        contributors meeting
      - (2011) Plamen implements a proof of concept
      - (2012) Rewrite open sourced as an Apache Extras project:
        http://code.google.com/a/apache-extras.org/p/giraffa/
  36. Etymology
      - Giraffe. Latin: Giraffa camelopardalis
        - Family Giraffidae, genus Giraffa, species Giraffa camelopardalis
      - Other languages: Arabic zarafa, Spanish jirafa, Bulgarian жирафа,
        Italian giraffa
      - A favorite of my daughter's, as the Hadoop traditions require
