Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger

HDFS scalability and availability are limited by its single-namespace-server design. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from HDFS DataNodes. Giraffa is intended to provide higher scalability and availability and to maintain very large namespaces. The presentation will explain the Giraffa architecture and motivation, address its main challenges, and give an update on the status of the project.

Presenter: Konstantin Shvachko (PhD), Founder, AltoScale




Presentation Transcript

  • The Giraffa File System. Konstantin V. Shvachko, Alto Storage Technologies. September 19, 2012, Hadoop User Group
  • Giraffa: Giraffa is a distributed, highly available file system. It utilizes features of HDFS and HBase. New open source project in experimental stage.
  • Apache Hadoop: A reliable, scalable, high-performance distributed storage and computing system. The Hadoop Distributed File System (HDFS) is the reliable storage layer; MapReduce is the distributed computation framework with a simple computational model. An ecosystem of Big Data tools: HBase, ZooKeeper.
  • The Design Principles: Linear scalability – more nodes can do more work within the same time; applies to both data size and compute resources. Reliability and availability – one drive fails in 3 years, so the probability of a given drive failing today is about 1/1000; several drives fail every day on a cluster with thousands of drives. Move computation to data – minimize expensive data transfers. Sequential data processing – avoid random reads; use HBase for random data access.
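The failure arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate only; the 5,000-drive cluster size is an assumption (the slide says "thousands of drives"):

```java
// Back-of-the-envelope failure math from the slide, assuming a
// hypothetical cluster with 5,000 drives (the slide says "thousands").
public class DriveFailures {
    public static void main(String[] args) {
        double driveLifetimeDays = 3 * 365;          // 1 drive fails in ~3 years
        double pFailToday = 1.0 / driveLifetimeDays; // ~1/1000, as the slide rounds it
        int drives = 5_000;                          // assumed cluster size
        double expectedFailuresPerDay = drives * pFailToday;
        System.out.printf("p(fail today) = %.4f%n", pFailToday);
        System.out.printf("expected failures/day = %.1f%n", expectedFailuresPerDay);
    }
}
```

With ~5,000 drives this works out to roughly 4–5 expected drive failures per day, matching the slide's "several drives fail every day".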
  • Hadoop Cluster: HDFS – a distributed file system: NameNode – namespace and block management; DataNodes – block replica containers. MapReduce – a framework for distributed computations: JobTracker – job scheduling, resource management, lifecycle coordination; TaskTracker – task execution module. [cluster diagram]
  • Hadoop Distributed File System: The namespace is a hierarchy of files and directories. Files are divided into large blocks (128 MB). Namespace (metadata) is decoupled from data: namespace operations are fast and are not slowed down by data streaming, which goes directly from the source storage. A single NameNode keeps the entire namespace in RAM. DataNodes store block replicas as files on local drives; blocks are replicated on 3 DataNodes for redundancy and availability. The HDFS client is the point of entry to HDFS: it contacts the NameNode for metadata and serves data to applications directly from DataNodes.
  • Scalability Limits: The single-master architecture is a constraining resource: a single NameNode limits linear performance growth, and a handful of "bad" clients can saturate it. It is also a single point of failure: its loss takes the whole cluster out of service. NameNode space limit: 100 million files and 200 million blocks with 64 GB of RAM, which restricts storage capacity to about 20 PB. The small file problem: the block-to-file ratio is shrinking. See "HDFS Scalability: The limits to growth", USENIX ;login:, 2010.
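The 20 PB figure follows from the block count by simple arithmetic, if we assume an average block size of roughly 100 MB (an assumption on my part; blocks are at most 128 MB, and real files rarely fill the last block):

```java
// Rough capacity math behind the slide's ~20 PB limit, assuming an
// average block size of ~100 MB (assumed; the block size cap is 128 MB).
public class NamespaceLimit {
    public static void main(String[] args) {
        long blocks = 200_000_000L;              // 200 million blocks fit in 64 GB of NameNode RAM
        double avgBlockBytes = 100_000_000.0;    // ~100 MB average block (assumption)
        double petabytes = blocks * avgBlockBytes / 1e15;
        System.out.printf("addressable capacity = %.0f PB%n", petabytes);
    }
}
```

So the namespace server's RAM, not the disks, caps total cluster capacity.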
  • Node Count Visualization: As resources per node (cores, disks, RAM) grow, cluster sizes shrink: 2008, Yahoo! – a 4000-node cluster; 2010, Facebook – 2000 nodes; 2011, eBay – 1000 nodes; 2013 – a cluster of 500 nodes. [chart: resources per node vs. number of nodes]
  • Horizontal to Vertical Scaling: Horizontal scaling is limited by the single-master architecture. With the natural growth of compute power and storage density, clusters are composed of denser and more powerful servers. Vertical scaling leads to cluster size shrinking while storage capacity, compute power, and cost remain constant. Meanwhile information grows exponentially: in 2006 Chevron accumulated 2 TB a day; in 2012 Facebook ingests 500 TB a day.
  • Scalability for Hadoop 2.0: HDFS Federation – independent NameNodes sharing a common pool of DataNodes. The cluster is a family of volumes with a shared block storage layer; the user sees volumes as isolated file systems via ViewFS, the client-side mount table. YARN – the new MapReduce framework: dynamic partitioning of cluster resources (no fixed slots) and separation of JobTracker functions: (1) job scheduling and resource allocation are centralized; (2) job monitoring and job life-cycle coordination are decentralized, delegating coordination of different jobs to other nodes.
  • Namespace Partitioning: Static (Federation) – directory sub-trees are statically assigned to disjoint volumes; relocating sub-trees without copying is challenging; scales x10, to billions of files. Dynamic – files and directory sub-trees can move automatically between nodes based on their utilization or load-balancing requirements; files can be relocated without copying data blocks; scales x100, to hundreds of billions of files. These are orthogonal, independent approaches: federation of distributed namespaces is possible.
  • Giraffa File System: HDFS + HBase = Giraffa. Goal: build from existing building blocks, minimizing changes to existing components. (1) Store file and directory metadata in an HBase table, with dynamic table partitioning into regions, cached in RegionServer RAM for fast access. (2) Store file data in HDFS DataNodes: data streaming. (3) Block management: handle communication with DataNodes (heartbeat, blockReport, addBlock) and perform block allocation, replication, and deletion.
  • Giraffa Requirements: Availability is the primary goal: load balancing of metadata traffic, the same data streaming speed to and from DataNodes, and continuous availability – no SPOF. Cluster operability and management: the cost of running a larger cluster stays the same as for a smaller one. More files and more data:

                            HDFS          Federated HDFS   Giraffa
        Space               25 PB         120 PB           1 EB = 1000 PB
        Files + blocks      200 million   1 billion        100 billion
        Concurrent clients  40,000        100,000          1 million
  • HBase Overview: A table is big, sparse, and loosely structured – a collection of rows sorted by row keys; rows can have an arbitrary number of columns. Dynamic table partitioning: the table is split horizontally into regions, and Region Servers serve regions to applications. Columns are grouped into column families – a vertical partition of the table. Distributed cache: regions are loaded into nodes' RAM for real-time access to data.
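A toy in-memory analog can make the region idea concrete. Here a sorted `TreeMap` stands in for an HBase table, and splitting the key space at a midpoint key models a region split; the row keys and the split point are made up for illustration:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy in-memory analog of an HBase table: rows sorted by key,
// split horizontally into "regions" (contiguous key ranges).
public class RegionSplitSketch {
    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("row-a", "1"); table.put("row-f", "2");
        table.put("row-m", "3"); table.put("row-t", "4");

        String splitKey = "row-m"; // a region split point
        SortedMap<String, String> region1 = table.headMap(splitKey); // [start, row-m)
        SortedMap<String, String> region2 = table.tailMap(splitKey); // [row-m, end)

        System.out.println("region1: " + region1.keySet()); // [row-a, row-f]
        System.out.println("region2: " + region2.keySet()); // [row-m, row-t]
    }
}
```

Because the key space is totally ordered, any region is a contiguous range and can be served (and cached) independently by one Region Server.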
  • HBase Architecture: [architecture diagram]
  • HBase API: HBaseAdmin provides administrative functions – create, delete, and list tables; create, update, and delete columns and column families; split, compact, flush. HTable accesses table data: Result HTable.get(Get g) – get cells of a row; void HTable.put(Put p) – update a row; void HTable.delete(Delete d) – delete cells/row; ResultScanner getScanner(family) – scan a column family. A variety of filters. Coprocessors: custom actions triggered by update events, like database triggers or stored procedures.
  • Building Blocks: Giraffa clients fetch file and block metadata from the Namespace Service and exchange data with DataNodes. Namespace Service: an HBase table stores file metadata as rows. Block Management: a distributed collection of Giraffa block metadata. Data Management: DataNodes – a distributed collection of data blocks.
  • Giraffa Architecture: (1) the Giraffa client (the application with its NamespaceAgent) gets files and blocks – path, attrs, block[], DN[][] – from the HBase Namespace Table via the Block Management Processor; (2) the Block Manager in the Block Management Layer (BM servers) handles block operations; (3) data is streamed to or from DataNodes. [architecture diagram]
  • Giraffa Client: GiraffaFileSystem implements FileSystem; fs.defaultFS = grfa:///; fs.grfa.impl = o.a.giraffa.GiraffaFileSystem. GiraffaClient extends DFSClient; a NamespaceAgent replaces the NameNode RPC. [diagram: GiraffaFileSystem → GiraffaClient (DFSClient) → NamespaceAgent to the Namespace, and directly to DataNodes]
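The two properties on the slide would presumably be wired into a Hadoop configuration file such as core-site.xml; this fragment is a sketch of that wiring (the file name is an assumption, and I expand the slide's "o.a." abbreviation to "org.apache"):

```xml
<!-- Hypothetical core-site.xml fragment wiring in the Giraffa client,
     using the two property values shown on the slide.
     "org.apache" expands the slide's "o.a." abbreviation. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>grfa:///</value>
  </property>
  <property>
    <name>fs.grfa.impl</name>
    <value>org.apache.giraffa.GiraffaFileSystem</value>
  </property>
</configuration>
```

With these set, any code that calls FileSystem.get(conf) would transparently receive a GiraffaFileSystem instead of plain HDFS.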
  • Namespace Table: A single table called "Namespace" stores: Row Key = File ID; file attributes – local name, owner, group, permissions, access time, modification time, block size, replication, isDir, length; the list of blocks of a file, persisted in the table; the list of block locations for each block – not persisted, but discovered from the BlockManager. A directory table maps a directory entry name to the respective child's row key.
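As a data-structure sketch, one row of the table could be modeled roughly like this; the class and field names are illustrative, not Giraffa's actual code:

```java
// Hypothetical sketch of what one row of the "Namespace" table holds,
// following the slide's attribute list. Names are illustrative only.
public class NamespaceRow {
    // Row key = file ID (the full path under the default key scheme)
    String rowKey;
    // Persisted file attributes
    String localName, owner, group;
    short permissions;
    long accessTime, modificationTime, blockSize, length;
    short replication;
    boolean isDir;
    // The file's blocks -- persisted in the table
    long[] blockIds;
    // Block locations -- NOT persisted; discovered from the BlockManager
    String[][] blockLocations; // one DataNode list per block

    public static void main(String[] args) {
        NamespaceRow r = new NamespaceRow();
        r.rowKey = "/user/data/file1";
        r.isDir = false;
        r.replication = 3;
        System.out.println(r.rowKey + " replication=" + r.replication);
    }
}
```

The key design point is the split between persisted fields (attributes, block list) and transient ones (block locations), which are rediscovered from DataNode reports.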
  • Namespace Service: [diagram] Each Region Server hosts regions along with an NS Processor and a BM Processor; requests flow (1) through the regions of the HBase Namespace Service and (2) down to the Block Management Layer.
  • Block Manager: Maintains a flat namespace of Giraffa block metadata. (1) Block management: block allocation, deletion, replication. (2) DataNode management: process DataNode block reports and heartbeats; identify lost nodes. (3) Storage for the HBase table: a small file system to store HFiles and the HLog. A BM Server is paired on the same node with a RegionServer, forming a distributed cluster of BMServers with mostly local communication between Region and BM Servers. The NameNode serves as an initial implementation of the BMServer.
  • Data Management: DataNodes store and report data blocks; blocks are files on local drives. Data transfers to and from clients, and internal data transfers, are the same as in HDFS.
  • Row Key Design: Row keys identify files and directories as rows in the table and define the sorting of rows in the Namespace table – and therefore the namespace partitioning. Different row key definitions are possible depending on locality requirements; the key definition is chosen during file system formatting. Full-path-key is the default implementation. Problem: a rename can move an object to another region. Row keys based on INode numbers are an alternative.
  • Locality of Reference: Files in the same directory are adjacent in the table and belong to the same region (most of the time), making "ls" efficient by avoiding jumps across regions. Row keys define the sorting of files and directories in the table: the tree-structured namespace is flattened into a linear array, and the ordered list of files is self-partitioned into regions. The challenge is how to retain tree locality in the linearized structure.
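A minimal sketch of the full-path key idea, again with a sorted `TreeMap` standing in for the table; the paths are invented, and a real scheme would also need to handle nested subtrees, which this range trick does not separate out:

```java
import java.util.Set;
import java.util.TreeMap;

// Sketch of the default full-path row key scheme: with full paths as keys,
// entries of the same directory sort next to each other, so "ls" becomes
// one contiguous range scan. Paths are made up for illustration.
public class FullPathKeys {
    public static void main(String[] args) {
        TreeMap<String, String> ns = new TreeMap<>();
        ns.put("/logs/a.log", "file");
        ns.put("/logs/b.log", "file");
        ns.put("/tmp/x", "file");

        // "ls /logs": scan keys in ["/logs/", "/logs0") -- '0' is the
        // character right after '/', bounding the directory's key range.
        Set<String> listing = ns.subMap("/logs/", "/logs0").keySet();
        System.out.println(listing); // [/logs/a.log, /logs/b.log]
    }
}
```

This also shows why rename is the hard case: moving "/logs/a.log" to "/tmp/a.log" changes its key, and hence possibly its region.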
  • Partitioning: Random: Straightforward partitioning based on random hashing. [diagram]
  • Partitioning: Full Subtrees: Partitioning based on lexicographic full-path ordering – the default for Giraffa. [diagram]
  • Partitioning: Fixed Neighborhood: Partitioning based on fixed-depth neighborhoods. [diagram]
  • Atomic Rename: Giraffa will implement atomic in-place rename, with no support for atomic file moves from one directory to another; atomic moves would require inode numbers as unique file IDs. A move can instead be implemented at the application level: non-atomically move the file from the source directory to a temporary file in the target directory, then atomically rename the temporary file to its original name; on failure, use a simple 3-step recovery procedure. Eventually atomic moves could be implemented via PAXOS or simplified synchronization algorithms (ZAB).
  • 3-Step Recovery Procedure: Suppose the move of a file from srcDir to trgDir failed. (1) If only the source file exists, start the move over. (2) If only the target temporary file exists, complete the move by renaming the temporary file to the original name. (3) If both the source and the temporary target file exist, remove the source and rename the temporary file. Step 3 is non-atomic and may fail as well; in case of failure, repeat the recovery procedure.
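The three cases above can be simulated against a toy namespace. This is not Giraffa code: the namespace is just a set of paths, and the path names are invented; it only illustrates the case analysis from the slide.

```java
import java.util.HashSet;
import java.util.Set;

// Toy simulation of the slide's 3-step recovery after a failed move.
// The "namespace" is a plain set of paths; everything is illustrative.
public class MoveRecovery {
    // Recovery for a move of src -> dst that staged data at tmp (in dst's directory).
    static void recover(Set<String> ns, String src, String tmp, String dst) {
        boolean s = ns.contains(src), t = ns.contains(tmp);
        if (s && t) {                     // step 3: both exist -> drop source, finish rename
            ns.remove(src);
            ns.remove(tmp); ns.add(dst);
        } else if (t) {                   // step 2: only temp -> finish the rename
            ns.remove(tmp); ns.add(dst);
        } else if (s) {                   // step 1: only source -> start the move over
            ns.add(tmp);                  // stage the temporary copy again
            ns.remove(src);               // non-atomic removal of the source
            ns.remove(tmp); ns.add(dst);  // atomic rename of temp to the final name
        }
    }

    public static void main(String[] args) {
        Set<String> ns = new HashSet<>();
        // Simulated crash mid-move: temp created, source not yet removed.
        ns.add("/srcDir/f"); ns.add("/trgDir/.f.tmp");
        recover(ns, "/srcDir/f", "/trgDir/.f.tmp", "/trgDir/f");
        System.out.println(ns); // [/trgDir/f]
    }
}
```

Whichever intermediate state the crash left behind, recovery converges to exactly one file at the destination, which is why the procedure can simply be repeated if step 3 fails partway.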
  • New Giraffa Functionality: Custom file attributes – user-defined file metadata, which today is hidden in complex file names or nested directories (e.g. /logs/2012/08/31/server-ip.log) or stored in ZooKeeper or even stand-alone DBs, which involves synchronization; enables advanced scanning, grouping, and filtering. An Amazon S3 API would turn Giraffa into reliable storage on the cloud. Versioning, based on HBase row versioning: restore objects deleted inadvertently – an alternative approach to snapshots.
  • Status: We are on Apache Extras. A one-node cluster is running. Row Key abstraction: the HBase implementation is in a separate package, so other DBs or key-value stores can be plugged in. Infrastructure: Eclipse, FindBugs, JavaDoc, Ivy, Jenkins, Wiki. Server-side processing of FS requests via HBase endpoints. Testing Giraffa with TestHDFSCLI. Next: Web UI, multi-node cluster, release…
  • Thank You!
  • Related Work: Ceph – metadata stored on OSDs; MDSs cache metadata with dynamic partitioning. Lustre – plans to release a distributed namespace in 2.4; code ready. Colossus from Google (S. Quinlan and J. Dean) – 100 million files per metadata server, hundreds of servers. VoldFS, CassandraFS, KTHFS (MySQL) – prototypes. MapR distributed file system.
  • History: (2008) Idea; study of distributed systems – AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …; partitioning of the namespace: 4 types of partitioning. (2009) Study on scalability limits; NameNode optimization. (2010) Design with Michael Stack; presentation at the HDFS contributors meeting. (2011) Plamen implements a proof of concept. (2012) Rewrite open sourced as an Apache Extras project.
  • Etymology: Giraffe. Latin: Giraffa camelopardalis – family Giraffidae, genus Giraffa, species Giraffa camelopardalis. Other languages: Arabic zarafa, Spanish jirafa, Bulgarian жирафа, Italian giraffa. A favorite of my daughter – as the Hadoop traditions require.