HDFS scalability and availability are limited by its single-namespace-server design. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from HDFS DataNodes. Giraffa aims to provide higher scalability and availability and to maintain very large namespaces. The presentation explains the Giraffa architecture and motivation, addresses its main challenges, and gives an update on the status of the project.
Presenter: Konstantin Shvachko (PhD), Founder, AltoScale
Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger
1. The Giraffa File System
Konstantin V. Shvachko
Alto Storage Technologies
September 19, 2012 Hadoop User Group
AltoStor
2. Giraffa
Giraffa is a distributed,
highly available file system
Utilizes features of
HDFS and HBase
New open source project
in experimental stage
3. Apache Hadoop
A reliable, scalable, high performance distributed
storage and computing system
The Hadoop Distributed File System (HDFS)
Reliable storage layer
MapReduce – distributed computation framework
Simple computational model
Ecosystem of Big Data tools
HBase, Zookeeper
4. The Design Principles
Linear scalability
More nodes can do more work within the same time
On Data size and Compute resources
Reliability and Availability
1 drive fails in 3 years. Probability of failing today 1/1000.
Several drives fail every day on a cluster with thousands of drives
Move computation to data
Minimize expensive data transfers
Sequential data processing
Avoid random reads. [Use HBase for random data access]
5. Hadoop Cluster
HDFS – a distributed file system
NameNode – namespace and block management
DataNodes – block replica container
MapReduce – a framework for distributed computations
JobTracker – job scheduling, resource management, lifecycle
coordination
TaskTracker – task execution module
[Diagram: NameNode and JobTracker as masters; each worker node runs a TaskTracker and a DataNode]
6. Hadoop Distributed File System
The namespace is a hierarchy of files and directories
Files are divided into large blocks (128 MB)
Namespace (metadata) is decoupled from data
Fast namespace operations, not slowed down by data streaming
Direct data streaming from the source storage
Single NameNode keeps entire namespace in RAM
DataNodes store block replicas as files on local drives
Blocks replicated on 3 DataNodes for redundancy & availability
HDFS client – point of entry to HDFS
Contacts NameNode for metadata
Serves data to applications directly from DataNodes
7. Scalability Limits
Single-master architecture: a constraining resource
Single NameNode limits linear performance growth
A handful of “bad” clients can saturate NameNode
Single point of failure: takes whole cluster out of service
NameNode space limit
100 million files and 200 million blocks with 64GB RAM
Restricts storage capacity to 20 PB
Small file problem: block-to-file ratio is shrinking
“HDFS Scalability: The limits to growth” USENIX ;login: 2010
8. Node Count Visualization
[Chart: cluster size (number of nodes) vs. resources per node (cores, disks, RAM): Yahoo! 4000-node cluster (2008), Facebook 2000 nodes (2010), eBay 1000 nodes (2011), a projected 500-node cluster (2013)]
9. Horizontal to Vertical Scaling
Horizontal scaling is limited by single-master architecture
Natural growth of compute power and storage density
Clusters composed of more dense & powerful servers
Vertical scaling leads to cluster size shrinking
Storage capacity, Compute power, and Cost remain constant
Exponential Information Growth
2006 Chevron accumulates 2 TB a day
2012 Facebook ingests 500 TB a day
10. Scalability for Hadoop 2.0
HDFS Federation
Independent NameNodes sharing a common pool of DataNodes
Cluster is a family of volumes with shared block storage layer
User sees volumes as isolated file systems
ViewFS: the client-side mount table
YARN: New MapReduce framework
Dynamic partitioning of cluster resources: no fixed slots
Separation of JobTracker functions
1. Job scheduling and resource allocation: centralized
2. Job monitoring and job life-cycle coordination: decentralized
o Delegate coordination of different jobs to other nodes
11. Namespace Partitioning
Static: Federation
Directory sub-trees are statically assigned to disjoint volumes
Relocating sub-trees without copying is challenging
Scale x10: billions of files
Dynamic:
Files and directory sub-trees can move automatically between nodes based on their utilization or load balancing requirements
Files can be relocated without copying data blocks
Scale x100: 100s of billions of files
Orthogonal, independent approaches.
Federation of distributed namespaces is possible
12. Giraffa File System
HDFS + HBase = Giraffa
Goal: build from existing building blocks
Minimize changes to existing components
1. Store file & directory metadata in HBase table
Dynamic table partitioning into regions
Cached in RegionServer RAM for fast access
2. Store file data in HDFS DataNodes: data streaming
3. Block management
Handle communication with DataNodes:
heartbeat, blockReport, addBlock
Perform block allocation, replication, and deletion
13. Giraffa Requirements
Availability – the primary goal
Load balancing of metadata traffic
Same data streaming speed to / from DataNodes
Continuous Availability: No SPOF
Cluster operability, management
Cost of running larger clusters same as a smaller one
More files & more data
                     HDFS          Federated HDFS   Giraffa
Space                25 PB         120 PB           1 EB = 1000 PB
Files + blocks       200 million   1 billion        100 billion
Concurrent clients   40,000        100,000          1 million
14. HBase Overview
Table: big, sparse, loosely structured
Collection of rows, sorted by row keys
Rows can have arbitrary number of columns
Dynamic Table partitioning!
Table is split Horizontally into Regions
Region Servers serve regions to applications
Columns grouped into Column families: vertical partition of tables
Distributed Cache:
Regions are loaded in nodes’ RAM
Real-time access to data
16. HBase API
HBaseAdmin: administrative functions
Create, delete, list tables
Create, update, delete columns, column families
Split, compact, flush
HTable: access table data
Result HTable.get(Get g) // get cells of a row
void HTable.put(Put p) // update a row
void HTable.delete(Delete d) // delete cells/row
ResultScanner HTable.getScanner(family) // scan a column family
A variety of Filters
Coprocessors:
Custom actions triggered by update events
Like database triggers or stored procedures
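The sorted-row get/put/scan semantics behind the HTable calls above can be mimicked with a plain sorted map. The toy class below is not the real HBase client API, just an illustration of the table model the slides rely on:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase table semantics: rows sorted by key, each row a
// map of column -> value. Not the real HBase client, only an illustration.
public class ToyTable {
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    public String get(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Scan rows in [startKey, stopKey), as an HBase Scan would.
    public NavigableMap<String, NavigableMap<String, String>> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, true, stopKey, false);
    }

    public static void main(String[] args) {
        ToyTable t = new ToyTable();
        t.put("/a/file1", "attrs:owner", "alice");
        t.put("/a/file2", "attrs:owner", "bob");
        t.put("/b/file3", "attrs:owner", "carol");
        System.out.println(t.scan("/a/", "/b/").keySet()); // rows under /a/
    }
}
```

Because rows are globally sorted, a range scan touches only a contiguous slice of keys; this is the property Giraffa's namespace partitioning exploits.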
17. Building Blocks
Giraffa clients
Fetch file & block metadata from Namespace Service
Exchange data with DataNodes
Namespace Service
HBase Table stores File metadata as rows
Block Management
Distributed collection of Giraffa block metadata
Data Management
DataNodes. Distributed collection of data blocks
18. Giraffa Architecture
[Architecture diagram: an application talks to a Giraffa client (NamespaceAgent); the Namespace Service is an HBase Namespace Table (path, attrs, block[], DN[][]) with a Block Management Processor; below it sit the Block Management Layer (BM servers) and the DataNodes]
1. Giraffa client gets files and blocks from HBase
2. Block Manager handles block operations
3. Client streams data to or from DataNodes
20. Namespace Table
Single Table called “Namespace” stores
Row Key = File ID
File attributes:
o Local name, owner, group, permissions, access-time,
modification-time, block-size, replication, isDir, length
List of blocks of a file
o Persisted in the table
List of block locations for each block
o Not persisted, but discovered from the BlockManager
Directory table
o maps directory entry name to respective child row key
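The row layout above can be sketched as follows (hypothetical names, not Giraffa's actual classes): attributes and the block list are persisted in the row, while block locations are resolved through a stand-in for the BlockManager lookup:

```java
import java.util.List;
import java.util.Map;

// Sketch of what one Namespace-table row might hold. Attributes and the
// block list are persisted; block locations are discovered at read time.
public class FileRowDemo {
    record FileRow(String localName, String owner, boolean isDir,
                   long blockSize, short replication, List<String> blockIds) {}

    // Hypothetical stand-in for the BlockManager location lookup.
    static List<String> locate(String blockId, Map<String, List<String>> blockManager) {
        return blockManager.getOrDefault(blockId, List.of());
    }

    public static void main(String[] args) {
        FileRow row = new FileRow("data.log", "hdfs", false, 128L << 20, (short) 3,
                List.of("blk_1", "blk_2"));
        Map<String, List<String>> bm = Map.of("blk_1", List.of("dn1", "dn2", "dn3"));
        System.out.println(locate(row.blockIds().get(0), bm)); // locations of blk_1
    }
}
```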
21. Namespace Service
[Diagram: the HBase-based Namespace Service. Region Servers host Regions; each Region runs (1) an NS Processor for namespace requests and (2) a BM Processor that connects it to the Block Management Layer]
22. Block Manager
Maintains flat namespace of Giraffa block metadata
1. Block management
Block allocation, deletion, replication
2. DataNode management
Process DataNode block reports, heartbeats. Identify lost nodes
3. Storage for the HBase table
Small file system to store HFiles and the HLog
BM Server paired on the same node with RegionServer
Distributed cluster of BM Servers
Mostly local communication between Region and BM Servers
NameNode as an initial implementation of BMServer
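One of the Block Manager duties above can be sketched with a toy class (again, not Giraffa's code): fold DataNode block reports into a block-to-locations map and flag blocks below their replication target:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch of block-report processing and under-replication detection.
public class ToyBlockManager {
    private final Map<String, Set<String>> locations = new HashMap<>();

    // A DataNode block report: which blocks this node currently holds.
    public void blockReport(String dataNode, Collection<String> blockIds) {
        for (String b : blockIds)
            locations.computeIfAbsent(b, k -> new HashSet<>()).add(dataNode);
    }

    // Blocks whose replica count is below the target, candidates for re-replication.
    public List<String> underReplicated(int target) {
        List<String> out = new ArrayList<>();
        for (var e : locations.entrySet())
            if (e.getValue().size() < target) out.add(e.getKey());
        return out;
    }

    public static void main(String[] args) {
        ToyBlockManager bm = new ToyBlockManager();
        bm.blockReport("dn1", List.of("blk_1", "blk_2"));
        bm.blockReport("dn2", List.of("blk_1"));
        bm.blockReport("dn3", List.of("blk_1"));
        System.out.println(bm.underReplicated(3)); // blk_2 has only 1 replica
    }
}
```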
23. Data Management
DataNodes store and report data blocks
Blocks are files on local drives
Data transfer to and from clients
Internal data transfers
Same as HDFS
24. Row Key Design
Row keys
Identify files and directories as rows in the table
Define sorting of rows in Namespace table
And therefore Namespace partitioning
Different row key definitions based on locality requirements
Key definition is chosen during file system formatting
Full-path-key is the default implementation
Problem: Rename can move object to another region
Row keys based on INode numbers
25. Locality of Reference
Files in the same directory – adjacent in the table
Belong to the same region (most of the time)
Efficient “ls”. Avoid jumping across regions
Row keys define sorting of files and directories in the table
Tree structured namespace is flattened into linear array
Ordered list of files is self-partitioned into regions
How to retain tree locality in the linearized structure?
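A minimal illustration of the full-path row key, assuming plain paths as keys: lexicographic sorting keeps a directory's entries adjacent, so "ls" becomes a short range scan instead of jumps across regions:

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Full-path keys: a directory's children form one contiguous key range.
// A simplification for illustration; Giraffa's actual key encoding may differ.
public class FullPathKeys {
    public static TreeSet<String> namespace() {
        TreeSet<String> keys = new TreeSet<>();
        keys.add("/user/alice/a.txt");
        keys.add("/user/alice/b.txt");
        keys.add("/user/bob/c.txt");
        keys.add("/tmp/x");
        return keys;
    }

    // "ls dir": one range scan over [dir + "/", dir + "0").
    // '0' is the first character after '/' in ASCII, so the range covers
    // exactly the keys with prefix dir + "/".
    public static SortedSet<String> ls(TreeSet<String> ns, String dir) {
        return ns.subSet(dir + "/", dir + "0");
    }

    public static void main(String[] args) {
        System.out.println(ls(namespace(), "/user/alice"));
    }
}
```

The same sorted order is what makes the namespace self-partition into regions along subtree boundaries.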
26. Partitioning: Random
Straightforward partitioning based on random hashing
[Diagram: namespace tree nodes are hashed to regions T1–T4, so entries of one directory scatter across regions]
27. Partitioning: Full Subtrees
Partitioning based on lexicographic full-path ordering
The default for Giraffa
[Diagram: the same tree partitioned into regions T1–T4 in full-path order, so each region holds complete subtrees]
29. Atomic Rename
Giraffa will implement atomic in-place rename
No support for atomic file move from one directory to another
Requires inode numbers as unique file IDs
A move can then be implemented on application level
Non-atomically move the file from the source directory to a temporary file in the target directory
Atomically rename the temporary file to its original name
On failure use simple 3-step recovery procedure
Eventually implement atomic moves
PAXOS
Simplified synchronization algorithms (ZAB)
30. 3-Step Recovery Procedure
Move of a file from srcDir to trgDir failed
1. If only the source file exists, then start the move over
2. If only the target temporary file exists, then complete the move by renaming the temporary file to the original name
3. If both the source and the temporary target file exist, then remove the source and rename the temporary file
This step is non-atomic and may fail as well.
In case of failure repeat the recovery procedure
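The recovery steps above can be simulated on an in-memory set of paths; this is a sketch of the decision logic, not the actual implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the 3-step recovery for an interrupted application-level move.
// The state after a crash is which of the two paths still exist.
public class MoveRecovery {
    static void recover(Set<String> files, String src, String tmpTarget, String finalTarget) {
        boolean srcExists = files.contains(src);
        boolean tmpExists = files.contains(tmpTarget);
        if (srcExists && !tmpExists) {       // step 1: only source left, restart the move
            files.remove(src);
            files.add(tmpTarget);
        } else if (srcExists) {              // step 3: both exist, drop the source first
            files.remove(src);
        }
        if (files.contains(tmpTarget)) {     // step 2: finish with the atomic rename
            files.remove(tmpTarget);
            files.add(finalTarget);
        }
    }

    public static void main(String[] args) {
        Set<String> fs = new HashSet<>(Set.of("/src/f")); // crashed before the copy
        recover(fs, "/src/f", "/dst/.f.tmp", "/dst/f");
        System.out.println(fs); // the file ends up at /dst/f
    }
}
```

Whichever combination of paths survives the crash, re-running the procedure converges on the file existing only at its final target, which is why a failure during step 3 is handled by simply repeating recovery.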
31. New Giraffa Functionality
Custom file attributes: user-defined file metadata
Today hidden in complex file names or nested directories
o /logs/2012/08/31/server-ip.log
Or stored in ZooKeeper or even stand-alone DBs
o Involves synchronization
Advanced Scanning, Grouping, Filtering
Amazon S3 API turns Giraffa into reliable storage on the cloud
Versioning
Based on HBase row versioning
Restore objects deleted inadvertently
Alternative approach for snapshots
32. Status
We are on Apache Extras
One node cluster running
Row Key abstraction
HBase implementation in separate package
Other DBs or Key-Value stores can be plugged in
Infrastructure: Eclipse, Findbugs, JavaDoc, Ivy, Jenkins, Wiki
Server-side processing of FS requests: HBase endpoints
Testing Giraffa with TestHDFSCLI
Web UI. Multi-node cluster. Release…
34. Related Work
Ceph
Metadata stored on OSD
MDS cache metadata: Dynamic Partitioning
Lustre
Plans to release (2.4) distributed namespace
Code ready
Colossus: from Google (S. Quinlan and J. Dean)
100 million files per metadata server
Hundreds of servers
VoldFS, CassandraFS, KTHFS (MySQL)
Prototypes
MapR distributed file system
35. History
(2008) Idea. Study of distributed systems
AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
Partitioning of the namespace: 4 types of partitioning
(2009) Study on scalability limits
NameNode optimization
(2010) Design with Michael Stack
Presentation at HDFS contributors meeting
(2011) Plamen implements POC
(2012) Rewrite open sourced as Apache Extras project
http://code.google.com/a/apache-extras.org/p/giraffa/
36. Etymology
Giraffe. Latin: Giraffa camelopardalis
Family Giraffidae
Genus Giraffa
Species Giraffa camelopardalis
Other languages
Arabic Zarafa
Spanish Jirafa
Bulgarian жирафа
Italian Giraffa
Favorites of my daughter
o As the Hadoop traditions require