HDFS scalability and availability are limited by its single-namespace-server design. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from HDFS DataNodes. Giraffa aims to provide higher scalability and availability and to maintain very large namespaces. The presentation explains the Giraffa architecture and motivation, addresses its main challenges, and gives an update on the status of the project.
Presenter: Konstantin Shvachko (PhD), Founder, AltoScale
Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger
1. The Giraffa File System
Konstantin V. Shvachko
Alto Storage Technologies
September 19, 2012 Hadoop User Group
AltoStor
2. Giraffa
Giraffa is a distributed,
highly available file system
Utilizes features of
HDFS and HBase
New open source project
in experimental stage
3. Apache Hadoop
A reliable, scalable, high performance distributed
storage and computing system
The Hadoop Distributed File System (HDFS)
Reliable storage layer
MapReduce – distributed computation framework
Simple computational model
Ecosystem of Big Data tools
HBase, Zookeeper
4. The Design Principles
Linear scalability
More nodes can do more work within the same time
On Data size and Compute resources
Reliability and Availability
1 drive fails in 3 years. Probability of failing today 1/1000.
Several drives fail every day on a cluster with thousands of drives
Move computation to data
Minimize expensive data transfers
Sequential data processing
Avoid random reads. [Use HBase for random data access]
5. Hadoop Cluster
HDFS – a distributed file system
NameNode – namespace and block management
DataNodes – block replica container
MapReduce – a framework for distributed computations
JobTracker – job scheduling, resource management, lifecycle
coordination
TaskTracker – task execution module
[Diagram: NameNode and JobTracker as masters; each worker node runs a TaskTracker and a DataNode]
6. Hadoop Distributed File System
The namespace is a hierarchy of files and directories
Files are divided into large blocks (128 MB)
Namespace (metadata) is decoupled from data
Fast namespace operations, not slowed down by data streaming
Direct data streaming from the source storage
Single NameNode keeps entire namespace in RAM
DataNodes store block replicas as files on local drives
Blocks replicated on 3 DataNodes for redundancy & availability
HDFS client – point of entry to HDFS
Contacts NameNode for metadata
Serves data to applications directly from DataNodes
7. Scalability Limits
Single-master architecture: a constraining resource
Single NameNode limits linear performance growth
A handful of “bad” clients can saturate NameNode
Single point of failure: takes whole cluster out of service
NameNode space limit
100 million files and 200 million blocks with 64GB RAM
Restricts storage capacity to 20 PB
Small file problem: block-to-file ratio is shrinking
“HDFS Scalability: The limits to growth” USENIX ;login: 2010
8. Node Count Visualization
[Chart: cluster size (number of nodes) vs. resources per node (cores, disks, RAM): Yahoo! 4000-node cluster (2008), Facebook 2000 nodes (2010), eBay 1000 nodes (2011), a projected 500-node cluster (2013)]
9. Horizontal to Vertical Scaling
Horizontal scaling is limited by single-master architecture
Natural growth of compute power and storage density
Clusters composed of more dense & powerful servers
Vertical scaling leads to cluster size shrinking
Storage capacity, Compute power, and Cost remain constant
Exponential Information Growth
2006 Chevron accumulates 2 TB a day
2012 Facebook ingests 500 TB a day
10. Scalability for Hadoop 2.0
HDFS Federation
Independent NameNodes sharing a common pool of DataNodes
Cluster is a family of volumes with shared block storage layer
User sees volumes as isolated file systems
ViewFS: the client-side mount table
YARN: New MapReduce framework
Dynamic partitioning of cluster resources: no fixed slots
Separation of JobTracker functions
1. Job scheduling and resource allocation: centralized
2. Job monitoring and job life-cycle coordination: decentralized
o Delegate coordination of different jobs to other nodes
11. Namespace Partitioning
Static: Federation
Directory sub-trees are statically assigned to disjoint volumes
Relocating sub-trees without copying is challenging
Scale x10: billions of files
Dynamic:
Files and directory sub-trees can move automatically between nodes based on their utilization or load balancing requirements
Files can be relocated without copying data blocks
Scale x100: 100s of billions of files
Orthogonal, independent approaches.
Federation of distributed namespaces is possible
12. Giraffa File System
HDFS + HBase = Giraffa
Goal: build from existing building blocks
Minimize changes to existing components
1. Store file & directory metadata in HBase table
Dynamic table partitioning into regions
Cached in RegionServer RAM for fast access
2. Store file data in HDFS DataNodes: data streaming
3. Block management
Handle communication with DataNodes:
heartbeat, blockReport, addBlock
Perform block allocation, replication, and deletion
13. Giraffa Requirements
Availability – the primary goal
Load balancing of metadata traffic
Same data streaming speed to / from DataNodes
Continuous Availability: No SPOF
Cluster operability, management
Cost of running larger clusters same as a smaller one
More files & more data
                     HDFS          Federated HDFS   Giraffa
Space                25 PB         120 PB           1 EB = 1000 PB
Files + blocks       200 million   1 billion        100 billion
Concurrent clients   40,000        100,000          1 million
14. HBase Overview
Table: big, sparse, loosely structured
Collection of rows, sorted by row keys
Rows can have arbitrary number of columns
Dynamic Table partitioning!
Table is split Horizontally into Regions
Region Servers serve regions to applications
Columns grouped into Column families: vertical partition of tables
Distributed Cache:
Regions are loaded in nodes’ RAM
Real-time access to data
16. HBase API
HBaseAdmin: administrative functions
Create, delete, list tables
Create, update, delete columns, column families
Split, compact, flush
HTable: access table data
Result HTable.get(Get g) // get cells of a row
void HTable.put(Put p) // update a row
void HTable.delete(Delete d) // delete cells/row
ResultScanner HTable.getScanner(family) // scan a column family
A variety of Filters
Coprocessors:
Custom actions triggered by update events
Like database triggers or stored procedures
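The sorted-row get/put/scan semantics behind the HTable calls above can be mimicked with a plain sorted map. The toy class below is not the real HBase client API, just an illustration of the table model the slides rely on:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase table semantics: rows sorted by key, each row a
// map of column -> value. Not the real HBase client, only an illustration.
public class ToyTable {
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    public String get(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Scan rows in [startKey, stopKey), as an HBase Scan would.
    public NavigableMap<String, NavigableMap<String, String>> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, true, stopKey, false);
    }

    public static void main(String[] args) {
        ToyTable t = new ToyTable();
        t.put("/a/file1", "attrs:owner", "alice");
        t.put("/a/file2", "attrs:owner", "bob");
        t.put("/b/file3", "attrs:owner", "carol");
        System.out.println(t.scan("/a/", "/b/").keySet()); // rows under /a/
    }
}
```

Because rows are globally sorted, a range scan touches only a contiguous slice of keys; this is the property Giraffa's namespace partitioning exploits.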
17. Building Blocks
Giraffa clients
Fetch file & block metadata from Namespace Service
Exchange data with DataNodes
Namespace Service
HBase Table stores File metadata as rows
Block Management
Distributed collection of Giraffa block metadata
Data Management
DataNodes. Distributed collection of data blocks
18. Giraffa Architecture
[Architecture diagram: an application talks to a Giraffa client (NamespaceAgent); the Namespace Service is an HBase Namespace Table (path, attrs, block[], DN[][]) with a Block Management Processor; below it sit the Block Management Layer (BM servers) and the DataNodes]
1. Giraffa client gets files and blocks from HBase
2. Block Manager handles block operations
3. Client streams data to or from DataNodes
20. Namespace Table
Single Table called “Namespace” stores
Row Key = File ID
File attributes:
o Local name, owner, group, permissions, access-time,
modification-time, block-size, replication, isDir, length
List of blocks of a file
o Persisted in the table
List of block locations for each block
o Not persisted, but discovered from the BlockManager
Directory table
o maps directory entry name to respective child row key
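The row layout above can be sketched as follows (hypothetical names, not Giraffa's actual classes): attributes and the block list are persisted in the row, while block locations are resolved through a stand-in for the BlockManager lookup:

```java
import java.util.List;
import java.util.Map;

// Sketch of what one Namespace-table row might hold. Attributes and the
// block list are persisted; block locations are discovered at read time.
public class FileRowDemo {
    record FileRow(String localName, String owner, boolean isDir,
                   long blockSize, short replication, List<String> blockIds) {}

    // Hypothetical stand-in for the BlockManager location lookup.
    static List<String> locate(String blockId, Map<String, List<String>> blockManager) {
        return blockManager.getOrDefault(blockId, List.of());
    }

    public static void main(String[] args) {
        FileRow row = new FileRow("data.log", "hdfs", false, 128L << 20, (short) 3,
                List.of("blk_1", "blk_2"));
        Map<String, List<String>> bm = Map.of("blk_1", List.of("dn1", "dn2", "dn3"));
        System.out.println(locate(row.blockIds().get(0), bm)); // locations of blk_1
    }
}
```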
21. Namespace Service
[Diagram: the HBase-based Namespace Service. Region Servers host Regions; each Region runs (1) an NS Processor for namespace requests and (2) a BM Processor that connects it to the Block Management Layer]
22. Block Manager
Maintains flat namespace of Giraffa block metadata
1. Block management
Block allocation, deletion, replication
2. DataNode management
Process DataNode block reports, heartbeats. Identify lost nodes
3. Storage for the HBase table
Small file system to store HFiles and the HLog
BM Server paired on the same node with RegionServer
Distributed cluster of BM Servers
Mostly local communication between Region and BM Servers
NameNode as an initial implementation of BMServer
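One of the Block Manager duties above can be sketched with a toy class (again, not Giraffa's code): fold DataNode block reports into a block-to-locations map and flag blocks below their replication target:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch of block-report processing and under-replication detection.
public class ToyBlockManager {
    private final Map<String, Set<String>> locations = new HashMap<>();

    // A DataNode block report: which blocks this node currently holds.
    public void blockReport(String dataNode, Collection<String> blockIds) {
        for (String b : blockIds)
            locations.computeIfAbsent(b, k -> new HashSet<>()).add(dataNode);
    }

    // Blocks whose replica count is below the target, candidates for re-replication.
    public List<String> underReplicated(int target) {
        List<String> out = new ArrayList<>();
        for (var e : locations.entrySet())
            if (e.getValue().size() < target) out.add(e.getKey());
        return out;
    }

    public static void main(String[] args) {
        ToyBlockManager bm = new ToyBlockManager();
        bm.blockReport("dn1", List.of("blk_1", "blk_2"));
        bm.blockReport("dn2", List.of("blk_1"));
        bm.blockReport("dn3", List.of("blk_1"));
        System.out.println(bm.underReplicated(3)); // blk_2 has only 1 replica
    }
}
```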
23. Data Management
DataNodes store and report data blocks
Blocks are files on local drives
Data transfer to and from clients
Internal data transfers
Same as HDFS
24. Row Key Design
Row keys
Identify files and directories as rows in the table
Define sorting of rows in Namespace table
And therefore Namespace partitioning
Different row key definitions based on locality requirements
Key definition is chosen during file system formatting
Full-path-key is the default implementation
Problem: Rename can move object to another region
Row keys based on INode numbers
25. Locality of Reference
Files in the same directory – adjacent in the table
Belong to the same region (most of the time)
Efficient “ls”. Avoid jumping across regions
Row keys define sorting of files and directories in the table
Tree structured namespace is flattened into linear array
Ordered list of files is self-partitioned into regions
How to retain tree locality in the linearized structure?
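A minimal illustration of the full-path row key, assuming plain paths as keys: lexicographic sorting keeps a directory's entries adjacent, so "ls" becomes a short range scan instead of jumps across regions:

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Full-path keys: a directory's children form one contiguous key range.
// A simplification for illustration; Giraffa's actual key encoding may differ.
public class FullPathKeys {
    public static TreeSet<String> namespace() {
        TreeSet<String> keys = new TreeSet<>();
        keys.add("/user/alice/a.txt");
        keys.add("/user/alice/b.txt");
        keys.add("/user/bob/c.txt");
        keys.add("/tmp/x");
        return keys;
    }

    // "ls dir": one range scan over [dir + "/", dir + "0").
    // '0' is the first character after '/' in ASCII, so the range covers
    // exactly the keys with prefix dir + "/".
    public static SortedSet<String> ls(TreeSet<String> ns, String dir) {
        return ns.subSet(dir + "/", dir + "0");
    }

    public static void main(String[] args) {
        System.out.println(ls(namespace(), "/user/alice"));
    }
}
```

The same sorted order is what makes the namespace self-partition into regions along subtree boundaries.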
26. Partitioning: Random
Straightforward partitioning based on random hashing
[Diagram: namespace tree nodes are hashed to regions T1–T4, so entries of one directory scatter across regions]
27. Partitioning: Full Subtrees
Partitioning based on lexicographic full-path ordering
The default for Giraffa
[Diagram: the same tree partitioned into regions T1–T4 in full-path order, so each region holds complete subtrees]
29. Atomic Rename
Giraffa will implement atomic in-place rename
No support for atomic file move from one directory to another
Requires inode numbers as unique file IDs
A move can then be implemented on application level
Non-atomically move the file from the source directory to a temporary file in the target directory
Atomically rename the temporary file to its original name
On failure use simple 3-step recovery procedure
Eventually implement atomic moves
PAXOS
Simplified synchronization algorithms (ZAB)
30. 3-Step Recovery Procedure
Move of a file from srcDir to trgDir failed
1. If only the source file exists, then start the move over
2. If only the target temporary file exists, then complete the move by renaming the temporary file to the original name
3. If both the source and the temporary target file exist, then remove the source and rename the temporary file
This step is non-atomic and may fail as well.
In case of failure repeat the recovery procedure
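The recovery steps above can be simulated on an in-memory set of paths; this is a sketch of the decision logic, not the actual implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the 3-step recovery for an interrupted application-level move.
// The state after a crash is which of the two paths still exist.
public class MoveRecovery {
    static void recover(Set<String> files, String src, String tmpTarget, String finalTarget) {
        boolean srcExists = files.contains(src);
        boolean tmpExists = files.contains(tmpTarget);
        if (srcExists && !tmpExists) {       // step 1: only source left, restart the move
            files.remove(src);
            files.add(tmpTarget);
        } else if (srcExists) {              // step 3: both exist, drop the source first
            files.remove(src);
        }
        if (files.contains(tmpTarget)) {     // step 2: finish with the atomic rename
            files.remove(tmpTarget);
            files.add(finalTarget);
        }
    }

    public static void main(String[] args) {
        Set<String> fs = new HashSet<>(Set.of("/src/f")); // crashed before the copy
        recover(fs, "/src/f", "/dst/.f.tmp", "/dst/f");
        System.out.println(fs); // the file ends up at /dst/f
    }
}
```

Whichever combination of paths survives the crash, re-running the procedure converges on the file existing only at its final target, which is why a failure during step 3 is handled by simply repeating recovery.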
31. New Giraffa Functionality
Custom file attributes: user-defined file metadata
Today hidden in complex file names or nested directories
o /logs/2012/08/31/server-ip.log
Or stored in ZooKeeper or even stand-alone DBs
o Involves synchronization
Advanced Scanning, Grouping, Filtering
Amazon S3 API turns Giraffa into reliable storage on the cloud
Versioning
Based on HBase row versioning
Restore objects deleted inadvertently
Alternative approach for snapshots
32. Status
We are on Apache Extras
One node cluster running
Row Key abstraction
HBase implementation in separate package
Other DBs or Key-Value stores can be plugged in
Infrastructure: Eclipse, Findbugs, JavaDoc, Ivy, Jenkins, Wiki
Server-side processing of FS requests: HBase endpoints
Testing Giraffa with TestHDFSCLI
Web UI. Multi-node cluster. Release…
34. Related Work
Ceph
Metadata stored on OSD
MDS cache metadata: Dynamic Partitioning
Lustre
Plans to release (2.4) distributed namespace
Code ready
Colossus: from Google (S. Quinlan and J. Dean)
100 million files per metadata server
Hundreds of servers
VoldFS, CassandraFS, KTHFS (MySQL)
Prototypes
MapR distributed file system
35. History
(2008) Idea. Study of distributed systems
AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
Partitioning of the namespace: 4 types of partitioning
(2009) Study on scalability limits
NameNode optimization
(2010) Design with Michael Stack
Presentation at HDFS contributors meeting
(2011) Plamen implements POC
(2012) Rewrite open sourced as Apache Extras project
http://code.google.com/a/apache-extras.org/p/giraffa/
36. Etymology
Giraffe. Latin: Giraffa camelopardalis
Family Giraffidae
Genus Giraffa
Species Giraffa camelopardalis
Other languages
Arabic Zarafa
Spanish Jirafa
Bulgarian жирафа
Italian Giraffa
Favorites of my daughter
o As the Hadoop traditions require