Dynamic Namespace Partitioning with the Giraffa File System
1. Dynamic Namespace Partitioning with The Giraffa File System
Konstantin V. Shvachko, Founder, AltoScale
Plamen Jeliazkov, UC San Diego
June 14, 2012, Hadoop Summit 2012
AltoScale
2. AltoScale Introduction
Plamen
Fresh grad from UCSD
Internship with the Hadoop Platform Team at eBay
Wrote the Giraffa prototype
Konstantin
Founder of AltoScale. Primary focus:
1. AltoScale Workbench
Hadoop & HBase clusters on a public or a private cloud
2. Giraffa
Apache Hadoop PMC
HDFS scalability
4. AltoScale Giraffa
Giraffa is a distributed, highly available file system
Utilizes features of HDFS and HBase
New open source project in an experimental stage
5. AltoScale Origin:
Giraffe. Latin: Giraffa camelopardalis
Family Giraffidae
Genus Giraffa
Species Giraffa camelopardalis
Other languages
Arabic: Zarafa
Spanish: Jirafa
Bulgarian: жирафа
Italian: Giraffa
A favorite of my daughter
As Hadoop tradition requires
6. AltoScale Apache Hadoop
A reliable, scalable, high-performance distributed computing system
The Hadoop Distributed File System (HDFS)
Reliable storage layer
MapReduce – distributed computation framework
Simple computational model
Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers.
7. AltoScale The Design Principles
Linear scalability
More nodes can do more work within the same time
On data size and compute resources
Reliability and Availability
A drive fails once in about 3 years, so the probability of it failing today is roughly 1/1000
On a cluster with thousands of drives, several drives fail every day
Move computation to data
Minimize expensive data transfers
Sequential data processing
Avoid random reads
9. AltoScale Hadoop Distributed File System
The namespace is a hierarchy of files and directories
Files are divided into large blocks (128 MB)
Namespace (metadata) is decoupled from data
Fast namespace operations, not slowed down by data streaming
Direct data streaming from the source storage
A single NameNode keeps the entire namespace in RAM
DataNodes store block replicas as files on local drives
Blocks are replicated on 3 DataNodes for redundancy & availability
HDFS client – the point of entry to HDFS
Contacts the NameNode for metadata
Serves data to applications directly from DataNodes
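To make the client flow above concrete, here is a minimal sketch of reading a file through the standard HDFS FileSystem API: metadata comes from the NameNode, data is streamed directly from DataNodes. The NameNode address and file path are placeholder assumptions.

```java
// Minimal sketch of the HDFS client flow described above.
// The fs.defaultFS value and the path are placeholders.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode
    FileSystem fs = FileSystem.get(conf);                         // point of entry to HDFS
    try (FSDataInputStream in = fs.open(new Path("/data/input.txt"))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) > 0) {                            // data streamed from DataNodes
        System.out.write(buf, 0, n);
      }
    }
  }
}
```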
10. AltoScale Scalability Limits
Single-master architecture: a constraining resource
NameNode space limit
100 million files and 200 million blocks fit in 64 GB of RAM
Restricts storage capacity to about 20 PB (roughly 200 million blocks at ~100 MB of data per block)
Small file problem: the block-to-file ratio is shrinking
A single NameNode limits linear performance growth
A handful of clients can saturate the NameNode
MapReduce framework scalability limit: 40,000 clients
Corresponds to a 4,000-node cluster with 10 MapReduce slots per node
"HDFS Scalability: The limits to growth", USENIX ;login:, 2010
11. AltoScale Horizontal to Vertical Scaling
Horizontal scaling is limited by the single-master architecture
Natural growth of compute power and storage density
Clusters composed of more powerful servers
Vertical scaling leads to shrinking cluster sizes
Storage capacity, compute power, and cost remain constant
12. AltoScale Shrinking Clusters
[Chart: resources per node (cores, disks, RAM) versus cluster size (number of nodes)]
2008 Yahoo!: 4000-node cluster
2010 Facebook: 2000 nodes
2011 eBay: 1000 nodes
2013: cluster of 500 nodes
13. AltoScale Scalability for Hadoop 2.0
HDFS Federation
Independent NameNodes sharing a common pool of DataNodes
The cluster is a family of volumes with a shared block storage layer
Users see volumes as isolated file systems
ViewFS: the client-side mount table (see the configuration sketch below)
Yarn: new MapReduce framework
Dynamic partitioning of cluster resources: no fixed slots
Separation of JobTracker functions
1. Job scheduling and resource allocation: centralized
2. Job monitoring and job life-cycle coordination: decentralized
Delegate coordination of different jobs to other nodes
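The ViewFS mount table mentioned above is configured on the client side. The sketch below shows one way it could look; the mount-table name ClusterX, the NameNode hosts, and the paths are assumptions for illustration.

```java
// Sketch of a client-side ViewFS mount table for a federated cluster.
// The mount-table name "ClusterX", NameNode hosts, and paths are placeholders.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "viewfs://ClusterX");
    // Each link maps a client-visible path to a volume owned by one NameNode.
    conf.set("fs.viewfs.mounttable.ClusterX.link./user",
             "hdfs://nn1.example.com:8020/user");
    conf.set("fs.viewfs.mounttable.ClusterX.link./data",
             "hdfs://nn2.example.com:8020/data");
    FileSystem viewFs = FileSystem.get(conf);
    // The client sees one namespace; /user and /data resolve to different volumes.
    System.out.println(viewFs.exists(new Path("/user")));
  }
}
```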
14. AltoScale Namespace Partitioning
Static: Federation
Directory sub-trees are statically assigned to disjoint volumes
Relocating sub-trees without copying is challenging
Scale x10: billions of files
Dynamic:
Files and directory sub-trees can move automatically between nodes based on their utilization or load balancing requirements
Files can be relocated without copying data blocks
Scale x100: hundreds of billions of files
Orthogonal, independent approaches.
Federation of distributed namespaces is possible
15. AltoScale Distributed Namespaces Today
Ceph
Metadata stored on OSDs
MDSs cache metadata: dynamic partitioning
Lustre
Plans to release a distributed namespace (2.4)
Code ready
Colossus from Google: S. Quinlan and J. Dean
100 million files per metadata server
Hundreds of servers
VoldFS, CassandraFS, KTHFS (MySQL)
Prototypes
16. AltoScale HBase Overview
Table: big, sparse, loosely structured
Collection of rows, sorted by row keys
Rows can have an arbitrary number of columns
The table is split horizontally into Regions
Dynamic table partitioning!
Region Servers serve regions to applications
Columns are grouped into column families
Vertical partitioning of tables
Distributed cache: regions are loaded into nodes' RAM
Real-time access to data
17. AltoScale HBase API
HBaseAdmin: administrative functions
Create, delete, list tables
Create, update, delete columns, column families
Split, compact, flush
HTable: access table data
Result HTable.get(Get g) // get cells of a row
void HTable.put(Put p) // update a row
void HTable.put(Put[] p) // batch update of rows
void HTable.delete(Delete d) // delete cells/row
ResultScanner getScanner(family) // scan col family
Coprocessors:
Custom actions triggered by update events
Like database triggers and stored procedures
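A small sketch of the HTable calls listed above, using the 2012-era (HBase 0.92-style) client API; the table name, column family, qualifier, and row key are placeholders.

```java
// Put a row, then read it back, using the HTable API shown on this slide.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HTableExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Namespace");          // placeholder table name
    // Update a row.
    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
    table.put(p);
    // Read the cells of the same row back.
    Result r = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"))));
    table.close();
  }
}
```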
19. AltoScale Giraffa File System
HDFS + HBase = Giraffa
Goal: build from existing building blocks
Minimize changes to existing components
1. Store file & directory metadata in an HBase table
Dynamic table partitioning into regions
Cached in RegionServer RAM for fast access
2. Store file data in HDFS DataNodes: data streaming
3. Block management
Handle communication with DataNodes
Perform block replication
20. AltoScale Giraffa Requirements
More files & more data
Availability
Load balancing of metadata traffic
Same data streaming speed to / from DataNodes
No SPOF
Cluster operability, management
Cost of running larger clusters same as smaller ones
                    HDFS          Federated HDFS    Giraffa
Space               25 PB         120 PB            1 EB = 1000 PB
Files + blocks      200 million   1 billion         100 billion
Concurrent clients  40,000        100,000           1 million
21. AltoScale FAQ: Why HDFS and HBase?
Building a new FS from scratch is really hard and takes years
HDFS: reliable, scalable block storage
Efficient data streaming
Automatic data recovery
HBase: a natural metadata service
Distributed cache …
Dynamic partitioning
Automatic metadata recovery
Same breed, should be "compatible"
HBase stores data in HDFS: same storage for data and metadata
22. AltoScale FAQ: Why not store whole files in HBase tables?
Defeats the main concept of distributed file systems:
Decoupling of data and metadata
Small files can be stored as rows
Row size is limited by region size
Large files must be split
Technically possible to split any information into rows
Log files: into events
Video files: into frames
Random bits: into 1K blobs with an offset as a row key
But this is a different level of abstraction
Requires data conversion
23. AltoScale FAQ: My Dataset is Only 1 PB. Do I Still Need Giraffa?
Availability
Distributed access to the namespace for many concurrent clients
Not bottlenecked by single-NameNode performance
"Small files"
The block-to-file ratio is decreasing: 2 -> 1.5 -> 1.2
No need to aggregate small files into large archives
24. AltoScale Building Blocks
A single table called "Namespace" stores
File ID (row key) and file attributes:
name, replication, block-size, permissions, times
List of blocks
Block locations
Giraffa client: a FileSystem implementation
Obtains metadata from HBase
Exchanges data with DataNodes
Block manager: maintains the flat namespace of blocks
Block allocation, replication, removal
DataNode management
Storage for the HBase table
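Because the Giraffa client is a FileSystem implementation, an application would use it like any other Hadoop file system. The sketch below only illustrates that idea: the grfa:// URI scheme, host, and paths are hypothetical assumptions, not the project's confirmed interface.

```java
// Hypothetical sketch: the Giraffa client implements Hadoop's FileSystem
// interface, so applications would use it through the standard API.
// The "grfa:///" URI and the path shown here are illustrative assumptions only.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GiraffaClientSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Metadata operations go to the HBase "Namespace" table behind the scenes;
    // file data is exchanged directly with DataNodes.
    FileSystem fs = FileSystem.get(java.net.URI.create("grfa:///"), conf);
    try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
      out.writeUTF("hello giraffa");
    }
    System.out.println(fs.getFileStatus(new Path("/tmp/hello.txt")).getLen());
  }
}
```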
25. AltoScale Giraffa Architecture
[Architecture diagram: the application's NamespaceAgent talks to the HBase Namespace table (path, attrs, block[], DN[][], BM-node) backed by a Block Management Agent; beneath it sits the Block Management Layer of BM servers and the DataNodes (DN)]
1. The Giraffa client gets files and blocks from HBase
2. It may directly query the Block Manager
3. It streams data to or from DataNodes
26. AltoScale Namespace Table
Row keys
Identify files and directories as rows in the table
Different key definitions based on locality requirements
The key definition is chosen when the file system is formatted
Full-path-key is the default
Columns
File attributes:
Local name, owner, group, permissions, access-time, modification-time, block-size, replication, isDir, length
List of blocks of a file
Persisted in the table
List of block locations for each block
Not persisted, but discovered from block reports
A directory table maps each dir-entry name to the corresponding row key
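As an illustration of the layout above, the sketch below writes one file's metadata as a row of the Namespace table using a full-path row key; the column family and qualifier names are assumptions, not the actual Giraffa schema.

```java
// Illustrative sketch of a Namespace-table row keyed by the file's full path.
// The column family and qualifier names are hypothetical.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NamespaceRowSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable namespace = new HTable(conf, "Namespace");
    byte[] rowKey = Bytes.toBytes("/user/plamen/file1");   // full-path-key (the default)
    byte[] cf = Bytes.toBytes("file");                     // hypothetical column family
    Put put = new Put(rowKey);
    put.add(cf, Bytes.toBytes("owner"), Bytes.toBytes("plamen"));
    put.add(cf, Bytes.toBytes("replication"), Bytes.toBytes((short) 3));
    put.add(cf, Bytes.toBytes("blockSize"), Bytes.toBytes(128L * 1024 * 1024));
    put.add(cf, Bytes.toBytes("isDir"), Bytes.toBytes(false));
    // The file's block list would be persisted here too; block locations are
    // not persisted, but rediscovered from DataNode block reports.
    namespace.put(put);
    namespace.close();
  }
}
```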
28. AltoScale Block Management
Block Manager
Block allocation, deletion, replication
DataNode Manager
Processes DataNode block reports and heartbeats; identifies lost nodes
Provides storage for the HBase table
A small file system to store HFiles
BMServer is paired on the same node with a RegionServer
A distributed cluster of BMServers
Mostly local communication between Region and BM servers
The NameNode is an initial implementation of BMServer
A Giraffa block is a single-block file whose name is the block id
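To summarize the responsibilities above, here is a hypothetical interface sketch; the method names and types are illustrative assumptions, not the actual Giraffa or NameNode API.

```java
// Hypothetical sketch of the block-management duties listed above
// (allocation, deletion, replication, DataNode tracking).
import java.io.IOException;
import java.util.List;

interface BlockManagerSketch {
  /** Allocate a new block for the file identified by its row key. */
  LocatedBlockInfo allocateBlock(String fileRowKey) throws IOException;

  /** Release the blocks of a deleted file. */
  void deleteBlocks(List<Long> blockIds) throws IOException;

  /** Process a DataNode block report and schedule re-replication
   *  of any under-replicated blocks. */
  void processBlockReport(String dataNodeId, List<Long> reportedBlockIds);

  /** Mark a DataNode lost after missed heartbeats. */
  void markDataNodeLost(String dataNodeId);
}

/** Minimal holder for a block id and the DataNodes that store it (illustrative). */
class LocatedBlockInfo {
  final long blockId;
  final List<String> dataNodes;
  LocatedBlockInfo(long blockId, List<String> dataNodes) {
    this.blockId = blockId;
    this.dataNodes = dataNodes;
  }
}
```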
29. AltoScale Three Problems
Bootstrapping
HBase stores tables as files in HDFS
Namespace Partitioning
Retain locality
Atomic Renames
31. AltoScale Locality of Reference
Row keys
Define the sorting of files and directories in the table
The tree-structured namespace is flattened into a linear array
The ordered list of files is self-partitioned into regions
Retain locality in the linearized structure
Files in the same directory are adjacent in the table
They belong to the same region, with some exclusions
Files of the same directory should be on the same node
Avoids jumping across regions for a simple "ls"
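A toy illustration of this ordering property, using a sorted map to stand in for the sorted table; the paths are made-up examples.

```java
// With full-path row keys, lexicographic sorting keeps files of the same
// directory adjacent, so a simple "ls" becomes a contiguous key-range scan.
import java.util.SortedMap;
import java.util.TreeMap;

public class FullPathKeyOrdering {
  public static void main(String[] args) {
    SortedMap<String, String> rows = new TreeMap<>();  // stands in for the sorted table
    rows.put("/user/a/file1", "...");
    rows.put("/user/a/file2", "...");
    rows.put("/user/b/file1", "...");
    rows.put("/var/log/syslog", "...");
    // "ls /user/a" scans the range ["/user/a/", "/user/a0"); '0' is the
    // character right after '/', so it upper-bounds all "/user/a/..." keys.
    for (String key : rows.subMap("/user/a/", "/user/a0").keySet()) {
      System.out.println(key);
    }
  }
}
```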
32. AltoScale Partitioning Example 1
Straightforward partitioning based on random hashing
[Diagram: an example directory tree whose entries are distributed across regions T1–T4 by hashed ids]
33. AltoScale Partitioning Example 2
Partitioning based on lexicographic full-path ordering
The default
[Diagram: the same directory tree distributed across regions T1–T4 in full-path order]
34. AltoScale Partitioning Example 3
Partitioning based on fixed-depth neighborhoods
[Diagram: the same directory tree distributed across regions T1–T4 by fixed-depth neighborhoods]
35. AltoScale Atomic Rename
Giraffa will implement atomic in-place rename
No support for an atomic file move from one directory to another
A move can then be implemented at the application level (see the sketch below)
Non-atomically move the target file from the source directory to a temporary file in the target directory
Atomically rename the temporary file to its original name
On failure, use a simple 3-step recovery procedure
Eventually implement atomic moves
PAXOS
Simplified synchronization algorithms
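A sketch of that application-level move, assuming only atomic in-place rename is available; FileUtil.copy stands in here for the non-atomic move step, and the failure-recovery procedure is omitted.

```java
// Application-level move for a file system that only guarantees an atomic
// in-place rename within a directory.
import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ApplicationLevelMove {
  public static boolean moveFile(FileSystem fs, Path src, Path dstDir)
      throws IOException {
    Path tmp = new Path(dstDir, src.getName() + ".tmp-" + UUID.randomUUID());
    // Step 1: non-atomically move the file into the target directory under a
    // temporary name (copy, then delete the source).
    FileUtil.copy(fs, src, fs, tmp, true /* deleteSource */, fs.getConf());
    // Step 2: atomically rename the temporary file to its original name
    // within the target directory.
    return fs.rename(tmp, new Path(dstDir, src.getName()));
  }
}
```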
36. AltoScale History
(2008) Idea. Study of distributed systems
AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
Partitioning of the namespace: 4 types of partitioning
(2009) Study of scalability limits
NameNode optimization
(2010) Design with Michael Stack
Presentation at the HDFS contributors meeting
(2011) Plamen implements the POC
(2012) Rewrite open sourced as an Apache Extras project
http://code.google.com/a/apache-extras.org/p/giraffa/
37. AltoScale Status
Design stage
One-node cluster running
Live demo with Plamen