Dynamic Namespace Partitioning with the Giraffa File System
1. Dynamic Namespace Partitioning with The Giraffa File System
Konstantin V. Shvachko, Founder, AltoScale
Plamen Jeliazkov, UC San Diego
June 14, 2012, Hadoop Summit 2012
AltoScale
2. AltoScale Introduction
Plamen
Fresh grad from UCSD
Internship with the Hadoop Platform Team at eBay
Wrote the Giraffa prototype
Konstantin
Founder of AltoScale. Primary focus:
1. AltoScale Workbench
Hadoop & HBase clusters on a public or a private cloud
2. Giraffa
Apache Hadoop PMC
HDFS scalability
4. AltoScale Giraffa
Giraffa is a distributed, highly available file system
Utilizes features of HDFS and HBase
New open source project in an experimental stage
5. AltoScale Origin:
Giraffe. Latin: Giraffa camelopardalis
Family Giraffidae
Genus Giraffa
Species Giraffa camelopardalis
Other languages
Arabic: Zarafa
Spanish: Jirafa
Bulgarian: жирафа
Italian: Giraffa
A favorite of my daughter
As Hadoop tradition requires
6. AltoScale Apache Hadoop
A reliable, scalable, high-performance distributed computing system
The Hadoop Distributed File System (HDFS)
Reliable storage layer
MapReduce – distributed computation framework
Simple computational model
Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers.
7. AltoScale The Design Principles
Linear scalability
More nodes can do more work within the same time
On data size and compute resources
Reliability and Availability
A drive fails once in about 3 years, so the probability of it failing today is roughly 1/1000
On a cluster with thousands of drives, several drives fail every day
Move computation to data
Minimize expensive data transfers
Sequential data processing
Avoid random reads
9. AltoScale Hadoop Distributed File System
The namespace is a hierarchy of files and directories
Files are divided into large blocks (128 MB)
Namespace (metadata) is decoupled from data
Fast namespace operations, not slowed down by data streaming
Direct data streaming from the source storage
A single NameNode keeps the entire namespace in RAM
DataNodes store block replicas as files on local drives
Blocks are replicated on 3 DataNodes for redundancy & availability
HDFS client – the point of entry to HDFS
Contacts the NameNode for metadata
Serves data to applications directly from DataNodes
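To make the client flow above concrete, here is a minimal sketch of reading a file through the standard HDFS FileSystem API: metadata comes from the NameNode, data is streamed directly from DataNodes. The NameNode address and file path are placeholder assumptions.

```java
// Minimal sketch of the HDFS client flow described above.
// The fs.defaultFS value and the path are placeholders.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode
    FileSystem fs = FileSystem.get(conf);                         // point of entry to HDFS
    try (FSDataInputStream in = fs.open(new Path("/data/input.txt"))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) > 0) {                            // data streamed from DataNodes
        System.out.write(buf, 0, n);
      }
    }
  }
}
```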
10. AltoScale Scalability Limits
Single-master architecture: a constraining resource
NameNode space limit
100 million files and 200 million blocks fit in 64 GB of RAM
Restricts storage capacity to about 20 PB (roughly 200 million blocks at ~100 MB of data per block)
Small file problem: the block-to-file ratio is shrinking
A single NameNode limits linear performance growth
A handful of clients can saturate the NameNode
MapReduce framework scalability limit: 40,000 clients
Corresponds to a 4,000-node cluster with 10 MapReduce slots per node
"HDFS Scalability: The limits to growth", USENIX ;login:, 2010
11. AltoScale Horizontal to Vertical Scaling
Horizontal scaling is limited by the single-master architecture
Natural growth of compute power and storage density
Clusters composed of more powerful servers
Vertical scaling leads to shrinking cluster sizes
Storage capacity, compute power, and cost remain constant
12. AltoScale Shrinking Clusters
[Chart: resources per node (cores, disks, RAM) versus cluster size (number of nodes)]
2008 Yahoo!: 4000-node cluster
2010 Facebook: 2000 nodes
2011 eBay: 1000 nodes
2013: cluster of 500 nodes
13. AltoScale Scalability for Hadoop 2.0
HDFS Federation
Independent NameNodes sharing a common pool of DataNodes
The cluster is a family of volumes with a shared block storage layer
Users see volumes as isolated file systems
ViewFS: the client-side mount table (see the configuration sketch below)
Yarn: new MapReduce framework
Dynamic partitioning of cluster resources: no fixed slots
Separation of JobTracker functions
1. Job scheduling and resource allocation: centralized
2. Job monitoring and job life-cycle coordination: decentralized
Delegate coordination of different jobs to other nodes
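The ViewFS mount table mentioned above is configured on the client side. The sketch below shows one way it could look; the mount-table name ClusterX, the NameNode hosts, and the paths are assumptions for illustration.

```java
// Sketch of a client-side ViewFS mount table for a federated cluster.
// The mount-table name "ClusterX", NameNode hosts, and paths are placeholders.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "viewfs://ClusterX");
    // Each link maps a client-visible path to a volume owned by one NameNode.
    conf.set("fs.viewfs.mounttable.ClusterX.link./user",
             "hdfs://nn1.example.com:8020/user");
    conf.set("fs.viewfs.mounttable.ClusterX.link./data",
             "hdfs://nn2.example.com:8020/data");
    FileSystem viewFs = FileSystem.get(conf);
    // The client sees one namespace; /user and /data resolve to different volumes.
    System.out.println(viewFs.exists(new Path("/user")));
  }
}
```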
14. AltoScale Namespace Partitioning
Static: Federation
Directory sub-trees are statically assigned to disjoint volumes
Relocating sub-trees without copying is challenging
Scale x10: billions of files
Dynamic:
Files and directory sub-trees can move automatically between nodes based on their utilization or load balancing requirements
Files can be relocated without copying data blocks
Scale x100: hundreds of billions of files
Orthogonal, independent approaches.
Federation of distributed namespaces is possible
15. AltoScale Distributed Namespaces Today
Ceph
Metadata stored on OSDs
MDSs cache metadata: dynamic partitioning
Lustre
Plans to release a distributed namespace (2.4)
Code ready
Colossus from Google: S. Quinlan and J. Dean
100 million files per metadata server
Hundreds of servers
VoldFS, CassandraFS, KTHFS (MySQL)
Prototypes
16. AltoScale HBase Overview
Table: big, sparse, loosely structured
Collection of rows, sorted by row keys
Rows can have an arbitrary number of columns
The table is split horizontally into Regions
Dynamic table partitioning!
Region Servers serve regions to applications
Columns are grouped into column families
Vertical partitioning of tables
Distributed cache: regions are loaded into nodes' RAM
Real-time access to data
17. AltoScale HBase API
HBaseAdmin: administrative functions
Create, delete, list tables
Create, update, delete columns, column families
Split, compact, flush
HTable: access table data
Result HTable.get(Get g) // get cells of a row
void HTable.put(Put p) // update a row
void HTable.put(Put[] p) // batch update of rows
void HTable.delete(Delete d) // delete cells/row
ResultScanner getScanner(family) // scan col family
Coprocessors:
Custom actions triggered by update events
Like database triggers and stored procedures
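A small sketch of the HTable calls listed above, using the 2012-era (HBase 0.92-style) client API; the table name, column family, qualifier, and row key are placeholders.

```java
// Put a row, then read it back, using the HTable API shown on this slide.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HTableExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Namespace");          // placeholder table name
    // Update a row.
    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
    table.put(p);
    // Read the cells of the same row back.
    Result r = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"))));
    table.close();
  }
}
```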
19. AltoScale Giraffa File System
HDFS + HBase = Giraffa
Goal: build from existing building blocks
Minimize changes to existing components
1. Store file & directory metadata in an HBase table
Dynamic table partitioning into regions
Cached in RegionServer RAM for fast access
2. Store file data in HDFS DataNodes: data streaming
3. Block management
Handle communication with DataNodes
Perform block replication
20. AltoScale Giraffa Requirements
More files & more data
Availability
Load balancing of metadata traffic
Same data streaming speed to / from DataNodes
No SPOF
Cluster operability, management
Cost of running larger clusters same as smaller ones
                    HDFS          Federated HDFS    Giraffa
Space               25 PB         120 PB            1 EB = 1000 PB
Files + blocks      200 million   1 billion         100 billion
Concurrent clients  40,000        100,000           1 million
21. AltoScale FAQ: Why HDFS and HBase?
Building a new FS from scratch is really hard and takes years
HDFS: reliable, scalable block storage
Efficient data streaming
Automatic data recovery
HBase: a natural metadata service
Distributed cache …
Dynamic partitioning
Automatic metadata recovery
Same breed, should be "compatible"
HBase stores data in HDFS: same storage for data and metadata
22. AltoScale FAQ: Why not store whole files in HBase tables?
Defeats the main concept of distributed file systems:
Decoupling of data and metadata
Small files can be stored as rows
Row size is limited by region size
Large files must be split
Technically possible to split any information into rows
Log files: into events
Video files: into frames
Random bits: into 1K blobs with an offset as a row key
But this is a different level of abstraction
Requires data conversion
23. AltoScale FAQ: My Dataset is Only 1 PB. Do I Still Need Giraffa?
Availability
Distributed access to the namespace for many concurrent clients
Not bottlenecked by single-NameNode performance
"Small files"
The block-to-file ratio is decreasing: 2 -> 1.5 -> 1.2
No need to aggregate small files into large archives
24. AltoScale Building Blocks
A single table called "Namespace" stores
File ID (row key) and file attributes:
name, replication, block-size, permissions, times
List of blocks
Block locations
Giraffa client: a FileSystem implementation
Obtains metadata from HBase
Exchanges data with DataNodes
Block manager: maintains the flat namespace of blocks
Block allocation, replication, removal
DataNode management
Storage for the HBase table
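Because the Giraffa client is a FileSystem implementation, an application would use it like any other Hadoop file system. The sketch below only illustrates that idea: the grfa:// URI scheme, host, and paths are hypothetical assumptions, not the project's confirmed interface.

```java
// Hypothetical sketch: the Giraffa client implements Hadoop's FileSystem
// interface, so applications would use it through the standard API.
// The "grfa:///" URI and the path shown here are illustrative assumptions only.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GiraffaClientSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Metadata operations go to the HBase "Namespace" table behind the scenes;
    // file data is exchanged directly with DataNodes.
    FileSystem fs = FileSystem.get(java.net.URI.create("grfa:///"), conf);
    try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
      out.writeUTF("hello giraffa");
    }
    System.out.println(fs.getFileStatus(new Path("/tmp/hello.txt")).getLen());
  }
}
```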
25. AltoScale Giraffa Architecture
[Architecture diagram: the application's NamespaceAgent talks to the HBase Namespace table (path, attrs, block[], DN[][], BM-node) backed by a Block Management Agent; beneath it sits the Block Management Layer of BM servers and the DataNodes (DN)]
1. The Giraffa client gets files and blocks from HBase
2. It may directly query the Block Manager
3. It streams data to or from DataNodes
26. AltoScale Namespace Table
Row keys
Identify files and directories as rows in the table
Different key definitions based on locality requirements
The key definition is chosen when the file system is formatted
Full-path-key is the default
Columns
File attributes:
Local name, owner, group, permissions, access-time, modification-time, block-size, replication, isDir, length
List of blocks of a file
Persisted in the table
List of block locations for each block
Not persisted, but discovered from block reports
A directory table maps each dir-entry name to the corresponding row key
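As an illustration of the layout above, the sketch below writes one file's metadata as a row of the Namespace table using a full-path row key; the column family and qualifier names are assumptions, not the actual Giraffa schema.

```java
// Illustrative sketch of a Namespace-table row keyed by the file's full path.
// The column family and qualifier names are hypothetical.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NamespaceRowSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable namespace = new HTable(conf, "Namespace");
    byte[] rowKey = Bytes.toBytes("/user/plamen/file1");   // full-path-key (the default)
    byte[] cf = Bytes.toBytes("file");                     // hypothetical column family
    Put put = new Put(rowKey);
    put.add(cf, Bytes.toBytes("owner"), Bytes.toBytes("plamen"));
    put.add(cf, Bytes.toBytes("replication"), Bytes.toBytes((short) 3));
    put.add(cf, Bytes.toBytes("blockSize"), Bytes.toBytes(128L * 1024 * 1024));
    put.add(cf, Bytes.toBytes("isDir"), Bytes.toBytes(false));
    // The file's block list would be persisted here too; block locations are
    // not persisted, but rediscovered from DataNode block reports.
    namespace.put(put);
    namespace.close();
  }
}
```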
28. AltoScale Block Management
Block Manager
Block allocation, deletion, replication
DataNode Manager
Processes DataNode block reports and heartbeats; identifies lost nodes
Provides storage for the HBase table
A small file system to store HFiles
BMServer is paired on the same node with a RegionServer
A distributed cluster of BMServers
Mostly local communication between Region and BM servers
The NameNode is an initial implementation of BMServer
A Giraffa block is a single-block file whose name is the block id
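To summarize the responsibilities above, here is a hypothetical interface sketch; the method names and types are illustrative assumptions, not the actual Giraffa or NameNode API.

```java
// Hypothetical sketch of the block-management duties listed above
// (allocation, deletion, replication, DataNode tracking).
import java.io.IOException;
import java.util.List;

interface BlockManagerSketch {
  /** Allocate a new block for the file identified by its row key. */
  LocatedBlockInfo allocateBlock(String fileRowKey) throws IOException;

  /** Release the blocks of a deleted file. */
  void deleteBlocks(List<Long> blockIds) throws IOException;

  /** Process a DataNode block report and schedule re-replication
   *  of any under-replicated blocks. */
  void processBlockReport(String dataNodeId, List<Long> reportedBlockIds);

  /** Mark a DataNode lost after missed heartbeats. */
  void markDataNodeLost(String dataNodeId);
}

/** Minimal holder for a block id and the DataNodes that store it (illustrative). */
class LocatedBlockInfo {
  final long blockId;
  final List<String> dataNodes;
  LocatedBlockInfo(long blockId, List<String> dataNodes) {
    this.blockId = blockId;
    this.dataNodes = dataNodes;
  }
}
```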
29. AltoScale Three Problems
Bootstrapping
HBase stores tables as files in HDFS
Namespace Partitioning
Retain locality
Atomic Renames
31. AltoScale Locality of Reference
Row keys
Define the sorting of files and directories in the table
The tree-structured namespace is flattened into a linear array
The ordered list of files is self-partitioned into regions
Retain locality in the linearized structure
Files in the same directory are adjacent in the table
They belong to the same region, with some exclusions
Files of the same directory should be on the same node
Avoids jumping across regions for a simple "ls"
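A toy illustration of this ordering property, using a sorted map to stand in for the sorted table; the paths are made-up examples.

```java
// With full-path row keys, lexicographic sorting keeps files of the same
// directory adjacent, so a simple "ls" becomes a contiguous key-range scan.
import java.util.SortedMap;
import java.util.TreeMap;

public class FullPathKeyOrdering {
  public static void main(String[] args) {
    SortedMap<String, String> rows = new TreeMap<>();  // stands in for the sorted table
    rows.put("/user/a/file1", "...");
    rows.put("/user/a/file2", "...");
    rows.put("/user/b/file1", "...");
    rows.put("/var/log/syslog", "...");
    // "ls /user/a" scans the range ["/user/a/", "/user/a0"); '0' is the
    // character right after '/', so it upper-bounds all "/user/a/..." keys.
    for (String key : rows.subMap("/user/a/", "/user/a0").keySet()) {
      System.out.println(key);
    }
  }
}
```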
32. AltoScale Partitioning Example 1
Straightforward partitioning based on random hashing
[Diagram: an example directory tree whose entries are distributed across regions T1–T4 by hashed ids]
33. AltoScale Partitioning Example 2
Partitioning based on lexicographic full-path ordering
The default
[Diagram: the same directory tree distributed across regions T1–T4 in full-path order]
34. AltoScale Partitioning Example 3
Partitioning based on fixed-depth neighborhoods
[Diagram: the same directory tree distributed across regions T1–T4 by fixed-depth neighborhoods]
35. AltoScale Atomic Rename
Giraffa will implement atomic in-place rename
No support for an atomic file move from one directory to another
A move can then be implemented at the application level (see the sketch below)
Non-atomically move the target file from the source directory to a temporary file in the target directory
Atomically rename the temporary file to its original name
On failure, use a simple 3-step recovery procedure
Eventually implement atomic moves
PAXOS
Simplified synchronization algorithms
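A sketch of that application-level move, assuming only atomic in-place rename is available; FileUtil.copy stands in here for the non-atomic move step, and the failure-recovery procedure is omitted.

```java
// Application-level move for a file system that only guarantees an atomic
// in-place rename within a directory.
import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ApplicationLevelMove {
  public static boolean moveFile(FileSystem fs, Path src, Path dstDir)
      throws IOException {
    Path tmp = new Path(dstDir, src.getName() + ".tmp-" + UUID.randomUUID());
    // Step 1: non-atomically move the file into the target directory under a
    // temporary name (copy, then delete the source).
    FileUtil.copy(fs, src, fs, tmp, true /* deleteSource */, fs.getConf());
    // Step 2: atomically rename the temporary file to its original name
    // within the target directory.
    return fs.rename(tmp, new Path(dstDir, src.getName()));
  }
}
```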
36. AltoScale History
(2008) Idea. Study of distributed systems
AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
Partitioning of the namespace: 4 types of partitioning
(2009) Study of scalability limits
NameNode optimization
(2010) Design with Michael Stack
Presentation at the HDFS contributors meeting
(2011) Plamen implements the POC
(2012) Rewrite open sourced as an Apache Extras project
http://code.google.com/a/apache-extras.org/p/giraffa/
37. AltoScale Status
Design stage
One-node cluster running
Live demo with Plamen